LDC requires two pieces of documentation with every corpus submission: a README to be included with the published corpus, and a list of checksums that lets LDC verify the submission was received in its entirety and without corruption. Optional documentation may include annotation guidelines, descriptions of data collection methodology, and metadata tables associated with the corpus. We welcome conference or journal papers written about the resource as part of the documentation, but they are not an acceptable substitute for a README.
LDC’s preferred format for a README is UTF-8 encoded plain text; however, PDF and Microsoft Word formats are also accepted. Please note that Word documents will be converted to PDF for publication. In broad terms, the goal of the README should be to answer the following three questions about the corpus:
1) What is it?
2) Who can use it?
3) How can it be used?
The answers to these questions should be aimed at a generic user who does not have the specialized knowledge of the corpus developer/author.
In addition to a general overview of the corpus, the README must cover the points below. The necessary information may be presented in brief bulleted sections and/or included in a larger narrative document at the provider’s discretion. Where this information is not known, the README should say so. Additional documents explaining any of the categories below in further detail are acceptable to augment the README.
Corpus titles should be descriptive and inform potential users about the nature of the publication. The title should encapsulate such aspects as the data type, the project name where appropriate, and the language(s). To avoid redundancy, the words "corpus" and "data" should not be used. The title should also indicate the version of the corpus, where appropriate. If the publication is part of a series, that should also be noted. LDC may propose a title change to increase understandability or to conform to naming standards.
Authors must be listed by providers. LDC leaves it to providers to decide who should receive author credit and in what order. As a general rule, LDC considers those contributors who assume a sustained decision-making role in corpus design, structure or content to be authors.
Languages should be listed, with an approximate breakdown by percentage if possible. LDC uses the ISO 639-3 standard for language names and abbreviations; this is the standard for published metadata in the LDC Catalog. Languages exist in many local varieties and in larger clusters of closely related languages called macrolanguages. Providers should use the most specific designation available for any given language or dialect. If a language is not contained in the ISO 639-3 standard, or only an imperfect mapping exists, this should be indicated.
Recommended/Expected use of corpus
Listing the research and technical applications for a corpus helps users identify useful and relevant corpora in the LDC Catalog.
Collection Procedure - format, method, and timespan
This includes information about the nature of the source data, the source name and when the material was produced and/or collected. For data collected from human subjects, providers should also include any demographic information and related metadata to enhance the usability of the resource.
Directory Structure & File Format Specific Details
A general overview of the corpus's directory structure should be included, along with information about each file type, covering both its function in the corpus and its technical details. For every file type, this should include the format. For text specifically, useful information includes the encoding and script. For audio and video data, this includes the number of speakers (total and unique), the number of hours, the percentage and level of any transcription, and participant demographic information. For any included tools, installation and usage instructions should be provided.
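When drafting the file-type overview, a quick inventory of the corpus by file extension can be a useful starting point. The following is a minimal sketch, assuming a POSIX shell with find, awk, and sort; the function name count_by_extension is illustrative and not an LDC tool, and files without an extension (or dotfiles) are grouped under "(none)".

```shell
# count_by_extension ROOT
# Print one line per file extension found under ROOT, as "extension<TAB>count".
# Hypothetical helper for drafting a README; not part of any LDC tooling.
count_by_extension() {
    find "$1" -type f | awk -F/ '
        {
            name = $NF                          # file name without directories
            # Treat the text after the last dot as the extension, unless the
            # dot is the first character (dotfile) or there is no dot at all.
            if (match(name, /\.[^.]+$/) && RSTART > 1)
                ext = substr(name, RSTART + 1)
            else
                ext = "(none)"
            count[ext]++
        }
        END { for (e in count) printf "%s\t%d\n", e, count[e] }
    ' | sort
}
```

Calling count_by_extension root, where root is the corpus root directory, lists each extension with its file count. The result only reflects file names, so the actual format, encoding, and script of each file type still need to be described by hand.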
LDC’s preferred form for checksums is MD5; however, SHA-1 or other formats are accepted. On Linux-based systems, these can be generated with the following command, where “root” is the root directory of one’s corpus:
find root -type f -print0 | xargs -0 md5sum > md5s.txt
On macOS, the command is similar:
find root -type f -print0 | xargs -0 md5 > md5s.txt
On Windows, a third-party application such as Hash My Files can be used, or the following PowerShell command will generate checksums:
Get-ChildItem -Path "root" -Recurse -File | Get-FileHash -Algorithm MD5 | Format-List Hash, Path > md5s.txt
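Before submitting, providers may wish to confirm that the checksum list is complete and still matches the files on disk. The following is a minimal sketch, assuming GNU coreutils md5sum, a list produced by a find/md5sum command like the Linux one above, and that it is run from the same directory in which the list was generated. The function name verify_checksums is illustrative and not an LDC tool.

```shell
# verify_checksums ROOT LIST
# Re-check every checksum in LIST, then confirm that LIST has one entry per
# file under ROOT. Hypothetical helper; assumes GNU md5sum and file names
# without embedded newlines.
verify_checksums() {
    root=$1
    list=$2

    # --check re-computes each checksum; --quiet prints only the failures.
    md5sum --check --quiet "$list" || return 1

    # A file that was added after the list was generated would not be caught
    # by --check, so compare the number of files with the number of entries.
    n_files=$(find "$root" -type f | wc -l)
    n_sums=$(wc -l < "$list")
    if [ "$n_files" -ne "$n_sums" ]; then
        echo "file count mismatch: $n_files on disk, $n_sums in $list" >&2
        return 1
    fi

    echo "OK: $n_sums files verified"
}
```

For example, verify_checksums root md5s.txt returns a non-zero status if any file is missing, altered, or absent from the list.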