Submitting Corpora and Other Resources to the LDC
Note that we are experiencing ongoing but isolated problems with the Publication Inquiry form. Please contact us if you think that your submission was not received.
The Linguistic Data Consortium (LDC) supports language-related education, research and technology development by creating and sharing language resources. These resources include data, tools and standards. As part of its mission, LDC ensures that its published resources reach a broad spectrum of users – students, scholars, researchers, developers – in academic, governmental and private organizations. These communities require access to data across languages, genres and formats. Thus LDC continually diversifies its Catalog of language resources. Towards this end, LDC invites contributions from outside authors.
Resources distributed by LDC reach a global audience. All published resources appear in LDC’s online Catalog, which is accessed daily by users worldwide. LDC’s monthly newsletter keeps the community abreast of all new publications, and its reach ensures the attention of interested researchers. LDC members receive copies of the corpora as part of their membership benefits. LDC’s Membership structure therefore guarantees your data greater exposure to major organizations working in human language technologies (HLT) and related fields.
The LDC Corpus Catalog contains a variety of resources in many languages and formats ranging from written to spoken and video. Speech and video data may derive from broadcast collections, interviews, and recordings of telephone conversations. Text data comes from a variety of sources including newswire, document archives and anthologies as well as the World Wide Web. LDC also publishes dictionaries and lexicons in a variety of languages. While LDC develops many of the resources it offers in its catalog, almost half of its offerings come from external contributors. For that reason, LDC welcomes outside contributions of all types of linguistic resources. In order to coordinate delivery of data and provide for review, LDC asks corpus authors to submit an inquiry prior to submitting data. Authors and developers wishing to contribute resources to LDC should follow these publication guidelines.
Since publishing linguistic resources entails a significant commitment of time and effort, LDC asks that all authors submit inquiries that describe the proposed publication. All inquiries should use the LDC Publication Inquiry form and should include the following information:
- Primary contact person
- Email address
- Phone number
- Title of publication including version number when applicable
- A list of language and dialect names used in the corpus. When possible, use the Language names and ids in the ISO 639-3 standard.
- Data Type (speech, text, lexicon, video or other).
- Estimated delivery date
- Size of publication, a rough estimate of the publication size in bytes, hours of speech, words of text or tokens as appropriate. At a minimum, LDC will need to know the data size in bytes.
- Information about the format of the publication. This should describe the audio formats, markup schemes and/or video formats in the submission.
- Origin and genre of data used in the publication as well as collection methodologies.
- A brief narrative description or abstract detailing the nature of the publication, its applications and other relevant information.
- A representative sample of the data.
Publication inquiries to LDC are made through the Publication Inquiry form.
Once a publication proposal has been accepted, LDC staff will work with providers to facilitate the delivery of their data. The LDC performs extensive quality assurance on all publications to ensure published data is complete, error free and ready to use. The following policies outline the LDC's expectations regarding delivered data.
All publications should be "camera ready" in terms of quality. LDC staff will perform basic quality control tests on all delivered data to check file integrity and audio/video quality, validate XML and SGML markup, and verify text encoding. The specific standards are detailed below. LDC strongly recommends that providers use open data standards where possible, as their use will ensure the compatibility, accessibility and longevity of their works.
Corpus authors should let LDC staff know if there are any specific deadlines, dates or events that affect the publication schedule for their submissions. LDC will make reasonable efforts to coordinate publication of submissions with such dates, but will require at least sixty days lead time between delivery and publication. Submissions should be delivered in an orderly directory structure. The following is suggested for most forms of data:
When delivering data, authors must ensure that all files are readable to all users. Providers should use only ASCII characters for filenames and should avoid punctuation marks other than underscores, "_", and periods, ".". Meaningful filename extensions, such as .txt for text or .xml for XML files, should also be used. Additionally, authors should strive to limit the directory depth of their submissions to seven, since the ISO 9660 file system for CDs and DVDs limits the depth of directories to that number. Complex publications may be packaged as gzip-compressed tar (tgz) files. Again, due to the limitations of optical media, LDC requests that providers keep their file sizes under 2 gigabytes and restrict their path lengths to no more than 1023 characters (assuming 1-byte characters in ASCII). More detailed filename recommendations may be found further down in this document. LDC also recommends that providers limit the number of files in any directory to 5000 or fewer. Very large numbers of files in directories can affect both directory and file access in many systems.
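The limits above can be checked before delivery with a short script. The following is only a minimal sketch in Python, not an LDC tool: the function name `check_delivery_tree` is hypothetical, depth is counted as the number of directory levels below the delivery root (an assumption about how the limit of seven is measured), and "2 gigabytes" is read as 2 × 1024³ bytes.

```python
import os

MAX_DEPTH = 7                    # directory levels below the delivery root (assumed counting)
MAX_PATH_CHARS = 1023            # path length, assuming 1-byte ASCII characters
MAX_FILE_BYTES = 2 * 1024 ** 3   # "2 gigabytes", read here as 2 GiB (an assumption)
MAX_FILES_PER_DIR = 5000


def check_delivery_tree(root):
    """Return a list of human-readable problems found under root."""
    problems = []
    root = os.path.abspath(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == "." else rel_dir.count(os.sep) + 1
        if depth > MAX_DEPTH:
            problems.append(f"{rel_dir}: directory depth {depth} exceeds {MAX_DEPTH}")
        if len(filenames) > MAX_FILES_PER_DIR:
            problems.append(
                f"{rel_dir}: {len(filenames)} files exceeds {MAX_FILES_PER_DIR}"
            )
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if len(rel) > MAX_PATH_CHARS:
                problems.append(f"{rel}: path longer than {MAX_PATH_CHARS} characters")
            if not rel.isascii():
                problems.append(f"{rel}: path contains non-ASCII characters")
            if os.path.getsize(full) >= MAX_FILE_BYTES:
                problems.append(f"{rel}: file is not under 2 gigabytes")
    return problems
```

Calling `check_delivery_tree` on the top-level directory of a submission returns an empty list when none of these limits is exceeded. It does not check file permissions or filename extensions.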
LDC accepts deliveries in many formats and media. For small publications, download or upload is the preferred method although CDs or DVDs are also acceptable. Larger releases may be delivered on CD or DVD. If a delivery exceeds six DVDs in length, authors should consider delivery on a hard drive or Blu-ray disc.
LDC generally accepts audio data in three formats: NIST Sphere, MS WAV and MP3. Other chunk-based formats related to the Resource Interchange File Format (RIFF), such as Apple's AIFF, may also be acceptable. Providers should be aware that MP3 is a lossy audio format which may not preserve speech data adequately for all research purposes. Authors must describe their audio data in detail. At a minimum, LDC will need to know the following:
- Sample rate (44,100 Hz, 44,056 Hz, 22,050 Hz, 11,025 Hz, 8000 Hz)
- Sample format (Linear PCM, a-law, u-law)
- Sample size (8 bit, 16 bit)
- Compression type (flac, shorten)
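For uncompressed linear PCM WAV files, most of these values can be read directly from the file header. The sketch below uses Python's standard `wave` module; the function name `describe_wav` is hypothetical, and the module does not read NIST Sphere, MP3, a-law or u-law data, so other formats need other tools.

```python
import wave


def describe_wav(path):
    """Read basic header information from an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as w:
        frame_rate = w.getframerate()
        return {
            "sample_rate_hz": frame_rate,
            "sample_size_bits": w.getsampwidth() * 8,  # sample width is reported in bytes
            "channels": w.getnchannels(),
            "duration_seconds": w.getnframes() / frame_rate,
        }
```

Summing `duration_seconds` over all files gives the hours of speech requested in the publication inquiry.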
In addition to this basic descriptive information, providers should describe their recording environment and methodologies. These descriptions include information such as:
- Microphones used
- Recording setup (computer type, sound card, tape)
- Setting (meeting, field, studio)
- Circuit information for telephone collections
  - Cellular or line
  - DS0, T1, VoIP
  - PBX information
LDC accepts video data. However, as video is more complex than audio and requires more effort to publish, providers must furnish information about both the video file container and the underlying codec. The video container specifies how the data is stored, while the codec provides the means to decompress the video for replay. LDC accepts any of the following video containers:
- mpeg-ps (vob)
In some cases, Adobe Flash Video may be acceptable as well. Documents which embed video, such as manuals, will often use Adobe Flash Video. The data in a file container may use any one of a variety of codecs. Acceptable data codecs are:
For best results, video should have a 1.15 Mbps Constant Bit Rate (CBR) target bit rate. Frame rate and frame size may vary according to recording conditions. Broadcast video data must conform to the National Television System Committee (NTSC) or Phase Alternating Line (PAL) standard. These standards determine frame rate, frame sizes and other aspects of video images. NTSC is the most commonly used standard in North America and is the preferred standard for data delivery.
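A constant bit rate makes it easy to estimate video size from duration, which helps when reporting the size of a publication. The following is a rough sketch only: the function name is hypothetical, 1.15 Mbps is taken to mean 1,150,000 bits per second, and audio tracks and container overhead are ignored.

```python
def estimated_video_bytes(duration_seconds, bits_per_second=1_150_000):
    """Approximate size in bytes of a constant-bit-rate video stream."""
    # bits = rate x duration; divide by 8 to convert bits to bytes
    return duration_seconds * bits_per_second // 8


one_hour = estimated_video_bytes(3600)
print(one_hour)  # 517500000 bytes, a little over half a gigabyte per hour
```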
Text data comes in many forms and addresses many different research needs. Text may be as simple as unstructured documents or as complex as treebanks and propbanks. A text-based publication can have varying degrees of structure depending on the research task it addresses. Plain text submissions must follow these rules:
- Text must be encoded as Unicode UTF-8 or UTF-16 whenever possible. ASCII, which is a subset of both UTF-8 and Latin-1, will suffice for most English text collections. If possible, text should be normalized to one of the Unicode canonical normalization forms. Providers should avoid nonstandard, proprietary fonts and character encodings.
- End of line characters must be consistent throughout the corpus. Unix style new line characters are preferred. Form feed characters should not be used. The last lines in all files must be terminated with an end of line.
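These rules lend themselves to an automated check. The sketch below is one possible approach, not LDC's own validation procedure: the function name `check_text_file` is hypothetical, it handles UTF-8 only (not UTF-16), it assumes NFC as the canonical normalization form unless told otherwise, and it checks line endings within a single file rather than across the whole corpus.

```python
import unicodedata


def check_text_file(path, form="NFC"):
    """Return a list of problems found in one UTF-8 text file."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]

    problems = []
    if unicodedata.normalize(form, text) != text:
        problems.append(f"text is not in normalization form {form}")
    if "\f" in text:
        problems.append("contains form feed characters")

    # Count each end-of-line style separately so that mixtures can be reported.
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf
    cr = text.count("\r") - crlf
    styles = [name for name, n in (("CRLF", crlf), ("LF", lf), ("CR", cr)) if n]
    if len(styles) > 1:
        problems.append("mixed end-of-line styles: " + ", ".join(styles))

    if text and not text.endswith(("\n", "\r")):
        problems.append("last line is not terminated with an end of line")
    return problems
```

A file that passes returns an empty list. To enforce consistency across a corpus, the end-of-line style found in each file would also need to be compared between files.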
Markup languages like XML or SGML can structure text data and annotations. XML is the preferred format, but SGML is also acceptable. Markup languages should only provide structure to documents; they should not determine appearance or limit methods for data processing. For that reason, HTML will only be accepted in works that address the processing and parsing of web pages. As with plain text, XML and SGML formatted data should use Unicode encoding and normalization forms. Both XML and SGML data will require a means of quality control to ensure well-formed and valid documents. Document Type Definitions (DTDs) will serve for both XML and SGML. LDC also accepts W3C XML Schema and RELAX NG schemas.
Established vocabularies exist within the larger XML language. These widely understood standard specifications often prove helpful for structuring data and ensuring readability among a wide variety of users. Examples of these vocabularies include the News Industry Text Format (NITF), NiteXML, AGTK, TimeML and Text Encoding Initiative (TEI). When such XML vocabularies are used, providers must include the specification or a link to that specification. When authors develop their own vocabulary for structuring their annotations or other work, they should also provide a complete description as part of the corpus documentation.
All delivered SGML and XML will be tested against its DTD or schema to verify that it is well formed, complete and valid. LDC will also verify the encoding of all text submissions.
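Providers can catch the most basic markup errors themselves before delivery. The sketch below checks only that an XML file is well formed, using Python's standard `xml.etree.ElementTree` parser; the function name `is_well_formed` is hypothetical. It does not validate against a DTD or schema, which requires a separate validating parser, and it does not handle SGML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(path):
    """Return (True, "") if the XML file parses, else (False, error message)."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""
```

The error message returned for a malformed file includes the line and column at which parsing failed.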
While LDC imposes few restrictions on filenames and directory structures, there is a set of best practices that should be followed. These practices will ensure that all files in a corpus can be distinguished and accessed regardless of host operating system. Providers should, in general, attempt to limit their directory depth to seven levels. LDC delivers most finished publications on CD or DVD. The ISO 9660 file system used on most CDs and DVDs imposes a total depth limit of seven levels. Publications with more complex hierarchies are still permissible, but will be written to disc as a compressed tar file. Providers should give some thought to how to hide the complexity of their publications in this event. Additionally, the ISO file system imposes a limit of 2 GB on the size of individual files. Providers should avoid files larger than this limit. The ISO file system also imposes a hard limit of 1023 characters on pathnames including filenames. Filenames may not exceed 180 characters.
LDC observes several file naming conventions to ensure interoperability and to minimize access problems. When naming data, documentation and software files, providers should follow these guidelines:
- Filenames may not begin with a period or any other punctuation symbol. These often have significance as quantifiers and parameters to commands in various operating systems. Beginning a file with such a symbol can cause an operating system to ignore the file or treat it as a hidden file.
- Filenames may not contain spaces. Spaces in filenames cause problems for Unix/Linux systems and automated processing scripts in many environments. Substitute an underscore, “_”, for each space.
- Filenames should contain only ASCII characters.
- Unaccented letters (A-Z, a-z)
- Digits (0-9)
- Use only the punctuation marks period, underscore, plus sign, tilde and hyphen (._+~-). Other punctuation marks may be interpreted as control characters in some operating systems.
- Do not use the punctuation marks /, \, ?, *, :, |, ", < and >. These characters are often interpreted by operating systems as commands or parameters to commands.
- Filenames should have meaningful extensions such as “.txt” for text files, “.dtd” for DTD files and so on. Providers should not assign meanings to extensions that are at odds with their commonly understood meanings. For instance, a file with a '.wav' extension is commonly understood to be a Microsoft Wave file. Similarly, '.tgz' is commonly understood to be a Unix tar file that has been compressed with GNU zip.
- Providers should avoid using reserved words as file names. Such filenames will be invisible or can display unpredictable behavior on any operating system that reserves them. The following is an incomplete list of common reserved words:
- AUX (aux)
- CON (con)
- COM1 (com1)
- COM2 (com2)
- COM3 (com3)
- COM4 (com4)
- LPT1 (lpt1)
- LPT2 (lpt2)
- LPT3 (lpt3)
- NUL (nul)
- PRN (prn)
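The naming conventions above can be expressed as a small checker. This is an illustrative sketch rather than an LDC utility: the function name `filename_problems` is hypothetical, the reserved-word set contains only the incomplete list given above, and the check treats a reserved word followed by an extension (for example "AUX.txt") as reserved too, which is an assumption based on how some operating systems handle such names.

```python
import re

MAX_FILENAME_CHARS = 180
# Incomplete list of reserved words, taken from the guidelines above.
RESERVED = {
    "aux", "con", "com1", "com2", "com3", "com4",
    "lpt1", "lpt2", "lpt3", "nul", "prn",
}


def filename_problems(name):
    """Return a list of naming-convention problems for one filename."""
    problems = []
    if len(name) > MAX_FILENAME_CHARS:
        problems.append(f"longer than {MAX_FILENAME_CHARS} characters")
    if not re.match(r"[A-Za-z0-9]", name):
        problems.append("must begin with an unaccented letter or a digit")
    if " " in name:
        problems.append("contains spaces; substitute underscores")
    # Anything other than letters, digits and . _ + ~ - is disallowed
    # (spaces are reported separately above).
    bad = sorted(set(re.findall(r"[^A-Za-z0-9._+~\- ]", name)))
    if bad:
        problems.append("disallowed characters: " + " ".join(bad))
    stem = name.split(".", 1)[0].lower()
    if stem in RESERVED:
        problems.append(f"'{stem}' is a reserved word")
    return problems
```

For example, `filename_problems("my file?.txt")` reports both the space and the question mark, while a name such as `sw02001_A.wav` returns an empty list.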
In order to describe a publication in the LDC's catalog, ensure usability and promote use of published resources, providers must deliver complete documentation with their data. The following guidelines represent LDC's own best practices and will help outside providers develop robust documentation and detailed metadata.
Corpus titles should be descriptive and should inform potential users about the nature of the publication. The title should encapsulate the data type, the project name where appropriate, and the language(s). The words "corpus" and "data" should not be used in most titles since they would be redundant. The name should indicate the version of the corpus. If the publication is part of a series, then that should also be noted. Examples include:
- Chinese Proposition Bank 2.0
- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
- NPS Internet Chatroom Conversations, Release 1.0
Authors must be listed by providers. LDC leaves it to the providers to decide just who should receive author credits on their work, as they are in the best position to make the determination. As a general rule, LDC considers those contributors who assume a sustained decision-making role in corpus design, structure or content to be authors.
The LDC sorts corpora into categories based on type of data. Providers must identify the primary type of data contained in the corpus.
- Speech: Any corpus that primarily contains speech or other audio data including meeting speech, conversational telephone speech, studio recordings, field recordings and so on. Non-text data derived from speech also falls into this category. An example would be a corpus of Mel Frequency Cepstrum Coefficients. The LDC typically splits the transcripts and annotations into a separate text release.
- Text: Any corpus that primarily contains text data. This may include raw transcripts, annotated transcripts, monolingual text, parallel text, treebanks, propbanks and n-gram models. Text with any sort of annotation falls into this category. Text may be structured with a markup language such as XML or SGML, or it may be delivered raw.
- Lexicon: Lexicons include traditional dictionaries, pronouncing lexicons, translation lexicons, named entity lists, glossaries, concordances and other forms which detail vocabulary, grammar and/or syntax.
- Video: Any corpus that has a significant visual component falls into the video category.
The original source of the raw data must be clearly indicated. LDC must know the name of the source, the dates the material was produced and/or collected and whether the authors have reached any previous agreement with the source provider. Authors should indicate when human subjects were used and provide demographic information as appropriate. Such information can include gender, age, nation of origin, native language or educational level. This information is often invaluable for research into spoken language.
Listing the research and technical applications for a corpus helps users identify useful and relevant corpora in the LDC catalog. Whenever possible, authors should provide information about the potential uses of their data.
LDC uses the ISO 639-3 standard for language names and abbreviations. This ISO standard aims to cover all known natural languages including dialects and builds on the previous 639-2 standard. The vocabulary also includes many ancient languages and constructed languages. Providers who cannot find their language(s) in the vocabulary will need to provide a brief description of the language including its geographic distribution and alternate names. Languages exist in many locally varying varieties and in larger clusters called macrolanguages that group closely related languages. Providers should use the most specific designation available for any given language or dialect. For instance, a corpus of Spoken Arabic from the Gulf States should be listed as Gulf Arabic [afb]. In cases of finer dialect distinction, providers should describe the dialect in the narrative description that accompanies the corpus. When a corpus represents a specific vernacular, the providers must describe the cultural context in which this dialect exists. Likewise, some varieties are specific to a certain period of time. When this is the case, corpus providers should describe the time periods in which the dialect was used and give a brief history.
A good narrative description describes the corpus in a qualitative way. The descriptions are fairly short, typically no more than three paragraphs. Some corpora which break new ground or present novel data may need lengthier descriptions. Lexicons and dictionaries in particular will require a great deal of description. The first paragraph acts as a synopsis and summarizes the corpus. A good synopsis will address these questions:
- What it is: Provide a brief description of the type of publication including the language(s), tasks and data
- Who can use it: Briefly describe who might be interested in the data.
- How it can be used: Describe the tasks and applications for which the corpus is suitable.
Language sketches are helpful in putting the corpus in context. If needed, language sketches should follow the synopsis. Any of the so-called Less Commonly Taught Languages will need a fairly detailed description to help put the corpus in context. Dictionaries and lexicons of all types will likewise require a detailed description. A good description will describe the region in which the language is spoken, who speaks the language and what distinguishes them, and summarize the important distinguishing features of the language. Potential users of the corpus may also want to know the salient features of the language's literature and/or history.
Some corpora are directed at a specific research or development task and need some explanation of that. In general, language corpora are developed to address some extant problem. What is that problem, or, phrased in another way, what is the application? How will this corpus be used?
- What is the task?
- What is the purpose of the task?
- What are the components of the task? What are the components of the corpus?
Lastly, providers will need to describe their data in detail. When the provider collects audio or video data, the recording conditions must be described in as much detail as possible. Of particular importance are the equipment and methods used in recording and digitization. Specifically, providers must provide the sample rate, sample format, number of channels, video codecs, frame rate, frame size and file container as appropriate. Beyond this, providers should describe the setting in which the recordings were made. For text, providers should describe the encoding and normalization form, markup and particular features of the corpus’ text format. Transcription and annotation methodologies must also be described. The final piece of information needed in the data description is the span of time encompassed by the data collection. Providers developing work from broadcast, newswire and other media must specify the dates on which the original material was first released (broadcast, published or posted). For those providers collecting their own data, specifying the start and end dates of their work should be sufficient.