Publication Inquiries & Delivering Data
Publication inquiries should be made using the LDC submissions form. Include the following information:
- Primary contact person (name, email address, phone number)
- Title of publication including version number if applicable.
- A list of language and dialect names used in the corpus. When possible, use the language names and ids in the ISO 639-3 standard.
- Data Type (speech, text, lexicon, video or other).
- Estimated delivery date.
- Size: a rough estimate of the publication size in bytes, hours of speech, words of text or tokens as appropriate. At a minimum, provide the data size in bytes.
- Format: this should describe the audio formats, markup schemes and/or video formats in the submission.
- Origin and genre of data used in the publication as well as collection methodologies.
- A brief narrative description or abstract detailing the nature of the publication, its applications and other relevant information.
- A representative sample of the data.
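The size estimate requested above can be produced with a few lines of Python. This is only a minimal sketch: the function name `corpus_size` is hypothetical, and it assumes the corpus sits in a single local directory tree.

```python
import os

def corpus_size(root):
    """Return (total_bytes, file_count) for all regular files under root."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so that linked data is not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total_bytes += os.path.getsize(path)
                file_count += 1
    return total_bytes, file_count
```

Calling `corpus_size("my_corpus")` (a placeholder directory name) returns the byte total and the number of files, which can be quoted directly in the inquiry.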
Once a publication proposal has been accepted, LDC staff works with providers to facilitate delivery of the data. LDC performs extensive quality assurance on all publications to ensure published data is complete, error free and ready to use. The following outlines expectations regarding delivered data.
General Guidelines
All publications should be "camera ready" in terms of quality. LDC staff will perform basic quality control tests on all delivered data to ensure file integrity and audio/video quality, to validate XML and SGML markup and to verify text encoding. Specific standards are detailed below. LDC strongly recommends that providers use open data standards where possible to ensure the compatibility, accessibility and longevity of their works.
Corpus authors should inform LDC of any specific deadlines, dates or events that affect the publication schedule for their submissions. LDC will make reasonable efforts to coordinate publication accordingly, but requires at least 60 days lead time between delivery and publication.
Submissions should be delivered in an orderly directory structure. The following is suggested for most forms of data:
Delivered data should include files that are readable by all users. Providers should use only ASCII characters for filenames and should avoid punctuation marks other than underscores ("_") and periods ("."). Meaningful filename extensions, such as .txt for text or .xml for XML files, should also be used. Additionally, authors should strive to limit the directory depth of their submissions to seven, since the ISO 9660 file system for CDs and DVDs limits the depth of directories to that number. Complex publications may be packaged as gzip-compressed tar (tgz) files. Again, due to the limitations of optical media, LDC requests that providers keep their file sizes under 2 gigabytes, restrict their path lengths to no more than 1023 characters (assuming 1-byte ASCII characters) and limit the number of files in any directory to 5000 or fewer.
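Providers may find it convenient to check these naming and size limits before delivery. The following is a minimal sketch, not an LDC tool: the function `check_tree` and its constants are hypothetical, depth is counted with the top-level delivery directory as level 0, "2 gigabytes" is read as 2 GiB, and the per-directory limit is applied to files only. All three readings are assumptions.

```python
import os
import re

# Limits taken from the guidelines above; see the stated assumptions.
MAX_DEPTH = 7
MAX_FILE_BYTES = 2 * 1024**3       # assumes "2 gigabytes" means 2 GiB
MAX_PATH_CHARS = 1023
MAX_FILES_PER_DIR = 5000

# ASCII letters, digits, underscores and periods only.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.]+")

def check_tree(root):
    """Return a list of (relative_path, problem) tuples for a delivery tree."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        # The delivery root is depth 0; each nested directory adds one.
        depth = 0 if rel_dir == "." else rel_dir.count(os.sep) + 1
        if depth > MAX_DEPTH:
            problems.append((rel_dir, f"directory depth {depth} exceeds {MAX_DEPTH}"))
        if len(filenames) > MAX_FILES_PER_DIR:
            problems.append((rel_dir, f"{len(filenames)} files in one directory"))
        for name in dirnames + filenames:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            if not SAFE_NAME.fullmatch(name):
                problems.append((rel, "name has characters outside A-Z a-z 0-9 _ ."))
            if len(rel) > MAX_PATH_CHARS:
                problems.append((rel, f"path longer than {MAX_PATH_CHARS} characters"))
        for name in filenames:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            if os.path.splitext(name)[1] == "":
                problems.append((rel, "no filename extension"))
            if os.path.getsize(os.path.join(dirpath, name)) >= MAX_FILE_BYTES:
                problems.append((rel, "file is 2 GiB or larger"))
    return problems
```

An empty list from `check_tree("my_corpus")` (a placeholder path) means none of these particular checks failed; it does not replace LDC's own quality control.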
LDC accepts deliveries in many formats and media. For small publications, download or upload is the preferred method, although CDs or DVDs are also acceptable. Larger releases may be delivered on CD or DVD. If a corpus exceeds six DVDs, consider delivery on a hard drive or Blu-ray disc.
LDC generally accepts audio data in three formats: NIST Sphere, MS WAV and MP3. Related chunk-based formats such as Apple's AIFF, which like the Resource Interchange File Format (RIFF) derives from IFF, may also be acceptable. Providers should be aware that MP3 is a lossy audio format which may not preserve speech data adequately for all research purposes. The following information is required for audio submissions:
- Sample rate (44,100 Hz, 44,056 Hz, 22,050 Hz, 11,025 Hz, 8000 Hz)
- Sample format (Linear PCM, a-law, u-law)
- Sample size (8 bit, 16 bit)
- Compression type (flac, shorten)
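For uncompressed PCM WAV files, most of the information listed above can be read from the file header. The sketch below uses Python's standard `wave` module; the function name `describe_wav` is hypothetical. Note the limits of this approach: the `wave` module handles linear PCM WAV only, so a-law or u-law WAV, NIST Sphere and MP3 files need other tools.

```python
import wave

def describe_wav(path):
    """Return basic header fields for an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "sample_rate_hz": rate,
            "sample_size_bits": w.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": w.getnchannels(),
            "duration_seconds": frames / rate if rate else 0.0,
        }
```

Summing `duration_seconds` over all files also gives the hours-of-speech figure requested in the publication inquiry.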
In addition to this basic descriptive information, providers should describe their recording environment and methodologies. These descriptions include information such as:
- Microphones used
- Recording setup (computer type, sound card, tape)
- Setting (meeting, field, studio)
- Circuit information for telephone collections
  - Cellular or line
  - DS0, T1, VoIP
  - PBX information
LDC accepts video data. Because video is more complex than audio and requires more effort to publish, providers must furnish information about both the video file container and the underlying codec. The video container specifies how the data is stored, while the codec provides the means to decompress the video for replay. LDC accepts any of the following video containers:
- avi
- mpeg-ps (vob)
In some cases, Adobe Flash Video may be acceptable as well. Documents that embed video, such as manuals, will often use Adobe Flash Video. The data in a file container may use any one of a variety of codecs. Acceptable data codecs are:
- mpeg1
For best results, video should be encoded at a constant bit rate (CBR) with a target of 1.15 Mbps. Frame rate and frame size may vary according to recording conditions. Broadcast video data must conform to the National Television System Committee (NTSC) or Phase Alternating Line (PAL) standard. These standards determine frame rate, frame sizes and other aspects of video images. NTSC is the most commonly used standard in North America and is the preferred standard for data delivery.
Text data comes in many forms and addresses many different research needs. Text may be as simple as unstructured documents or as complex as treebanks and propbanks. A text-based publication can have varying degrees of structure depending on the research task it addresses. Plain text submissions must follow these rules:
- Text must be encoded as Unicode UTF-8 or UTF-16 whenever possible. ASCII, which is a subset of both UTF-8 and Latin-1, will suffice for most English text collections. If possible, text should be normalized to one of the Unicode canonical normalization forms (NFC or NFD). Providers should avoid nonstandard, proprietary fonts and character encodings.
- End-of-line characters must be consistent throughout the corpus. Unix-style newline characters (LF) are preferred. Form feed characters should not be used. The last line in every file must be terminated with an end-of-line character.
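The plain text rules above can be checked mechanically. The following is a minimal sketch under stated assumptions: the function `check_text_file` is hypothetical, it handles UTF-8 only (UTF-16 files would need a different decode step), and it defaults to NFC although the guidelines allow either canonical form. `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def check_text_file(path, form="NFC"):
    """Return a list of problems found in one plain text file."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]
    problems = []
    if not unicodedata.is_normalized(form, text):
        problems.append(f"text is not in {form}")
    if "\f" in text:
        problems.append("contains form feed characters")
    # Count each end-of-line style: CRLF, bare CR and bare LF.
    crlf = raw.count(b"\r\n")
    lone_cr = raw.count(b"\r") - crlf
    lone_lf = raw.count(b"\n") - crlf
    if sum(1 for n in (crlf, lone_cr, lone_lf) if n > 0) > 1:
        problems.append("mixed end-of-line styles")
    if raw and not raw.endswith((b"\n", b"\r")):
        problems.append("last line is not terminated")
    return problems
```

This checks consistency within a single file; consistency across the whole corpus additionally requires that every file use the same end-of-line style.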
Markup languages like XML or SGML can structure text data and annotations. XML is the preferred format, but SGML is also acceptable. Markup languages should only provide structure to documents; they should not determine appearance or limit methods for data processing. For that reason, HTML will only be accepted in works that address the processing and parsing of web pages. As with plain text, XML and SGML formatted data should use Unicode encoding and normalization forms. Both XML and SGML data will require a means of quality control to ensure well-formed and valid documents. Document Type Definitions (DTDs) will serve for both XML and SGML. LDC also accepts W3C XML Schema and RELAX NG schemas.
Established vocabularies exist within the larger XML language. These widely understood standard specifications often prove helpful for structuring data and ensuring readability among a wide variety of users. Examples of these vocabularies include the News Industry Text Format (NITF), NiteXML, AGTK, TimeML and Text Encoding Initiative (TEI). When such XML vocabularies are used, providers must include the specification or a link to that specification. When authors develop their own vocabulary for structuring their annotations or other work, they should also provide a complete description as part of the corpus documentation.
All delivered SGML and XML will be tested against its DTD or schema to verify that it is well-formed, complete and valid. LDC will also verify the encoding of all text submissions.
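Providers can run a first check of this kind themselves. The sketch below tests XML well-formedness only, using Python's standard library; the function name `is_well_formed` is hypothetical. It does not validate against a DTD or schema, which requires a validating parser (for example a third-party library such as lxml or a tool such as xmllint), and it does not handle SGML.

```python
import xml.etree.ElementTree as ET

def is_well_formed(path):
    """Return (True, None) if the XML file parses, else (False, message)."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # The message includes the line and column of the first error.
        return False, str(err)
    return True, None
```

A document that passes this check may still be invalid with respect to its DTD or schema, so this is a preliminary test rather than a substitute for validation.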