Once a publication proposal has been accepted, LDC staff works with providers to facilitate delivery of the data. LDC performs extensive quality assurance on all publications to ensure that published data is complete, error-free and ready to use. The following outlines expectations regarding submitted corpora.
All publications should be "camera ready" in terms of quality. LDC staff will perform basic quality control tests on all delivered data to ensure file integrity and data quality, to validate markup and to verify text encoding. Specific standards are detailed below. LDC strongly recommends that providers use open data standards where possible to ensure the compatibility, accessibility and longevity of their works.
Submissions should be delivered in an orderly directory structure. The structure below is suggested for most forms of data, with the root folder named to reflect the title of the corpus. If a different structure is required, that structure and the reasoning behind it should be included when submitting the corpus. LDC may require that the structure be changed for compatibility with Consortium standards and best practices for digital resources.
Providers should use only unaccented ASCII alphanumeric characters for filenames and should avoid punctuation marks other than underscores “_”, dashes “-”, and periods “.”. Spaces should never be present in any filename. Meaningful, commonly used filename extensions, such as .txt for text or .xml for XML files, should also be used.
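The filename rules above can be checked mechanically before delivery. The following is a minimal sketch, not an LDC tool: the function name `filename_problems` and the wording of its messages are hypothetical, and it assumes it is given a bare filename with no directory part.

```python
import re

# Allowed characters: ASCII letters and digits, underscore, dash, period.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def filename_problems(name: str) -> list[str]:
    """Return the guideline violations found in one bare filename (hypothetical helper)."""
    problems = []
    if " " in name:
        problems.append("contains a space")
    if not ALLOWED_NAME.fullmatch(name):
        problems.append("contains characters other than A-Z, a-z, 0-9, '_', '-', '.'")
    # A period in the interior of the name is taken as evidence of an extension.
    if "." not in name.strip("."):
        problems.append("has no filename extension")
    return problems


# Usage: collect the offending names from a list of candidate filenames.
candidates = ["transcript_01.txt", "my file.txt", "café.xml", "README"]
flagged = {name: filename_problems(name) for name in candidates if filename_problems(name)}
```

Whether an extension is "meaningful" cannot be judged by a pattern, so that part of the guideline still needs a human check.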
In general, LDC prefers data in the following formats:
- All text in UTF-8 encoding and markup as XML.
- Audio as 8kHz or 16kHz, 8-bit or 16-bit, FLAC-compressed MS-WAV.
- Video as AVI or MP4.
- Tools and Software in a thoroughly accessible, documented form.
LDC’s preferred text encoding is Unicode UTF-8. ASCII, which is a subset of UTF-8, will also suffice for most English text collections. LDC accepts a wide range of text file formats; below are some guidelines for the two most common, plain text and XML.
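As an illustration of the encoding preference, a provider could classify each text file before delivery. This is a rough sketch under stated assumptions: `classify_encoding` is a hypothetical name, and the three labels it returns are chosen only for this example.

```python
def classify_encoding(data: bytes) -> str:
    """Classify raw file bytes as 'ascii', 'utf-8', or 'invalid' (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    # Pure ASCII bytes are also valid UTF-8, so report the narrower label first.
    return "ascii" if data.isascii() else "utf-8"


# Usage: the same word encoded three ways.
labels = [
    classify_encoding(b"cafe"),
    classify_encoding("café".encode("utf-8")),
    classify_encoding("café".encode("latin-1")),
]
```

A file that decodes as UTF-8 is not guaranteed to have been written as UTF-8, so a result of `'utf-8'` is a necessary check rather than proof of correct text.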
Plain text files should be normalized to one of the Unicode canonical normalization forms. Providers should avoid non-standard, proprietary fonts and character encodings. End-of-line characters must be consistent throughout the corpus. Unix-style newline characters are preferred. Form feed characters should not be used. The last line in every file must be terminated with an end-of-line character.
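The plain text requirements above lend themselves to an automated per-file check. The sketch below is one possible reading of them, not an official validator: `plain_text_problems` and its messages are hypothetical, it treats NFC and NFD as the canonical normalization forms, and it checks end-of-line consistency only within a single file, so corpus-wide consistency would still need a comparison across files.

```python
import unicodedata


def plain_text_problems(data: bytes) -> list[str]:
    """Return guideline violations found in one text file's raw bytes (hypothetical helper)."""
    problems = []
    text = data.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8

    # Canonical normalization forms are NFC (composed) and NFD (decomposed).
    if not (unicodedata.is_normalized("NFC", text) or unicodedata.is_normalized("NFD", text)):
        problems.append("not in a canonical normalization form (NFC or NFD)")

    # Count each end-of-line style separately; CRLF must not be counted twice.
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    if sum(1 for count in (crlf, lone_cr, lone_lf) if count) > 1:
        problems.append("mixed end-of-line styles")

    if b"\x0c" in data:
        problems.append("contains a form feed character")

    if data and not data.endswith((b"\n", b"\r")):
        problems.append("last line is not terminated with an end of line")

    return problems


# Usage: a clean file and a file with mixed line endings.
clean = plain_text_problems(b"one\ntwo\n")
mixed = plain_text_problems(b"one\r\ntwo\n")
```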
XML is LDC’s preferred markup format, and a DTD or schema against which all files validate must be provided. For more information on XML, DTDs and schema, w3schools’ resources are recommended.
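A quick first check on XML files is well-formedness, sketched below with Python's standard library. This is deliberately a weaker test than the one the guideline asks for: `xml.etree.ElementTree` is a non-validating parser, so it does not check files against a DTD or schema, and a validating parser is still needed for that step. The name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML (no DTD or schema validation)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


# Usage: the second document closes <doc> before closing <seg>.
good = is_well_formed(b"<doc><seg id='1'>hello</seg></doc>")
bad = is_well_formed(b"<doc><seg>hello</doc>")
```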
Providers should avoid proprietary formats such as those used in Microsoft Office. Tabular data is more appropriately provided in comma- or tab-separated value text files. Excel files are acceptable when a corpus needs functionality that plain text CSV files cannot provide. Documentation written in Microsoft Word should be converted to PDF, and Word should be avoided for data files.
LDC’s preferred audio data format is FLAC-compressed MS-WAV (RIFF), with a sample rate of 8 or 16kHz and a sample size of 8 or 16 bits. MS-WAV, MP3 and other file formats are also acceptable, as are other audio attributes when appropriate for a specific corpus. Whenever possible, all audio data should be provided in a uniform format with consistent attributes. When audio data is presented in a non-preferred format, for example to preserve an evaluation format, providers should indicate the reason when submitting the corpus.
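The audio attributes named above can be read from a file header and compared across a corpus. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV only, so it assumes the check runs before FLAC compression or on decompressed copies. The helper names `wav_attributes`, `is_preferred`, and `is_uniform` are hypothetical.

```python
import io
import wave

PREFERRED_RATES_HZ = {8000, 16000}
PREFERRED_SAMPLE_BITS = {8, 16}


def wav_attributes(source) -> dict:
    """Read basic attributes from a WAV file path or binary file object."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "sample_bits": wav.getsampwidth() * 8,  # getsampwidth() returns bytes
            "channels": wav.getnchannels(),
        }


def is_preferred(attrs: dict) -> bool:
    """True if the sample rate and sample size match the preferred values."""
    return (
        attrs["sample_rate_hz"] in PREFERRED_RATES_HZ
        and attrs["sample_bits"] in PREFERRED_SAMPLE_BITS
    )


def is_uniform(all_attrs: list[dict]) -> bool:
    """True if every file in the list shares the same attributes."""
    return len({tuple(sorted(attrs.items())) for attrs in all_attrs}) <= 1


# Usage: build a mono, 16 kHz, 16-bit WAV in memory and inspect it.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 160)  # 10 ms of silence
buffer.seek(0)
attrs = wav_attributes(buffer)
```

Files that fail `is_preferred` are not necessarily unacceptable; the guideline allows other attributes when a corpus calls for them, provided the reason is stated at submission.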
LDC’s preferred video data formats are MPEG-4 (mp4) and Audio Video Interleave (AVI). Other video formats such as QuickTime (mov) are also acceptable. All video data should be provided in a uniform format with consistent attributes.
Broadcast video data must conform to ATSC (Advanced Television Systems Committee) or Digital Video Broadcasting - Satellite, First or Second Generation (DVB-S/S2) standards. The frame rate, frame size, and aspect ratio are determined by the data stream; the recommended standard is 1080p (HD). For streaming formats from internet sources, the preferred format as specified by the service provider should be used.
Tools & Software
Providers may include tools and relevant software in a published corpus, provided that thorough documentation of their installation and execution is included. Because LDC’s published resources should be usable by the broad community, tools that merely convert data to a different format should be omitted in favor of providing the converted data itself.