Filename Conventions & Metadata
LDC Filename Conventions
LDC observes several file naming conventions to ensure interoperability and to minimize access problems. When naming data, documentation and software files, providers should follow these guidelines:
- Filenames may not begin with a period or any other punctuation symbol. These often have significance as quantifiers and parameters to commands in various operating systems. Beginning a filename with such a symbol can cause an operating system to ignore the file or treat it as a hidden file.
- Filenames may not contain spaces. Spaces in filenames cause problems for Unix/Linux systems and automated processing scripts in many environments. Substitute an underscore (“_”) for each space.
- Filenames should contain only the following ASCII characters:
- Unaccented letters (A-Z, a-z)
- Digits (0-9)
- Use only the punctuation marks period, underscore, plus sign, tilde and hyphen (._+~-). Other punctuation marks may be interpreted as control characters in some operating systems.
- Do not use the punctuation marks /, \, ?, *, :, |, ", ‘, < or >. These characters are often interpreted by operating systems as commands or parameters to commands.
- Filenames should have meaningful extensions such as “.txt” for text files, “.dtd” for DTD files and so on. Providers should not assign meanings to extensions that are at odds with their commonly understood meanings. For instance, a file with a '.wav' extension is commonly understood to be a Microsoft Wave file. Similarly, '.tgz' is commonly understood to be a Unix tar file that has been compressed with GNU Zip.
- Providers should avoid using reserved words as filenames. Such filenames may be invisible or behave unpredictably on any operating system that reserves them. The following is an incomplete list of common reserved words:
- AUX (aux)
- CON (con)
- COM1 (com1)
- COM2 (com2)
- COM3 (com3)
- COM4 (com4)
- LPT1 (lpt1)
- LPT2 (lpt2)
- LPT3 (lpt3)
- NUL (nul)
- PRN (prn)
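The naming rules above lend themselves to an automated check before submission. The following is a minimal sketch in Python, not an LDC tool: the function name `check_filename` and the wording of its messages are hypothetical, the reserved-word set contains only the incomplete list given above, and comparing the part of the name before the first period against that list is an assumption about how the rule should be applied.

```python
import re

# Characters permitted by the guidelines: unaccented letters, digits and . _ + ~ -
ALLOWED_CHAR = re.compile(r"[A-Za-z0-9._+~-]")

# Incomplete list of reserved words, copied from the guidelines above.
RESERVED = {
    "AUX", "CON", "COM1", "COM2", "COM3", "COM4",
    "LPT1", "LPT2", "LPT3", "NUL", "PRN",
}


def check_filename(name):
    """Return a list of guideline violations for a bare filename (no directories)."""
    if not name:
        return ["empty filename"]
    problems = []
    first = name[0]
    if not (first.isascii() and first.isalnum()):
        problems.append("does not begin with a letter or digit")
    if " " in name:
        problems.append("contains a space")
    # Report every other character outside the allowed set, once each.
    bad = sorted({ch for ch in name if ch != " " and not ALLOWED_CHAR.fullmatch(ch)})
    if bad:
        problems.append("contains disallowed characters: " + "".join(bad))
    # Assumption: a name such as "aux.txt" is treated as reserved as well.
    stem = name.split(".", 1)[0]
    if stem.upper() in RESERVED:
        problems.append("uses a reserved word")
    return problems
```

An empty list means the name passed every check in the sketch. Running the function over each filename in a delivery directory gives a quick report of names to fix.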
Metadata and Documentation
Metadata and documentation are essential for usability. The following represents LDC's preferred best practices.
Corpus titles should be descriptive and should inform potential users about the nature of the publication. The title should encapsulate the data type, the project name where appropriate, and the language(s). The words "corpus" and "data" should not be used in most titles since they would be redundant. The title should indicate the version of the corpus. If the publication is part of a series, then that should also be noted. Examples include:
- Chinese Proposition Bank 2.0
- GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
- NPS Internet Chatroom Conversations, Release 1.0
Authors must be listed by providers. LDC leaves it to the providers to decide who should receive author credit. As a general rule, LDC considers those contributors who assume a sustained decision-making role in corpus design, structure or content to be authors.
LDC sorts corpora into categories based on data type.
Speech: Any corpus that primarily contains speech or other audio data including meeting speech, conversational telephone speech, studio recordings, field recordings and so on. Nontext data derived from speech, such as Mel Frequency Cepstrum Coefficients, also falls into this category. LDC typically splits audio and transcripts into separate releases.
Text: Any corpus that primarily contains text data. This may include raw transcripts, annotated transcripts, monolingual text, parallel text, treebanks, propbanks and ngram models. Text may be structured with a markup language such as XML or SGML, or it may be delivered raw.
Lexicon: Lexicons include traditional dictionaries, pronouncing lexicons, translation lexicons, named entity lists, glossaries, concordances and other forms that detail vocabulary, grammar and/or syntax.
Video: Any corpus that has a significant visual component falls into the video category.
The original source of the raw data must be clearly indicated. Provide the name of the source, the dates the material was produced and/or collected and whether the authors have reached any previous agreement with the source provider. Authors should indicate when human subjects were used and provide demographic information as appropriate. Such information can include gender, age, nation of origin, native language or educational level. This information is often invaluable for research into spoken language.
Listing the research and technical applications for a corpus helps users identify useful and relevant corpora in the LDC catalog. Whenever possible, authors should provide information about the potential uses of their data.
Languages and Dialects
LDC uses the ISO 639-3 standard for language names and abbreviations. This ISO standard aims to cover all known natural languages, including dialects, and builds on the previous 639-2 standard. The vocabulary also includes many ancient languages and constructed languages. Providers who cannot find their language(s) in the vocabulary should provide a brief description of the language including its geographic distribution and alternate names.
Languages exist in many local varieties and in larger clusters called macrolanguages that group closely related languages. Providers should use the most specific designation available for any given language or dialect. For instance, a corpus of spoken Arabic from the Gulf States should be listed as Gulf Arabic [afb]. In cases of finer dialect distinction, providers should describe the dialect in the narrative description that accompanies the corpus. When a corpus represents a specific vernacular, providers must describe the cultural context in which this dialect exists. Likewise, some varieties are specific to a certain period of time. When this is the case, corpus providers should describe the time periods in which the dialect was used and provide a brief history.
A good narrative description describes the corpus in a qualitative way. Such descriptions are fairly short, typically no more than three paragraphs. Corpora breaking new ground or presenting novel data may require lengthier descriptions. The first paragraph of the description acts as a synopsis and summarizes the corpus. A good synopsis addresses these questions:
What it is: Provide a brief description of the type of publication including the language(s), tasks and data.
Who can use it: Briefly describe who might be interested in the data.
How it can be used: Describe the tasks and applications for which the corpus is suitable.
Language sketches are helpful in putting a corpus in context. Language sketches should follow the synopsis. Any of the so-called under-resourced languages will likely require a fairly detailed description, as will dictionaries and lexicons. A good description includes the region in which the language is spoken, who speaks the language, what distinguishes them and the important distinguishing features of the language. Potential users of the corpus may also want to know the salient features of the language's literature and/or history.
Some corpora are directed at a specific research or development task. In general, language corpora are developed to address some extant problem. What is that problem or, phrased another way, what is the application? How will this corpus be used?
- What is the task?
- What is the purpose of the task?
- What are the components of the task? What are the components of the corpus?
Lastly, providers should describe their data in detail. For audio or video data, the recording conditions must be described in as much detail as possible. Of particular importance are the equipment and methods used in recording and digitization. Also included should be the sample rate, sample format, number of channels, video codecs, frame rate, frame size and file container as appropriate. Beyond this, providers should describe the setting in which the recordings were made.
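Several of the audio parameters listed above can be read directly from the files rather than recorded by hand. The sketch below assumes the audio is delivered as uncompressed PCM WAV files, which the guidelines do not require, and uses Python's standard `wave` module. The function name `describe_wav` and the dictionary keys are hypothetical.

```python
import wave


def describe_wav(path):
    """Collect basic technical parameters from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            # getsampwidth() reports bytes per sample; convert to bits.
            "sample_width_bits": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_seconds": frames / rate if rate else 0.0,
        }
```

Other containers and compressed formats need different tooling. Equipment, recording methods and setting still have to be described in prose.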
For text, providers should describe the encoding and normalization form, markup and particular features of the corpus text format. Transcription and annotation methodologies must also be described.
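When the encoding or normalization form of a text file is uncertain, a quick check can help confirm what to report. The sketch below assumes the text is meant to be UTF-8, which is an assumption rather than a requirement stated here. It reports which Unicode normalization forms the content already satisfies, using Python's standard `unicodedata` module. The function name `describe_text` and the dictionary keys are hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def describe_text(raw_bytes):
    """Report UTF-8 validity and the normalization forms the text already satisfies."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return {"valid_utf8": False, "normalization_forms": []}
    # A text satisfies a form if normalizing it to that form changes nothing.
    forms = [form for form in FORMS if unicodedata.normalize(form, text) == text]
    return {"valid_utf8": True, "normalization_forms": forms}
```

Pure ASCII text satisfies all four forms, so the result narrows things down only for text with accented or composed characters.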
The final piece of information in the description is the span of time encompassed by the data collection. Providers developing work from broadcast, newswire and other media must specify the dates on which the original material was first released (broadcast, published or posted). For providers collecting their own data, specifying the start and end dates of their work should be sufficient.