
Linguistic Resources
Data Providers and Corpus Authors

The Linguistic Data Consortium, and the research communities it serves, benefit from the generosity of data providers and corpus authors. Data providers are organizations like television and radio broadcasters and news agencies that allow their products to be used for research purposes. Many researchers who produce corpora of linguistic data for their own use also agree to make their work available through the consortium so that it can benefit a wider audience. This page provides information for both data providers and corpus authors.

Please note that one does not need to be a Member of the LDC in order to provide data or have an authored corpus distributed by the LDC.

Data Providers

The mission of the Linguistic Data Consortium (LDC) is to support language research and education by providing language resources including data, tools and standards. As an activity of the University of Pennsylvania, the LDC is a tax-exempt charitable organization under § 501(c)(3) of the U.S. tax code. Donations of data or other resources to the LDC may be eligible as tax deductions.

Language data is a broad term. Any substantial body of information rendered in a human language can serve as language data. Language scholars often collect specialized databases of language behavior such as interviews, customer service interactions or lectures. However, other sources such as radio and television broadcasts, news wires, web sites, books, magazines, newspapers, court transcripts and even telephone conversations are equally appropriate and certainly more plentiful. To support language researchers, the LDC collects, annotates and distributes all of these types of material.

Individual researchers have very different needs for the data LDC provides. In speech recognition, engineers use spoken data and accompanying transcripts to build models that relate the acoustic characteristics of the spoken word to its representation in writing. In information retrieval, researchers look for features within a document that signal its relevance to a specific user query. In language learning, teachers search databases of text and speech to locate examples of specific words, pronunciations or grammatical constructions.

One of LDC's most important roles since 1992 has been to act as an intermediary for intellectual property rights. Because LDC staff understand the needs of both the research community and the publishers of information, many information publishers are willing to provide LDC with data. LDC has established fruitful working relationships with over 70 information publishers including major television networks, cable television companies, commercial and public radio broadcast companies, newswire agencies, government information bureaus and private news-oriented web sites.

Information providers find that LDC's uses help their products reach new market segments. By using LDC data, LDC member organizations become aware of the products, their quality and volume characteristics, and the publisher's charitable activities.

However, LDC's uses do not undermine normal consumption of information. LDC's packaging and distribution of information products for research purposes is incompatible with their normal use as information products. For example, although the University of Pennsylvania is an LDC member with all rights and privileges, other organizations within the University that use information products (the Library, the University radio station, the general counsel's office) maintain their own subscriptions to these services even though LDC also receives this material. That's because information consumers need the material in real time and in a specific format with appropriate viewing software. The timeline, format and tools used for education, research and development are different and incompatible.

As an intermediary between information providers and researchers, LDC reduces the demand on providers' external relations staff. In place of dozens of miscellaneous requests for donations of data, LDC's providers deal exclusively with LDC. LDC communicates the provider's restrictions to LDC members and handles any usage agreements that may be necessary. Given LDC's experience with information providers, LDC standard agreements incorporate reasonable protection of the providers' rights.

LDC data supports research and development in important areas of current speech technology, including automatic speech recognition, speech synthesis, information retrieval and language teaching. Leading researchers in each of these areas are LDC members. Ready access to large bodies of language data contributes to improvements in technology that ultimately benefit everyone, especially the original information providers.

Some LDC information providers do not archive their own data, or else archive it in an inadequate format. LDC then becomes the best, or only, source of archival information, and it offers free copies of digital data archives back to the original providers.

Please contact our Intellectual Property group if you would like to become an LDC data provider.

Corpus Authors

The LDC has corpora from authors in academia, government and private enterprise, spanning many different data types and structures. We have four major objectives for the corpora we release:

  • The data must fulfil some need within the research community
  • The LDC must have Intellectual Property Rights (IPR) agreements with the data provider
  • The data must be internally consistent in presentation and representation, or if that is not possible or desirable, then the level of quality control must be noted in the documentation
  • The LDC must provide sufficient documentation to permit our customers to determine the usability of the data and to understand the various components of the corpus

If you are interested in having the LDC publish your corpus, please contact us at ldc@ldc.upenn.edu with:
  1. Name of the corpus (which is very important in itself, but particularly if there are other similarly named corpora) and whether this will be a multiple version corpus
  2. The name(s) of any source data provider(s) as well as any other persons or organizations who may have an IPR interest in the corpus
  3. Name of the Project (if any) for which the corpus was developed
  4. Rough size of the corpus (in K or MB; hours of speech or video data; number of unique words and total number of words for text, etc.)
  5. Description of the corpus and suggested use(s) (two to three paragraphs are generally sufficient)
  6. Information on when and where the corpus might be needed (for example, if the published corpus will be needed at a conference in a specific month, we can make an effort to ensure that the corpus will be released by then)
  7. Primary contact person (both email and telephone number for a single point of contact)
  8. A sample of the data (this is only sometimes needed)

After we receive an initial inquiry with the information above, we will review it and make an initial determination of whether the LDC can publish the corpus. After this determination, which can take between a week and a month, we will contact the primary contact person and set up a schedule for delivery of the data to the LDC as well as any other interim dates, such as delivery of documentation, IPR agreements, or quality control methods.

To provide data to the LDC

When the data is delivered to the LDC, we will need the following information submitted along with the data:

The raw data structured at the top level in the following directories:

  1. doc
  2. data (which might be structured in speech and text, depending on the corpus; the author can further structure the data as they see fit, but our preference is to have the doc/data division at the top level)
  3. other special directories, when necessary (such as a "dtd" directory)
  1. doc directory

    The doc directory should contain a readme file including the following information:

    1. publication title, including version number
    2. authors
      Authorship Guidelines: "Any person who was significantly involved in the planning or execution of the corpus should be listed as author in decreasing order of involvement. This would include the corpus planners, the people who did the production work, as well as the people who coordinated the annotation and/or wrote the specification. It does not necessarily include simple annotation or transcription or the efforts of other personnel who essentially did work for hire."
      -to be specified: names and email addresses of all authors, and one email address that users can contact should they have any questions
    3. data type (text, speech, video, etc)
    4. data sources (broadcast, newswire, newspaper, microphone, telephone, etc.), years of data collection and collection procedures
      -to be specified: newswire or broadcast providers, technical specifications for microphone or telephone, etc.
    5. project (examples: ACE, ATIS, Communicator, DARPA, EARS, HUB4, HUB5, LID, MUC, RM, SID, SPINE, Talkbank, TDT, TIDES, Tipster, TREC, other) where applicable and purpose of the project and of the task; what was accomplished through the corpus (and what remains to be further accomplished)
    6. applications (examples: automatic content extraction, cross-lingual information retrieval, discourse analysis, gesture recognition, gesture synthesis, information detection, information extraction from video, information retrieval, language identification, language modeling, language teaching, machine translation, message understanding, natural language processing, parsing, pronunciation modeling, prosody, speaker identification, speaker verification, speech recognition, speech synthesis, spoken dialogue systems, tagging, topic detection and tracking, other).
    7. languages
    8. special license, if applicable (discussed with the IPR Coordinator)
    9. grant number and funding agency, if applicable
    10. copyright - Specify the copyright statement (Portions © year source...). The copyright must be worked out in advance with the IPR coordinator.
    11. description of the corpus structure and data attributes:
      • data type (text, speech, video, etc.) and file formats
      • number of files, size of the data (if compressed, size of the data before and after compression, and utility used for compression)
      • for text: file format, character encoding, number of unique words, total number of words, size in (K, M or G)bytes
      • for speech (and video): file format, channel count, sampling rate, sampling format, number of hours, size in MB
        -it is helpful to specify how to play audio and video files and what software is required.
        Most of our speech corpora are published in NIST SPHERE format, and some in MS WAV (RIFF) format, but it has become LDC practice to publish audio files primarily in SPHERE format, so it is helpful if you submit them as such.
      • the contents of every directory have to be described in the readme file
      • URL to the project page, for additional documentation or tools, etc. (if applicable)
    12. quality control: A method of quality control. For XML and SGML files a DTD or W3C schema is essential. For any unique tools that are necessary to use the data, we'll need a description of the tool and either the tool itself or a means of obtaining it. We prefer to have the tool provided with the data, but a download URL for the tool will suffice. For the data itself, we will need to know what kind of processing or validation was applied to the data, what it aims to accomplish, and any known problems or issues with the data.

    The doc directory should also contain any relevant papers, annotation guidelines, data collection procedures and other files (speaker demographics, etc.).

  2. data directory - contains the data files, in an appropriate structure

  3. dtd directory - where applicable, if there are one or more DTD or schema files (or other special directories).
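Authors who want to confirm the top-level layout before delivery could use something like the following Python sketch. It is not an LDC tool. The function name `check_layout` is hypothetical, and accepting any file in doc whose name begins with "readme" (in any case) is an assumption, since the guidelines above do not fix the exact file name.

```python
from pathlib import Path


def check_layout(root):
    """Return a list of problems with the top-level corpus layout.

    Only checks what the guidelines state: doc and data directories at
    the top level, and a readme file inside doc. Accepting any name that
    starts with "readme" is an assumption made for this sketch.
    """
    root = Path(root)
    problems = []

    # The doc/data division is expected at the top level.
    for name in ("doc", "data"):
        if not (root / name).is_dir():
            problems.append(f"missing top-level directory: {name}")

    # The doc directory should contain a readme file.
    doc = root / "doc"
    if doc.is_dir():
        has_readme = any(
            p.is_file() and p.name.lower().startswith("readme")
            for p in doc.iterdir()
        )
        if not has_readme:
            problems.append("doc directory has no readme file")

    return problems
```

Calling `check_layout("my_corpus")` returns an empty list when the layout matches the description above. Optional directories such as dtd are deliberately ignored.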

Note:
All data submitted must meet the following specifications:
-permissions: all files should be submitted as 664 (rw-rw-r--), and directories as 775 (rwxrwxr-x).
-there shouldn't be any empty files or directories; if there are any empty files or directories necessary for the distribution, there should be a note of explanation about them in the readme.txt file.
-make sure there aren't any ~ or . files accidentally created and incorporated into the publication
-make sure that all the links work
-there shouldn't be files whose names differ only in case, such as filename.txt and Filename.txt
-file/directory names may not contain spaces (use underscore "_") and may not contain any non-ASCII characters
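
Most of these specifications can be checked automatically before submission. The Python sketch below is one possible way to do so on a POSIX system. It is not an official LDC validator, and the function name `check_submission` is hypothetical. It treats "~ or . files" as names that begin with "." or begin or end with "~", which is an interpretation. It does not check that links work, because that check depends on what kind of links the corpus contains. The top-level directory's own permissions are not checked.

```python
import os
import stat
from pathlib import Path


def check_submission(root):
    """Return a list of problems found under root (empty if none)."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        here = Path(dirpath)

        # No empty directories.
        if not dirnames and not filenames:
            problems.append(f"empty directory: {here}")

        # No two names in one directory that differ only in case.
        seen = {}
        for name in dirnames + filenames:
            key = name.lower()
            if key in seen:
                problems.append(
                    f"names differ only in case: {here / seen[key]} and {here / name}"
                )
            else:
                seen[key] = name

        for name in dirnames + filenames:
            path = here / name

            # Names must be ASCII and free of spaces.
            if " " in name or not name.isascii():
                problems.append(f"name contains spaces or non-ASCII characters: {path}")

            # Stray dot files and editor backups (interpretation, see above).
            if name.startswith((".", "~")) or name.endswith("~"):
                problems.append(f"stray ~ or . file: {path}")

            if path.is_symlink():
                continue  # permissions and size of link targets are not checked

            # Files should be 664 and directories 775.
            mode = stat.S_IMODE(path.stat().st_mode)
            if path.is_dir():
                if mode != 0o775:
                    problems.append(f"directory mode is {mode:o}, expected 775: {path}")
            else:
                if mode != 0o664:
                    problems.append(f"file mode is {mode:o}, expected 664: {path}")
                # No empty files.
                if path.stat().st_size == 0:
                    problems.append(f"empty file: {path}")
    return problems
```

For example, `for line in check_submission("my_corpus"): print(line)` lists each issue on its own line. Any empty files or directories that are intentionally part of the distribution will still be reported and should be explained in the readme as described above.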

The LDC will schedule the release of the corpus, in agreement with the author(s), and will make every effort to publish the corpus on time. However, we are dependent upon corpus authors for the timely delivery of data and information and, generally, are not capable of providing assistance beyond packaging, production, and distribution of the corpora. The actual release of the publication is sometimes rescheduled, due to the urgency of other corpora.

The LDC exists in order to provide for dissemination of linguistic data and we look forward to working with anyone who might wish to have us publish their data.



Contact ldc@ldc.upenn.edu
Last modified: Tuesday, 09-Oct-2007 16:33:35 EDT
© 1992-2009 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.