Linguistic Resources

Data Providers and Corpus Authors

The Linguistic Data Consortium, and the research communities it serves, benefit from the generosity of data providers and corpus authors. Data providers are organizations, such as television and radio broadcasters and news agencies, that allow their products to be used for research purposes. Many researchers who produce corpora of linguistic data for their own use also agree to make their work available through the consortium so that it can benefit a wider audience. This page provides information for both data providers and corpus authors. Please note that one does not need to be a Member of the LDC in order to provide data or have an authored corpus distributed by the LDC.

Data Providers

The mission of the Linguistic Data Consortium (LDC) is to support language research and education by providing language resources including data, tools and standards. As an activity of the University of Pennsylvania, the LDC is a tax-exempt charitable organization under § 501(c)(3) of the U.S. tax code. Donations of data or other resources to the LDC may be eligible for tax deductions.

Language data is a broad term. Any substantial body of information rendered in a human language can serve as language data. Language scholars often collect specialized databases of language behavior such as interviews, customer service interactions or lectures. However, other sources such as radio and television broadcasts, news wires, web sites, books, magazines, newspapers, court transcripts and even telephone conversations are equally appropriate and certainly more plentiful. To support language researchers, the LDC collects, annotates and distributes all of these types of material.

Individual researchers have very different needs for the data LDC provides. In speech recognition, engineers use spoken data and accompanying transcripts to build models that relate the acoustic characteristics of the spoken word to its representation in writing.
In information retrieval, researchers look for indicators within a document that signal its relevance to a specific user query. In language learning, teachers search databases of text and speech to locate examples of specific words, pronunciations or grammatical constructions.

One of LDC's most important roles since 1992 has been to act as an intermediary for intellectual property rights. Because LDC staff understand the needs of both the research community and the publishers of information, many information publishers are willing to provide LDC with data. LDC has established fruitful working relationships with over 70 information publishers, including major television networks, cable television companies, commercial and public radio broadcast companies, newswire agencies, government information bureaus and private news-oriented web sites.

Information providers find that LDC's uses help their product reach new market segments. By using LDC data, LDC member organizations become aware of the products, their quality and volume characteristics, and the publisher's charitable activities. However, LDC's uses do not undermine normal consumption of information. LDC's packaging and distribution of information products for research purposes is incompatible with their normal use as information products. For example, although the University of Pennsylvania is an LDC member with all rights and privileges, organizations elsewhere in the University that use information products (the Library, the University radio station, the general counsel's office) maintain their own subscriptions to these services regardless of the fact that LDC also receives this material. That is because information consumers need the material in real time and in a specific format with appropriate viewing software. The timeline, format and tools used for education, research and development are different and incompatible.
As an intermediary between information providers and researchers, LDC reduces the demand on providers' external relations staff. In place of dozens of miscellaneous requests for donations of data, LDC providers deal exclusively with LDC. LDC communicates the provider's restrictions to LDC members and handles any usage agreements that may be necessary. Given LDC's experience with information providers, LDC standard agreements incorporate reasonable protection of the providers' rights.

LDC data supports research and development in important areas of current speech technology, including automatic speech recognition, speech synthesis, information retrieval and language teaching. Leading researchers in each of these areas are LDC members. Ready access to large bodies of language data contributes to improvements in technology that ultimately benefit everyone, especially the original information providers.

In some cases LDC information providers do not archive their own data, or else archive it in an inadequate format. LDC then becomes the best, or only, source of archival information, and it offers free copies of digital data archives back to the original providers.

Please contact our Intellectual Property group if you would like to become an LDC data provider.

Corpus Authors

The LDC has corpora from authors in academia, government, and private enterprise, spanning many different data types and structures. We have four major objectives for the corpora we release:
If you are interested in having the LDC publish your corpus, please contact us at ldc@ldc.upenn.edu with:
After we receive an initial inquiry with the information above, we will review it and make an initial determination of whether the LDC can publish the corpus. After this determination, which can take between a week and a month, we will contact the primary contact person and set up a schedule for delivery of the data to the LDC, as well as any other interim dates, such as those for delivery of documentation, IPR agreements, or quality control methods.

To provide data to the LDC

When the data is delivered to the LDC, we will need the following information submitted along with the data: The raw data structured at the top level in the following directories:
Note:
The LDC will schedule the release of the corpus, in agreement with the author(s), and will make every effort to publish the corpus on time. However, we depend on corpus authors for the timely delivery of data and information and, generally, are not able to provide assistance beyond packaging, production, and distribution of the corpora. The actual release of the publication is sometimes rescheduled because of the urgency of other corpora.

The LDC exists in order to provide for dissemination of linguistic data, and we look forward to working with anyone who might wish to have us publish their data.