What's New:

LDC will be closed in observance of the Memorial Day holiday on Monday, May 30. We will resume normal business hours on Tuesday, May 31.

April 2022 marks the beginning of LDC’s 30th year as the leader in language resource development and distribution. Founded in 1992, the Consortium has grown from a data repository to a vibrant data center that creates, shares and preserves language resources for research, education and technology development. The Catalog continues to grow, housing over 900 titles in more than 90 languages. With the support of members, licensees, sponsors and collaborators, LDC has distributed over 200,000 copies of data to more than 6,000 organizations worldwide. We are sincerely grateful to the community, and we pledge to continue the mission to provide diverse data, high-quality member services and research program support. 

Stay tuned for upcoming newsletter highlights from the last three decades!

LDC is releasing Ukrainian data it developed in the DARPA AIDA program, the NIST Language Recognition Evaluation series and the DARPA LORELEI program under a special no-cost, limited license for disaster and refugee relief research.

These resources are available in three corpora:
 
LDC2022E06     AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LDC2020T24     LORELEI Ukrainian Representative Language Pack
LDC2020T10     LORELEI Entity Detection and Linking Knowledge Base

For further information about these data sets and licensing terms, see Disaster and Refugee Relief Research.

LDC Submissions is a platform that provides infrastructure and resources for sharing data through the Catalog. After registering for a user account, corpus submitters can create a submission, upload files, and communicate with LDC’s publications team during the review process. After all reviews are complete, the final, release-ready version of your data set is uploaded to the platform and enters the publications queue. 

Sharing your corpus through LDC ensures access to the global research community and the permanent preservation of your data according to best practices for archiving digital language resources. Get started and register for an LDC Submissions user account today.

LDC’s language resources now include a Digital Object Identifier (DOI), an internationally recognized identification standard for online digital material. This means that LDC data sets have four persistent identifiers: a unique LDC number, ISBN, ISLRN and DOI. DOIs are alphanumeric strings that correspond to URLs and metadata for specified resources. They are expressed as links that resolve to the object’s online location. For example, the DOI for Penn Parsed Corpora of Historical English LDC2020T16 is https://doi.org/10.35111/4hzx-5483, which leads users to the LDC catalog entry for this data set. To facilitate its assignment and administration of DOIs, LDC has joined DataCite, a global DOI provider for research data. Adding DOIs is consistent with our aim to follow best practices for archiving and curating digital resources, evidenced by the CoreTrustSeal certification, which recognizes the LDC Catalog as a trustworthy data repository.
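
For readers who handle DOIs in scripts, the sketch below shows one way to split a DOI link such as the example above into its two parts. It relies on general DOI syntax (a prefix beginning with "10.", a slash, then a suffix), which this newsletter does not itself describe, and the function name split_doi_url is hypothetical, chosen only for illustration. It makes no network request.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_doi_url(url: str) -> Tuple[str, str]:
    """Split a DOI link into (prefix, suffix).

    Hypothetical helper for illustration; assumes the link uses the
    doi.org resolver and the standard "10.<registrant>/<suffix>" form.
    """
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"doi.org", "dx.doi.org"}:
        raise ValueError("not a doi.org link: " + url)

    # The DOI itself is the URL path without its leading slash.
    doi = parsed.path.lstrip("/")

    # Everything before the first slash is the prefix; the rest is the suffix.
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError("not a well-formed DOI: " + doi)
    return prefix, suffix


prefix, suffix = split_doi_url("https://doi.org/10.35111/4hzx-5483")
print(prefix, suffix)
```

Keeping the prefix and suffix separate can be convenient when storing identifiers alongside an LDC catalog number, since the full link can always be rebuilt by joining them with a slash after https://doi.org/.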

 

 

Web pages about data management plans (DMPs) describe the Consortium’s capabilities to develop and implement project-specific proposals. To satisfy requirements from funders like the National Science Foundation (NSF) that researchers deposit data in an accessible, trustworthy repository, LDC provides archiving services and makes data publicly available at a reasonable cost while protecting intellectual property rights and addressing privacy concerns.

Browse the pages to learn more about the advantages of data center distribution, the details of NSF DMP requirements and the infrastructures and processes LDC has in place for storing and distributing resources over the long term.