What's New:

LDC is open for business and research efforts continue despite the restrictions imposed due to the coronavirus/COVID-19.

The University of Pennsylvania has announced extensive changes to its operations due to the coronavirus/COVID-19. In light of these developments and to protect our staff and local community, LDC staff is mostly working remotely.

However, we continue to provide the level of service that you have come to expect from us. Nearly all LDC corpora are available by web download. We continue to process corpora large enough to require delivery on media for recipients able to accept shipments. We will hold media for recipients currently unable to accept shipments. There may still be a brief delay in responding to voicemail messages, processing certain transactions and shipping media. During this time, we recommend that you contact the Membership Office by email, ldc@ldc.upenn.edu, to ensure a timely response to your inquiry.

To support COVID-19 related research, we have released, under a special no-cost license, LORELEI language packs for more than 20 low-resource languages containing text, annotations, tools and more for rapid response in humanitarian relief scenarios. For further information about these language packs and their licensing terms, visit the COVID-19 Research page.

Congratulations to the recipients of LDC's Fall 2020 data scholarships:

Nicole Dodd: University of California, Davis (USA); MA, Linguistics. Nicole is awarded a copy of Arabic Treebank Part 3 v. 3.2 LDC2010T08 for her research in relative clause processing in Standard Arabic.

Satwik Dutta: University of Texas at Dallas (USA); PhD, Electrical Engineering. Satwik is awarded copies of The CMU Kids Corpus LDC97S63 and CSLU: Kids’ Speech Version 1.1 LDC2007S18 for his work in speech activity detection.

Pedram Hosseini: George Washington University (USA); PhD, Computer Science. Pedram is awarded copies of Penn Discourse Treebank Version 3.0 LDC2019T05 and The New York Times Annotated Corpus LDC2008T19 for his research in automatic detection of causal relations in text.

Mariano Maisonnave: Universidad Nacional del Sur (Argentina); PhD, Computer Science. Mariano is awarded a copy of ACE 2005 Multilingual Training Corpus LDC2006T06 for his work in event extraction.

Mark Sullivan: California State University, Los Angeles (USA); Masters, Applied and Advanced Studies in Education. Mark is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for his research in sentence boundary problems of Chinese L1 speakers in English compositions.

For information about the program, visit the Data Scholarships page.

As of July 2020, LDC’s language resources include a Digital Object Identifier (DOI), an internationally recognized identification standard for online digital material. DOIs are alphanumeric strings that correspond to URLs and metadata for specified resources. They are expressed as links that resolve to the object’s online location. For example, the DOI for Penn Parsed Corpora of Historical English LDC2020T16 is https://doi.org/10.35111/4hzx-5483, which leads users to the LDC catalog entry for this data set. To facilitate its assignment and administration of DOIs, LDC has joined DataCite, a global DOI provider for research data. (DOIs for resources released before July 2020 will be assigned through a process expected to be completed shortly.) LDC data sets now have four persistent identifiers: a unique LDC number, ISBN, ISLRN and DOI. Adding DOIs is consistent with our aim to follow best practices for archiving and curating digital resources, evidenced by the CoreTrustSeal certification, which recognizes the LDC Catalog as a trustworthy data repository.
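
For readers who handle DOIs in scripts, the relationship between a bare DOI and its link form can be sketched as follows. This is a minimal illustration, not an LDC or DataCite tool: the function names are hypothetical, and the only assumptions are that DOIs begin with "10." and that the https://doi.org/ resolver prefix shown in the example above is used. It performs no network access.

```python
from urllib.parse import quote, unquote, urlparse

# Resolver prefix taken from the example link in the text above.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI such as '10.35111/4hzx-5483' into its link form.

    Hypothetical helper for illustration only.
    """
    doi = doi.strip()
    # Accept the common 'doi:' label in front of the identifier.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # A DOI is a prefix beginning with '10.' followed by '/' and a suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the '/' separator.
    return DOI_RESOLVER + quote(doi, safe="/")


def url_to_doi(url: str) -> str:
    """Recover the bare DOI from a doi.org link (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("doi.org", "dx.doi.org"):
        raise ValueError(f"not a doi.org link: {url!r}")
    return unquote(parsed.path.lstrip("/"))


if __name__ == "__main__":
    link = doi_to_url("10.35111/4hzx-5483")
    print(link)
    print(url_to_doi(link))
```

Storing the bare DOI and building the link only when needed keeps citations valid even if the preferred resolver address changes.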

Web pages about data management plans (DMPs) describe the Consortium’s capabilities to develop and implement project-specific proposals. To satisfy requirements from funders like the National Science Foundation (NSF) that researchers deposit data in an accessible, trustworthy repository, LDC provides archiving services and makes data publicly available at a reasonable cost while protecting intellectual property rights and addressing privacy concerns.

Browse the pages to learn more about the advantages of data center distribution, the details of NSF DMP requirements and the infrastructure and processes LDC has in place for storing and distributing resources over the long term.


We've revamped our user services to make it easier than ever to access LDC data. Now you can become an LDC member, request corpora, sign agreements and submit payment online directly from your LDC user account.

You’ll receive email notifications at key points in the transaction, for instance, when an order is created, agreements are signed, payment is received and data is shipped. You can also track the status of a transaction from your user account.

Visit the new Managing Your LDC Account page to learn more about user accounts and their privileges and the steps for online transactions.

As always, thanks to our members, sponsors, collaborators and licensees for your continued support.

Podcasts from the complete set of staff interviews conducted as part of LDC's 20th Anniversary can be accessed from the LDC Blog. Hear what long-time staffers had to say about their experiences at LDC.

Christopher Cieri, Executive Director -- Chris reflects on the road that brought him to LDC, some of his early responsibilities and Consortium activities. 

Mohamed Maamouri, Senior Researcher -- Mohamed recounts his personal and professional experiences and comments on Arabic resource development at LDC.

David Graff, Lead Programmer -- Dave was one of LDC's first staff members and offers some insights on LDC's early days.

Yiwola Awoyale, Moussa Bamba, Researchers -- Yiwola and Moussa discuss how they came to LDC, their work on West African languages and how it benefits multiple communities.

Natalia Bragilveskaya, Business Manager; Ilya Ahtaridis, Membership Coordinator; Marian Reed, Marketing Coordinator -- Natalia, Ilya and Marian recall the early days of LDC and the development of its interactions with the University of Pennsylvania, sponsors, members, licensees and collaborators.