Spotlight on LDC Data Scholarships

LDC’s Data Scholarship program provides eligible students with no-cost access to LDC data, supporting research projects in human language technology and related fields. The program formalized the Consortium’s long-standing principle that no one with a bona fide research agenda and an inability to contribute should go without data (DiPersio & Cieri, 2016).

Since the first scholarships in 2010, LDC has awarded 234 data sets valued at USD 416,925 to 156 students from 38 countries. Most recipients report that the data they received was vital to their research, and many have published papers about their work.

Barlian Henryranu Prasetio, a PhD candidate in the Department of Environmental Robotics at the University of Miyazaki (Japan), was awarded two LDC data sets in the Spring 2018 cycle, SUSAS (LDC99S78) and SUSAS Transcripts (LDC99T33), for his research project to train and test a voice stress recognition system. SUSAS, which stands for Speech Under Simulated and Actual Stress, is a speech database covering a wide variety of stresses and emotions, including speech recorded under noisy conditions. Barlian found the SUSAS data extremely valuable for his research. He continues to use these corpora in experiments and has published six research papers about his work (including Prasetio et al., 2018; Prasetio et al., 2019; Prasetio et al., 2020). Barlian credits the data scholarship program with having a positive impact on his academic career.

Visit the Data Scholarships page for more information about the LDC Data Scholarship program, including application requirements, biannual deadlines, and evaluation criteria.

Data Management at LDC

Human language technologies require large amounts of data to train, develop and test models and systems. There is a direct relationship between data quality and system effectiveness: good data makes good systems.

LDC ensures that the community has access to high-quality data sets through effective data management practices that cover such matters as accessibility, usability, curation and archiving.   

LDC’s data curation process includes intake, data review, quality checks, metadata creation, documentation development and archive management. Data deposited at LDC and published in the Catalog is archived in a logical data tree subject to a specialized backup system from which it can be migrated to new formats, platforms and storage media in accordance with best practices in the digital preservation community (DiPersio & Cieri, 2019).    

These activities add value to data and are critical for data discovery and retrieval, usability, and long-term data sharing and re-use for future research.   

LDC’s Catalog is a CoreTrustSeal-certified data repository that meets high standards for data access, rights management, curation, data integrity and authenticity, archival storage and security. The Catalog also consistently receives the highest (five-star) rating for metadata quality from the Open Language Archives Community (OLAC). 

Visit Data Management for information on using LDC data, contributing data to the Catalog, data management plans and citing LDC data. 

Classic Corpora in LDC’s Catalog: Penn Treebank

The LDC Catalog features classic corpora responsible for critical advances in human language technology that continue to influence researchers. Among them are the Penn Treebank releases, Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42).

The Penn Treebank project (1989-1996) produced seven million words tagged for part-of-speech, three million words of parsed text, over two million words annotated for predicate-argument structure and 1.6 million words of transcribed speech annotated for speech disfluencies (Taylor et al., 2003). Source material represents a diverse range of data, including Wall Street Journal (WSJ) articles, the Brown Corpus and Switchboard telephone conversations. 

Penn Treebanks are used for a wide range of purposes, including the creation and training of parsers and taggers, work on machine translation and speech recognition, and research concerning joint syntactic and semantic role labeling. Their ongoing influence is evidenced by the popularity of Treebank-3 (LDC99T42), which continues to be one of LDC’s top ten most distributed corpora in the Catalog. In addition, the WSJ section has served as a model for treebanks across many languages (Nivre, 2008).

The Penn Treebank has inspired related annotation schemes, such as Proposition Bank, the Penn Discourse Treebank project, and word alignment annotation. In addition, LDC has developed revised English treebank guidelines resulting in the re-issue of the WSJ section (English News Text Treebank: Penn Treebank Revised (LDC2015T13)) and treebanked web text (e.g., English Web Treebank (LDC2012T13) and BOLT English Translation Treebank – Chinese Discussion Forum (LDC2020T09)).   

Penn Treebank corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.