Past Projects

ACE (Automatic Content Extraction) 

In support of the ACE Program, LDC developed text corpora in English, Chinese and Arabic annotated for entities, the relations among them and the events in which they participate.

AIDA (Active Interpretation of Disparate Alternatives) (DARPA)

AIDA’s goal was to develop a multi-hypothesis semantic engine that generates explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

BOLT (Broad Operational Language Translation) (DARPA)

The DARPA BOLT Program created new techniques for automated translation and linguistic analysis that can be applied to informal genres of text and speech common to online and in-person communications. LDC supported the BOLT Program by collecting informal data sources including discussion forums, text messaging and chat in English, Chinese and Egyptian Arabic, and applying annotations including translation, word alignment, Treebanking, PropBanking, co-reference and queries/responses. LDC also supported the evaluation of BOLT technologies by post-editing machine translation system output and assessing IR system responses during annual evaluations conducted by NIST.

COVID-19 Research

The COVID-19 pandemic highlighted the importance of data-driven solutions to facilitate rapid response and humanitarian relief, and its global nature demonstrated the need for multi-language resources. To aid in this effort, LDC released data it developed in the DARPA LORELEI program under a special no-cost license for COVID-19 research that was in effect from June 2020 to June 2021.

DEFT (Deep Exploration and Filtering of Text) (DARPA)

The DARPA DEFT Program developed automated systems to process text information and enable the understanding of connections in text that might not be readily apparent to humans. LDC supported the DEFT Program by collecting, creating and annotating a variety of data sources to support Smart Filtering, Relational Analysis and Anomaly Analysis.

Dialectal Arabic Dictionary Project (US Department of Education)

LDC and Georgetown University Press (GUP) completed work to update Arabic lexical databases for three Arabic dialects (with English translations). The databases are based on three GUP source dictionaries for the Iraqi, Syrian and Moroccan dialects.

EARS (Effective, Affordable, Reusable Speech-to-Text) (DARPA)

To support EARS, LDC provided broadcast news and telephone conversations, transcripts, pronouncing lexicons and texts for language modeling in English, Chinese and Arabic. Specifically, LDC created high-quality careful transcripts of English, Chinese and Arabic conversational telephone and broadcast news speech to support the EARS Speech-to-Text (STT) evaluations. Also, LDC created annotated corpora and guidelines to support the EARS Metadata Extraction (MDE) program.

Emotional Prosody

LDC created and published a small corpus to support emotional prosody research. The data consists of recordings and transcripts of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning fourteen distinct emotional categories.


FORM

The goal of the FORM project was to develop a corpus annotated with multi-modal information concerning conversational interaction. LDC prepared a corpus of gesture-annotated videos, adding layers of speech transcription and intonational information.

GALE (Global Autonomous Language Exploitation) (DARPA)

LDC developed integrated linguistic resources and related infrastructure to support language exploitation technologies within the DARPA GALE Program.

HAVIC (Heterogeneous Audio Visual Internet Collection) (IARPA)

The HAVIC Corpus comprises thousands of hours of real-world amateur video data, annotated for features including topics and events depicted in the video (or its corresponding audio). It was used to support the NIST TRECVid Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) Evaluations.

Hearables Challenge (NSF)

The National Science Foundation sponsored the Hearables Challenge to develop algorithms or methods that could improve hearing in a noisy setting, especially in the very challenging situation of sustaining conversation where the background noise also includes conversation.

KMASS (Knowledge Management at Scale and Speed) (DARPA)

The KMASS program aimed to research, develop, integrate, evaluate, and demonstrate underlying technology that will enable effective use of documented knowledge, acquisition of new knowledge as part of regular workflows, and application of useful knowledge when and where it is required and with necessary granularity. LDC supported KMASS by collecting, creating and annotating multimodal linguistic resources focusing on the medical emergency and contingency operations domains.

Language Application Grid (NSF)

The Language Application Grid was an NSF-sponsored collaboration involving Vassar College, Brandeis University, Carnegie Mellon University and LDC. The stated goal was to develop a platform for natural language processing tools and resources that could be used and accessed by any researcher or developer.

Language Preservation 2.0: Crowdsourcing Oral Language Documentation Using Mobile Devices (NSF)

LDC and the University of Melbourne joined forces to collect stories and oral histories from speakers of endangered languages in Brazil and Papua New Guinea. In addition to these recordings, which were collected via mobile hand-held devices, researchers collected speaker information and transcribed the speech data. This work was supported by a Documenting Endangered Languages grant from NSF.

LCTL (Less Commonly Taught Languages)

LDC created linguistic resources including monolingual and parallel text, lexicons, encoding converters, word and sentence segmenters, morphological analyzers, named entity taggers, annotations, annotation infrastructure and specifications for a number of "less commonly taught languages," including Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Punjabi, Tagalog, Thai, Tamil, Urdu and Yoruba.

LORELEI (Low Resource Languages for Emergent Incidents) (DARPA) 

LORELEI sought to identify the elements that different languages have in common and use that knowledge to enable rapid, low-cost development of automated language capabilities for use with low-resource languages for effective situational awareness. LDC supported LORELEI by collecting, creating and annotating linguistic resources in multiple languages. 

Machine Reading (DARPA)

The resources LDC provided for the Machine Reading program supported the training and evaluation of reading systems that extracted targeted relations from text and then represented and converted that extracted information to formal knowledge.

MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) (DARPA)

The goal of the DARPA MADCAT Program was to automatically convert foreign text images into English transcripts. LDC supported MADCAT by collecting handwritten documents in Arabic and Chinese, scanning texts at a high resolution, annotating the physical coordinates of each line and token, and transcribing and translating the content into English. LDC also supported the evaluation of MADCAT technologies by post-editing machine translation system output during annual evaluations conducted by NIST.


Mixer

The Mixer project promoted the development of robust speaker recognition technology by providing speech data in which large pools of speakers were simultaneously recorded across numerous microphones, in different communicative situations and/or in multiple languages. Mixer data has been used for technology evaluation in NIST's SRE (Speaker Recognition Evaluation) and LRE (Language Recognition Evaluation) campaigns since 2004. Mixer also included work completed for the LVDID projects and MIXER Greybeard. Mixer 4 and 5 Speech contains data collected by LDC and the International Computer Science Institute (ICSI) in 2007. Mixer 6 Speech contains data collected by LDC in 2009-2010.

NIEUW (Novel Incentives and Workflows in Linguistic Data Collection and Annotation) (NSF)

NIEUW was an LDC project supported by an NSF CISE Research Infrastructure planning grant. The goal was to build a framework to develop multilingual language resources employing crowdsourcing techniques proven to work in multiple scientific disciplines. 

OpenMT (Machine Translation) (NIST)

LDC supported the NIST Open Machine Translation (OpenMT) Evaluation series by developing test sets in multiple languages and genres and by sharing linguistic resources developed in other programs including DARPA GALE and TIDES. The objective of the OpenMT evaluation series was to support research in machine translation (MT) technologies -- technologies that translate text between human languages -- and to advance the state of the art in the MT field. Input may include all forms of text. The goal was for the output to be an adequate and fluent translation of the original.

Prosodic Systems in New Guinea (NSF)

This research was conducted jointly by UC Berkeley, the University of Pennsylvania and the Australian National University. The purpose was to collect new bodies of recorded and transcribed data from undescribed tone languages in New Guinea and to use computational and theoretical methods to analyze the geographical distribution of tonal properties and the interaction of tone and other prosodic features.

QLDB (Querying Linguistic Databases)

This project investigated data models and query languages for linguistic databases.

RATS (Robust Automatic Transcription of Speech) (DARPA)

The DARPA RATS Program developed algorithms and software for performing basic speech processing on potentially speech-containing signals received over communication channels that are extremely noisy and/or highly distorted. LDC supported the RATS Program by collecting conversational data in multiple languages and annotating collected speech to provide training, development and test data for four tasks: Speech Activity Detection, Language ID, Speaker ID and Keyword Spotting. LDC also supported the evaluation of RATS technologies by adjudicating system output against human gold standard annotations.


SPINE2

LDC prepared and transcribed audio files to support the second phase of speech recognition in noisy environments after completing work on SPINE1.

Switchboard (SWB) Cellular Phase II

This project involved a small Switchboard collection in which each of 210 speakers participated in an average of 10 telephone calls. The data can be used to support research in speaker verification or speech recognition.

Switchboard (SWB) Cellular Transcription

LDC transcribed five minutes of each of 250 different conversations (500 sides) compiled from the SWB Cellular Phase I (GSM) collection. This data can be used for speech-to-text systems, R&D and evaluation under conditions of vocoded speech. 

TAC (Text Analysis Conference) KBP (NIST) 

The Text Analysis Conference (TAC) was a series of evaluation workshops organized by NIST to encourage research in Natural Language Processing and related applications. LDC provided linguistic resources including source data, annotations and system assessment for the KBP (Knowledge Base Population) Track, which promoted research in automated systems that can discover information about named entities as found in a large corpus and incorporate this information into a knowledge base. 

TalkBank (NSF)

TalkBank was an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data.

TIDES (Translingual Information Detection, Extraction and Summarization) (DARPA)

LDC collected data in each supported Voice of America broadcast language to support this DARPA project. TIDES was organized around three distinct tasks: Extraction, HARD (high accuracy retrieval from documents) and TDT (topic detection and tracking).