Past Projects

ACE (Automatic Content Extraction) 

In support of the ACE Program, LDC developed text corpora in English, Chinese and Arabic annotated for entities, the relations among them and the events in which they participate.

BOLT (Broad Operational Language Translation) (DARPA)

The DARPA BOLT Program created new techniques for automated translation and linguistic analysis that can be applied to informal genres of text and speech common to online and in-person communications. LDC supported the BOLT Program by collecting informal data sources including discussion forums, text messaging and chat in English, Chinese and Egyptian Arabic, and applying annotations including translation, word alignment, Treebanking, PropBanking, co-reference and queries/responses. LDC also supported the evaluation of BOLT technologies by post-editing machine translation system output and assessing IR system responses during annual evaluations conducted by NIST.

Dialectal Arabic Dictionary Project (US Department of Education)

LDC and Georgetown University Press (GUP) completed work to update Arabic lexical databases, with English translations, for three Arabic dialects. The databases are based on three GUP source dictionaries covering the Iraqi, Syrian and Moroccan dialects.

EARS (Effective, Affordable, Reusable Speech-to-Text) (DARPA)

To support EARS, LDC provided broadcast news and telephone conversations, transcripts, pronouncing lexicons and texts for language modeling in English, Chinese and Arabic. Specifically, LDC created high-quality careful transcripts of English, Chinese and Arabic conversational telephone and broadcast news speech to support the EARS Speech-to-Text (STT) evaluations. Also, LDC created annotated corpora and guidelines to support the EARS Metadata Extraction (MDE) program.

Emotional Prosody

LDC created and published a small corpus to support emotional prosody research. The data consists of recordings and transcripts of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning fourteen distinct emotional categories.


FORM

The goal of the FORM project was to develop a corpus annotated with multi-modal information concerning conversational interaction. LDC prepared a corpus of gesture-annotated videos, adding layers of speech transcription and intonational information.

GALE (Global Autonomous Language Exploitation) (DARPA)

LDC developed integrated linguistic resources and related infrastructure to support language exploitation technologies within the DARPA GALE Program.

HAVIC (Heterogeneous Audio Visual Internet Collection) (IARPA)

The HAVIC Corpus comprises thousands of hours of real-world amateur video data, annotated for features including topics and events depicted in the video (or its corresponding audio). It was used to support the NIST TRECVid Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) Evaluations.

Hearables Challenge (NSF)

The National Science Foundation sponsored the Hearables Challenge to develop algorithms or methods that could improve hearing in a noisy setting, especially in the very challenging situation of sustaining conversation where the background noise also includes conversation.

Language Preservation 2.0: Crowdsourcing Oral Language Documentation Using Mobile Devices (NSF)

LDC and the University of Melbourne joined forces to collect stories and oral histories from speakers of endangered languages in Brazil and Papua New Guinea. In addition to these recordings, collected via mobile hand-held devices, researchers also collected speaker information and transcribed the speech data. This work was supported by a Documenting Endangered Languages grant from NSF.

LCTL (Less Commonly Taught Languages)

LDC created linguistic resources including monolingual and parallel text, lexicons, encoding converters, word and sentence segmenters, morphological analyzers, named entity taggers, annotations, annotation infrastructure and specifications for a number of "less commonly taught languages," including Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Punjabi, Tagalog, Thai, Tamil, Urdu and Yoruba.

Machine Reading (DARPA)

The resources LDC provided for the Machine Reading program supported the training and evaluation of reading systems that extracted targeted relations from text and then represented and converted that extracted information to formal knowledge.

MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) (DARPA)

The goal of the DARPA MADCAT Program was to automatically convert foreign text images into English transcripts. LDC supported MADCAT by collecting handwritten documents in Arabic and Chinese, scanning texts at a high resolution, annotating the physical coordinates of each line and token, and transcribing and translating the content into English. LDC also supported the evaluation of MADCAT technologies by post-editing machine translation system output during annual evaluations conducted by NIST.


Mixer

The Mixer project promoted the development of robust speaker recognition technology by providing speech data in which large pools of speakers were simultaneously recorded across numerous microphones, in different communicative situations and/or in multiple languages. Mixer data has been used for technology evaluation in NIST's SRE (Speaker Recognition Evaluation) and LRE (Language Recognition Evaluation) campaigns since 2004. Mixer also included work completed for the LVDID projects and MIXER Greybeard. Mixer 6 Speech contains data collected for the project by LDC in 2009-2010.

Prosodic Systems in New Guinea (NSF)

This research was conducted jointly by UC Berkeley, the University of Pennsylvania and the Australian National University. The purpose was to collect new bodies of recorded and transcribed data from undescribed tone languages in New Guinea, and to use computational and theoretical methods to analyze the geographical distribution of tonal properties and the interaction of tone and other prosodic features.

QLDB (Querying Linguistic Databases)

This project investigated data models and query languages for linguistic databases.

RATS (Robust Automatic Transcription of Speech) (DARPA)

The DARPA RATS Program developed algorithms and software for performing basic speech processing on potentially speech-containing signals received over communication channels that are extremely noisy and/or highly distorted. LDC supported the RATS Program by collecting conversational data in multiple languages and annotating collected speech to provide training, development and test data for four tasks: Speech Activity Detection, Language ID, Speaker ID and Keyword Spotting. LDC also supported the evaluation of RATS technologies by adjudicating system output against human gold standard annotations.


SPINE2 (Speech in Noisy Environments)

LDC prepared and transcribed audio files to support the second phase of speech recognition in noisy environments after completing work on SPINE1.

Switchboard (SWB) Cellular Phase II

This project involved a small Switchboard collection in which each of 210 speakers participated in an average of 10 telephone calls. The data can be used to support research in speaker verification or speech recognition.

Switchboard (SWB) Cellular Transcription

LDC transcribed five minutes of each of 250 different conversations (500 sides) compiled from the SWB Cellular Phase I (GSM) collection. This data can be used for research, development and evaluation of speech-to-text systems under conditions of vocoded speech.

TalkBank (NSF)

TalkBank was an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data.

TIDES (Translingual Information Detection, Extraction and Summarization) (DARPA)

LDC collected data in each supported Voice of America broadcast language to support this DARPA project. TIDES was organized around three distinct tasks: Extraction, HARD (High Accuracy Retrieval from Documents) and TDT (Topic Detection and Tracking).