Past Projects

ACE (Automatic Content Extraction) 

In support of the ACE Program, LDC developed text corpora in English, Chinese and Arabic annotated for entities, the relations among them and the events in which they participate.

Dialectal Arabic Dictionary Project (US DOE)

LDC and Georgetown University Press completed work to update Arabic lexical databases for three Arabic dialects (with English translations). The databases are based on three GUP source dictionaries for the Iraqi, Syrian and Moroccan dialects.

EARS (Effective, Affordable, Reusable Speech-to-Text) (DARPA)

To support EARS, LDC provided broadcast news and telephone conversations, transcripts, pronouncing lexicons and texts for language modeling in English, Chinese and Arabic. Specifically, LDC created high-quality careful transcripts of English, Chinese and Arabic conversational telephone and broadcast news speech to support the EARS Speech-to-Text (STT) evaluations. LDC also created annotated corpora and guidelines to support the EARS Metadata Extraction (MDE) program.

Emotional Prosody

LDC created and published a small corpus to support emotional prosody research. The data consists of recordings and transcripts of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning fourteen distinct emotional categories.


FORM

The goal of the FORM project was to develop a corpus annotated with multimodal information concerning conversational interaction. LDC prepared a corpus of gesture-annotated videos, adding layers of speech transcription and intonational information.

GALE (Global Autonomous Language Exploitation) (DARPA)

LDC developed integrated linguistic resources and related infrastructure to support language exploitation technologies within the DARPA GALE Program.

LCTL (Less Commonly Taught Languages)

LDC created linguistic resources including monolingual and parallel text, lexicons, encoding converters, word and sentence segmenters, morphological analyzers, named entity taggers, annotations, annotation infrastructure and specifications for a number of "less commonly taught languages," including Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Punjabi, Tagalog, Thai, Tamil, Urdu and Yoruba.

Machine Reading (DARPA)

The resources LDC provided for the Machine Reading program supported the training and evaluation of reading systems that extract targeted relations from text and then represent and convert that extracted information into formal knowledge.


LDC collects speech data from a large number of participants using a Mixer-style telephone collection platform. For some participants, the telephone calls are coupled with a series of sociolinguistic interviews conducted at facilities equipped with multichannel recording devices. These facilities are also used to make telephone calls that are recorded both by the platform and on the multichannel devices. The participants represent a wide range of demographics, and we are collecting both monolingual and bilingual conversations in close to 30 languages, which include a variety of dialects and accents.

QLDB (Querying Linguistic Databases)

This project investigated data models and query languages for linguistic databases.


SPINE2

LDC prepared and transcribed audio files to support the second phase of speech recognition in noisy environments after completing work on SPINE1.

Switchboard (SWB) Cellular Phase II

This project involved a small Switchboard collection in which each of 210 speakers participated in an average of 10 telephone calls. The data can be used to support research in speaker verification or speech recognition.

Switchboard (SWB) Cellular Transcription

LDC transcribed five minutes of each of 250 different conversations (500 sides) compiled from the SWB Cellular Phase I (GSM) collection. This data can be used for speech-to-text systems, R&D and evaluation under conditions of vocoded speech.

TalkBank (NSF)

TalkBank was an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data.

TIDES (Translingual Information Detection Extraction and Summarization) (DARPA)

LDC collected data in each supported Voice of America broadcast language to support this DARPA project. TIDES was supported by three distinct tasks: Extraction, HARD (high accuracy retrieval from documents) and TDT (topic detection and tracking).