Past Projects

ACE (Automatic Content Extraction) 

In support of the ACE Program, LDC developed text corpora in English, Chinese and Arabic annotated for entities, the relations among them and the events in which they participate.

Dialectal Arabic Dictionary Project (US DOE)

LDC and Georgetown University Press completed work to update lexical Arabic databases for three Arabic dialects (with English translations). The databases are based on three GUP source dictionaries in Iraqi, Syrian and Moroccan dialects.

EARS (Effective, Affordable, Reusable Speech-to-Text) (DARPA)

To support EARS, LDC provided broadcast news and telephone conversations, transcripts, pronouncing lexicons and texts for language modeling in English, Chinese and Arabic. Specifically, LDC created high-quality careful transcripts of English, Chinese and Arabic conversational telephone and broadcast news speech to support the EARS Speech-to-Text (STT) evaluations. LDC also created annotated corpora and guidelines to support the EARS Metadata Extraction (MDE) program.

Emotional Prosody

LDC created and published a small corpus to support emotional prosody research. The data consists of recordings and transcripts of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning fourteen distinct emotional categories.

FORM

The goal of the FORM project was to develop a corpus annotated with multi-modal information concerning conversational interaction. LDC prepared a corpus of gesture-annotated videos, adding layers of speech transcription and intonational information.

GALE (Global Autonomous Language Exploitation) (DARPA)

LDC developed integrated linguistic resources and related infrastructure to support language exploitation technologies within the DARPA GALE Program.

LCTL (Less Commonly Taught Languages)

LDC created linguistic resources including monolingual and parallel text, lexicons, encoding converters, word and sentence segmenters, morphological analyzers, named entity taggers, annotations, annotation infrastructure and specifications for a number of "less commonly taught languages," including Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Punjabi, Tagalog, Thai, Tamil, Urdu and Yoruba.

Machine Reading (DARPA)

The resources LDC provided for the Machine Reading program supported the training and evaluation of reading systems that extracted targeted relations from text and then represented and converted that extracted information into formal knowledge.

MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) (DARPA)

The goal of the DARPA MADCAT Program was to automatically convert foreign text images into English transcripts. LDC supported MADCAT by collecting handwritten documents in Arabic and Chinese, scanning texts at a high resolution, annotating the physical coordinates of each line and token, and transcribing and translating the content into English. LDC also supported the evaluation of MADCAT technologies by post-editing machine translation system output during annual evaluations conducted by NIST.

Mixer 

The Mixer project promoted the development of robust speaker recognition technology by providing speech data in which large pools of speakers were simultaneously recorded across numerous microphones, in different communicative situations and/or in multiple languages. Mixer data has been used for technology evaluation in NIST's SRE (Speaker Recognition Evaluation) and LRE (Language Recognition Evaluation) campaigns since 2004. Mixer also included work completed for the LVDID projects and MIXER Greybeard.

QLDB (Querying Linguistic Databases)

This project investigated data models and query languages for linguistic databases.

SPINE2

LDC prepared and transcribed audio files to support the second phase of speech recognition in noisy environments after completing work on SPINE1.

Switchboard (SWB) Cellular Phase II

This project involved a small Switchboard collection in which each of 210 speakers participated in an average of 10 telephone calls. The data can be used to support research in speaker verification or speech recognition.

Switchboard (SWB) Cellular Transcription

LDC transcribed five minutes of each of 250 different conversations (500 sides) compiled from the SWB Cellular Phase I (GSM) collection. This data can be used for speech-to-text system R&D and evaluation under conditions of vocoded speech.

TalkBank (NSF)

TalkBank was an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data.

TIDES (Translingual Information Detection, Extraction and Summarization) (DARPA)

LDC collected data in each supported Voice of America broadcast language to support this DARPA project. TIDES was supported by three distinct tasks: Extraction, HARD (high accuracy retrieval from documents) and TDT (topic detection and tracking).