Current Projects

LDC is involved in a number of projects that support language education, research and technology development.

BOLT (Broad Operational Language Translation) (DARPA)

The DARPA BOLT Program will create new techniques for automated translation and linguistic analysis that can be applied to informal genres of text and speech common to online and in-person communications. LDC supports the BOLT Program by collecting informal data sources including discussion forums, text messaging and chat in English, Chinese and Egyptian Arabic, and applying annotations including translation, word alignment, Treebanking, PropBanking, co-reference and queries/responses. LDC also supports the evaluation of BOLT technologies by post-editing machine translation system output and assessing IR system responses during annual evaluations conducted by NIST.

DEFT (Deep Exploration and Filtering of Text) (DARPA)

The DARPA DEFT Program will develop automated systems to process text information and enable the understanding of connections in text that might not be readily apparent to humans. LDC supports the DEFT Program by collecting, creating and annotating a variety of data sources to support Smart Filtering, Relational Analysis and Anomaly Analysis.

HAVIC (Heterogeneous Audio Visual Internet Collection)

The HAVIC corpus comprises thousands of hours of real-world amateur video data, annotated for features including topics and events depicted in the video (or its corresponding audio). Currently, the HAVIC corpus is being used to support the NIST TRECVid Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) Evaluations.

Language Application Grid (NSF)

The Language Application Grid is an NSF-sponsored collaboration involving Vassar College, Brandeis University, Carnegie Mellon University and LDC. The stated goal is to develop a platform for natural language processing tools and resources that can be used and accessed by any researcher or developer.

Language Preservation 2.0: Crowdsourcing Oral Language Documentation Using Mobile Devices (NSF)

LDC and the University of Melbourne have joined forces to collect stories and oral histories from speakers of endangered languages in Brazil and Papua New Guinea. In addition to making these recordings, which are collected via mobile hand-held devices, researchers will collect speaker information and transcribe the speech data. This work is supported by a Documenting Endangered Languages grant from NSF.

LRE (Language Recognition Evaluation) (NIST)

LDC develops linguistic resources to support the NIST LRE series. The LRE-11 corpus included narrowband broadcast news speech and conversational telephone speech in 24 languages, including several closely related/confusable varieties. Collection of the next LRE corpus is underway.

MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) (DARPA)

The goal of the DARPA MADCAT Program is to automatically convert foreign text images into English transcripts. LDC supports MADCAT by collecting handwritten documents in Arabic and Chinese, scanning texts at a high resolution, annotating the physical coordinates of each line and token, and transcribing and translating the content into English. LDC also supports the evaluation of MADCAT technologies by post-editing machine translation system output during annual evaluations conducted by NIST.

Mixer (NIST)

The Mixer project promotes the development of robust speaker recognition technology by providing speech data in which large pools of speakers are simultaneously recorded across numerous microphones, in different communicative situations and/or in multiple languages. Mixer data has been used for technology evaluation in NIST's SRE (Speaker Recognition Evaluation) and LRE (Language Recognition Evaluation) campaigns since 2004. Mixer also includes work done for the LVDID projects and MIXER Greybeard.

OpenMT (Machine Translation) (NIST)

LDC supports the NIST Open Machine Translation (OpenMT) Evaluation series by developing test sets in multiple languages and genres and by sharing linguistic resources developed in other programs including DARPA GALE and TIDES. The objective of the OpenMT evaluation series is to support research in machine translation (MT) technologies -- technologies that translate text between human languages -- and to advance the state of the art in the MT field. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

For MT12, which took place in spring 2012, LDC provided source data and reference translations for the evaluation of Arabic, Chinese, Dari, Farsi, and Korean to English translations of newswire and web text.

Prosodic Systems in New Guinea (NSF)

This project is NSF-sponsored research conducted by Steven Bird in connection with UC Berkeley, the University of Pennsylvania and the Australian National University. The project is collecting new bodies of recorded and transcribed data from undescribed tone languages in New Guinea. It will use computational and theoretical methods to analyze the geographical distribution of tonal properties and the interaction of tone and other prosodic features.

RATS (Robust Automatic Transcription of Speech) (DARPA)

The DARPA RATS Program will develop algorithms and software for performing basic speech processing on potentially speech-containing signals received over communication channels that are extremely noisy and/or highly distorted. LDC supports the RATS Program by collecting conversational data in multiple languages and annotating collected speech to provide training, development and test data for four tasks: Speech Activity Detection, Language ID, Speaker ID and Keyword Spotting. LDC also supports the evaluation of RATS technologies by adjudicating system output against human gold standard annotations, as part of annual evaluations conducted by SAIC.

SRE (Speaker Recognition Evaluation) (NIST)

LDC develops linguistic resources to support the NIST Speaker Recognition Evaluation (SRE) series. For the SRE-12 evaluation, LDC collected multiple telephone calls from each of 414 English speakers who were also present in earlier SRE corpora. All calls were audited for language, speaker identity and other features.

TAC (Text Analysis Conference) KBP

The Text Analysis Conference (TAC) is a series of evaluation workshops organized by NIST to encourage research in Natural Language Processing and related applications. LDC provides linguistic resources including source data, annotations and system assessment for the KBP (Knowledge Base Population) Track, which promotes research in automated systems that can discover information about named entities as found in a large corpus and incorporate this information into a knowledge base.