New Corpora

Linguistic Resources  

New Corpora Archive


Chronological list of our corpus releases from recent years. Please visit The LDC Corpus Catalog for a complete list of the publications LDC distributes.

2008 Releases


OntoNotes Release 2.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text

Penn Discourse Treebank Version 2.0 ~Wall Street Journal text annotated for discourse relations and their arguments

ACE 2005 English SpatialML Annotations ~newswire text annotated for spatial expressions

CSLU: Portland Cellular Telephone Speech Version 1.3 ~cellular telephone speech with orthographic and phonetic transcription

Hungarian-English Parallel Text, Version 1.0 ~approximately two million sentence pairs plus additional resources for Hungarian

2007 Releases


2004 Spring NIST Rich Transcription (RT-04S) Development Data ~development data used for speech-to-text and metadata extraction tasks

Chinese Treebank 6.0 (CTB 6.0) ~780K words with POS-tagging and syntactic bracketing

Arabic Gigaword Third Edition ~comprehensive archive of Arabic news text acquired by the LDC

CSLU: Kids' Speech Version 1.1 ~transcribed read and free response speech

GALE Phase 1 Distillation Training ~English, Chinese and/or Arabic queries and responses for the GALE Distillation task

2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks

MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) ~translated and aligned broadcast news transcripts

CSLU: Apple Words and Phrases ~telephone speech from over 3000 callers

GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 23 hours of Chinese broadcast news selected from a variety of sources

Nationwide Speech Project ~read speech representing the primary regional varieties of American English

Chinese Gigaword Third Edition ~comprehensive archive of Chinese news text acquired by the LDC

2003 NIST Rich Transcription Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks

CSLU: Yes/No Version 1.2 ~18,000 speakers saying "yes" or "no" in response to various questions

GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 17 hours of Arabic broadcast news selected from a variety of sources

Mandarin Affective Speech ~read speech in five different emotional states

2001 Topic Annotated Enron Email Data Set ~manually indexed email data set

Tagged Chinese Gigaword ~newstext annotated with full POS tags

CSLU: Foreign Accented English Release 1.2 ~free response English speech by native speakers of 22 languages

English Gigaword Third Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC

OntoNotes v 1.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text

ISI Chinese-English Automatically Extracted Parallel Text ~over 500K sentence pairs from newswire sources

TRECVID 2003 Keyframes & Transcripts ~keyframes extracted from English language broadcast programming

Fisher Levantine Arabic Conversational Telephone Speech and Transcripts ~279 transcribed telephone conversations totaling 45 hours of speech

TRECVID 2005 Keyframes & Transcripts ~keyframes extracted from Arabic, Chinese and English language broadcast programming

ARL Urdu Speech Database, Training Data ~transcribed read speech from 200 native speakers

ISI Arabic-English Automatically Extracted Parallel Text ~over 1 million sentence pairs from newswire sources

English Chinese Translation Treebank v. 1.0 ~English translation, part-of-speech tagged and treebanked

Levantine Arabic Conversational Telephone Speech, Transcripts ~transcribed conversations from over 900 speakers

2006 Releases


Arabic Broadcast News Speech and Transcripts ~10 hours of transcribed satellite radio news

TDT5 Multilingual Text, Topics and Annotations ~English, Arabic, and Chinese newswire text with corresponding topic relevance annotations

Iraqi Arabic Conversational Telephone Speech and Transcripts ~transcribed conversations from over 250 speakers

French Gigaword First Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC

2004 NIST Speaker Recognition Evaluation ~training and test data from the 2004 evaluation

CSLU: Stories ~transcribed speech with time-aligned phonetic labels

West Point Heroico Spanish Speech ~read and free response speech

Gulf Arabic Conversational Telephone Speech and Transcripts ~transcribed conversations from over 900 speakers

Web 1T 5-gram Version 1 ~English word n-grams and their observed frequency counts

Korean Broadcast News Speech and Transcripts ~transcribed VOA satellite radio news

West Point Korean Speech ~read and free response speech

Prague Dependency Treebank 2.0 ~morphological, syntactic, and semantic annotation of Czech news text

Russian through Switched Telephone Network (RuSTeN) ~multiple recorded calls from 125 speakers

CSLU: Multilanguage Telephone Speech Version 1.2 ~fixed vocabulary and continuous speech in eleven languages

NIST 2003 Language Recognition Evaluation ~to test the detection of a given target language

Spanish Gigaword First Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC

CSLU: Speaker Recognition Version 1.1 ~transcribed speech from 90 speakers including the same utterances recorded multiple times

English-Arabic Treebank V1.0 ~52K Arabic words POS tagged and treebanked with parallel English translations

Middle East Technical University Turkish Microphone Speech V 1.0 ~transcribed speech from 120 speakers

CSLU Spoltech Brazilian Portuguese ~transcribed read and spontaneous speech

Korean Treebank Annotations Version 2.0 ~Korean texts annotated with morphological and syntactic information

N4 NATO Native and Non-Native Speech ~military oriented database of multilingual and non-native speech

Timebank 1.2 ~newstext annotated with temporal information, adding events, times and temporal links between events and times; free copies available!

CSLU: Spelled and Spoken Words ~transcribed speech from over 3000 speakers

Korean Propbanks ~semantic annotation containing over 30K annotated predicate tokens

Speech Controlled Computing ~supports the development of ASR applications in the domain of voice control for the home

ACE 2005 Multilingual Training Corpus ~Arabic, Chinese, and English data annotated for entities, relations, and events

Levantine Arabic QT Training Data Set 5, Speech and Transcripts ~250 hours of transcribed telephone speech

Arabic Gigaword Second Edition ~text from five Arabic news sources

CSLU Voices ~twelve speakers reading phonetically rich sentences

Multiple Translation Chinese (MTC) Part 4 ~human and machine translations of 100 Chinese news stories


Contact ldc@ldc.upenn.edu
Last modified: Tuesday, 22-Apr-2008 11:54:48 EDT
© 1992-2007 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.