New Corpora

Linguistic Resources  

New Corpora Archive


A chronological list of our corpus releases in recent years. Please visit the LDC Corpus Catalog for a complete list of the publications LDC distributes.

2009 Releases

Chinese Gigaword Fourth Edition ~comprehensive archive of Chinese news text acquired by LDC

CSLU: S4X Release 1.2 ~36 speakers uttering 11 specified words

FactBank 1.0 ~news text with event mentions annotated with degree of factuality

Arabic Newswire English Translation Collection ~550K words of Arabic newswire text and its English translation

BioProp Version 1.0 ~proposition bank-style annotations for ~500 biomedical journal abstracts

Czech Broadcast Conversation Speech ~40 hours of Czech broadcast conversation

Czech Broadcast Conversation MDE Transcripts ~transcribed and annotated Czech broadcast conversation speech

Spanish Gigaword Second Edition ~comprehensive archive of Spanish news text acquired by LDC

GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 ~240K characters of Chinese newsgroup text and its translation from 25 sources

Tagged Chinese Gigaword Version 2.0 ~part of speech tagged Chinese news text

2008 CoNLL Shared Task Data ~syntactic and semantic dependencies for Treebank-3 (LDC99T42) data

English Gigaword Fourth Edition ~comprehensive archive of English news text acquired by LDC

GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 ~145K words of Arabic newsgroup text and its English translation from thirty-five sources

An English Dictionary of the Tamil Verb Second Edition ~ contains translations for 6597 English verbs and defines 9716 Tamil verbs

Japanese Web N-gram Version 1 ~Japanese "word" n-grams and their observed frequency counts

2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data ~data, reference translations and software used for NIST MetricsMATR

GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 ~transcripts and English translations of twenty-four hours of Chinese broadcast conversation programming

Unified Linguistic Annotation Text Collection ~effort to create a unified framework for different layers of annotation; available for free

Audiovisual Database of Spoken American English ~ seven hours of read speech from fourteen participants

GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 ~178K words of Arabic newsgroup text and its English translation from thirty-five sources

CSLU: Numbers Version 1.3 ~fifteen hours of speech including isolated and continuous digit strings, and ordinal/cardinal numbers.

English CTS Treebank with Structural Metadata ~ metadata and syntactic structure annotations for 144 English telephone conversations

GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 ~ transcripts and English translations of over twenty hours of Chinese broadcast conversation programming


2008 Releases

AQUAINT-2 Information-Retrieval Text Research Collection ~ data used for AQUAINT Question-Answer (QA) track.

Global Yoruba Lexical Database v. 1.0 ~set of related dictionaries with definitions and translations

CHAracterizing INdividual Speakers (CHAINS) ~read speech by thirty-six English speakers in different speaking styles

PennBioIE CYP 1.0 ~ 1100 medical abstracts with Penn Treebank-style annotation and biomedical entity tagging

PennBioIE Oncology 1.0 ~ 1400 medical abstracts with Penn Treebank-style annotation and biomedical entity tagging

Czech Academic Corpus 2.0 ~ 650K words manually annotated for morphology and syntax.

The New York Times Annotated Corpus ~summarized, indexed, algorithmically-tagged articles

CALLHOME Mandarin Chinese Transcripts - XML Version ~120 transcripts in XML format with retokenization and POS tagging

CSLU: ISOLET Spoken Letter Database Version 1.3 ~ letters of the English alphabet spoken in isolation by 150 speakers

GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 ~ transcripts and English translations of 19 hours of Chinese broadcast news

BLLIP North American News Text, Complete and General Release ~ Penn Treebank-style parsing of over 20 million sentences from the North American News Text Corpus

North American News Text Complete and General Release ~reissue of English news text to complement BLLIP North American News Text

CSLU: Alphadigit Version 1.3 ~ six-digit strings of letters and digits recorded over the telephone from 3025 speakers

GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 ~transcripts and English translations of 22 hours of Chinese broadcast news


2005 NIST Language Recognition Evaluation ~ data, answer keys, and scoring script for the 2005 NIST LRE

GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 ~transcripts and English translations of 10 hours of Arabic broadcast news selected from a variety of sources

Chinese Proposition Bank 2.0 ~predicate-argument annotation of the Chinese Treebank 6.0

Hindi WordNet ~consists of 56K unique words and 26K synsets.

West Point Brazilian Portuguese Speech ~read speech from native and non-native speakers.

An English Dictionary of the Tamil Verb ~contains translations for 6597 English verbs and defines 9716 Tamil verbs

GALE Phase 1 Chinese Blog Parallel Text ~313K characters of Chinese blog text and its translation from eight sources

CSLU: National Cellular Telephone Speech Release 2.3 ~approximately one minute of transcribed speech from 2336 speakers throughout the US

GALE Phase 1 Arabic Blog Parallel Text ~102K words of Arabic blog text and its English translation from thirty-three sources

STC-TIMIT 1.0 ~entire TIMIT database recorded through a single telephone channel

OntoNotes Release 2.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text

Penn Discourse Treebank Version 2.0 ~Wall Street Journal text annotated with discourse relations and their arguments

ACE 2005 English SpatialML Annotations ~newswire text annotated for spatial expressions

CSLU: Portland Cellular Telephone Speech Version 1.3 ~cellular telephone speech with orthographic and phonetic transcription

Hungarian-English Parallel Text, Version 1.0 ~approximately two million sentence pairs plus additional resources for Hungarian


2007 Releases


2004 Spring NIST Rich Transcription (RT-04S) Development Data ~development data used for speech-to-text and metadata extraction tasks

Chinese Treebank 6.0 (CTB 6.0) ~780K words with POS tagging and syntactic bracketing

Arabic Gigaword Third Edition ~comprehensive archive of Arabic news text acquired by the LDC

CSLU: Kids' Speech Version 1.1 ~transcribed read and free response speech

GALE Phase 1 Distillation Training ~English, Chinese and/or Arabic queries and responses for the GALE Distillation task

2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks

MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) ~translated and aligned broadcast news transcripts

CSLU: Apple Words and Phrases ~telephone speech from over 3000 callers

GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 23 hours of Chinese broadcast news selected from a variety of sources

Nationwide Speech Project ~read speech representing the primary regional varieties of American English

Chinese Gigaword Third Edition ~comprehensive archive of Chinese news text acquired by the LDC

2003 NIST Rich Transcription Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks

CSLU: Yes/No Version 1.2 ~18,000 speakers saying "yes" or "no" in response to various questions

GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 17 hours of Arabic broadcast news selected from a variety of sources

Mandarin Affective Speech ~read speech in five different emotional states

2001 Topic Annotated Enron Email Data Set ~manually indexed email data set

Tagged Chinese Gigaword ~news text annotated with full POS tags

CSLU: Foreign Accented English Release 1.2 ~free response English speech by native speakers of 22 languages

English Gigaword Third Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC

OntoNotes v 1.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text

ISI Chinese-English Automatically Extracted Parallel Text ~over 500K sentence pairs from newswire sources

TRECVID 2003 Keyframes & Transcripts ~keyframes extracted from English language broadcast programming

Fisher Levantine Arabic Conversational Telephone Speech and Transcripts ~279 transcribed telephone conversations totaling 45 hours of speech

TRECVID 2005 Keyframes & Transcripts ~keyframes extracted from Arabic, Chinese and English language broadcast programming

ARL Urdu Speech Database, Training Data ~transcribed read speech from 200 native speakers

ISI Arabic-English Automatically Extracted Parallel Text ~over 1 million sentence pairs from newswire sources

English Chinese Translation Treebank v. 1.0 ~English translation, part-of-speech tagged and treebanked

Levantine Arabic Conversational Telephone Speech, Transcripts ~transcribed conversations from over 900 speakers




Contact ldc@ldc.upenn.edu
Last modified: Friday, 20-Nov-2009 16:22:55 EST
© 1992-2009 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.