Linguistic Resources New Corpora Archive
Chronological list of our corpora releases for recent years. Please visit The LDC Corpus Catalog for a complete list of publications the LDC distributes.
2009 Releases
Chinese Gigaword Fourth Edition ~comprehensive archive of Chinese news text acquired by LDC
CSLU: S4X Release 1.2 ~36 speakers uttering 11 specified words
FactBank 1.0 ~news text with event mentions annotated with degree of factuality
Arabic Newswire English Translation Collection ~550K words of Arabic newswire text and its English translation
BioProp Version 1.0 ~proposition bank-style annotations for ~500 biomedical journal abstracts
Czech Broadcast Conversation Speech ~40 hours of Czech broadcast conversation
Czech Broadcast Conversation MDE Transcripts ~transcribed and annotated Czech broadcast conversation speech
Spanish Gigaword Second Edition ~comprehensive archive of Spanish news text acquired by LDC
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 ~240K characters of Chinese newsgroup text and its translation from 25 sources
Tagged Chinese Gigaword Version 2.0 ~part of speech tagged Chinese news text
2008 CoNLL Shared Task Data ~syntactic and semantic dependencies for Treebank-3 (LDC99T42) data
English Gigaword Fourth Edition ~comprehensive archive of English news text acquired by LDC
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 ~145K words of Arabic newsgroup text and its English translation from thirty-five sources
An English Dictionary of the Tamil Verb Second Edition ~ contains translations for 6597 English verbs and defines 9716 Tamil verbs
Japanese Web N-gram Version 1 ~Japanese "word" n-grams and their observed frequency counts
2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data ~data, reference translations and software used for NIST MetricsMATR
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 ~transcripts and English translations of twenty-four hours of Chinese broadcast conversation programming
Unified Linguistic Annotation Text Collection ~effort to create a unified framework for different layers of annotation; available for free
Audiovisual Database of Spoken American English ~ seven hours of read speech from fourteen participants
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 ~178K words of Arabic newsgroup text and its English translation from thirty-five sources
CSLU: Numbers Version 1.3 ~fifteen hours of speech including isolated and continuous digit strings, and ordinal/cardinal numbers.
English CTS Treebank with Structural Metadata ~ metadata and syntactic structure annotations for 144 English telephone conversations
GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 ~ transcripts and English translations of over twenty hours of Chinese broadcast conversation programming
2008 Releases
AQUAINT-2 Information-Retrieval Text Research Collection ~ data used for AQUAINT Question-Answer (QA) track.
Global Yoruba Lexical Database v. 1.0 ~set of related dictionaries with definitions and translations
CHAracterizing INdividual Speakers (CHAINS) ~read speech by thirty-six English speakers in different speaking styles
PennBioIE CYP 1.0 ~ 1100 medical abstracts with Penn Treebank-style annotation and biomedical entity tagging
PennBioIE Oncology 1.0 ~ 1400 medical abstracts with Penn Treebank-style annotation and biomedical entity tagging
Czech Academic Corpus 2.0 ~ 650K words manually annotated for morphology and syntax.
The New York Times Annotated Corpus ~summarized, indexed, algorithmically-tagged articles
CALLHOME Mandarin Chinese Transcripts - XML Version ~120 transcripts in XML format with retokenization and POS tagging
CSLU: ISOLET Spoken Letter Database Version 1.3 ~ letters of the English alphabet spoken in isolation by 150 speakers
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 ~ transcripts and English translations of 19 hours of Chinese broadcast news
BLLIP North American News Text, Complete and General Release ~ Penn Treebank-style parsing of over 20 million sentences from the North American News Text Corpus
North American News Text Complete and General Release ~reissue of English news text to complement BLLIP North American News Text
CSLU: Alphadigit Version 1.3 ~ six-digit strings of letters and digits recorded over the telephone from 3025 speakers
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 ~transcripts and English translations of 22 hours of Chinese broadcast news
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 ~transcripts and English translations of 10 hours of Arabic broadcast news selected from a variety of sources
Chinese Proposition Bank 2.0 ~predicate-argument annotation of the Chinese Treebank 6.0
Hindi WordNet ~consists of 56K unique words and 26K synsets.
West Point Brazilian Portuguese Speech ~read speech from native and non-native speakers.
An English Dictionary of the Tamil Verb ~contains translations for 6597 English verbs and defines 9716 Tamil verbs
GALE Phase 1 Chinese Blog Parallel Text ~313K characters of Chinese blog text and its translation from eight sources
CSLU: National Cellular Telephone Speech Release 2.3 ~approximately one minute of transcribed speech from 2336 speakers throughout the US
GALE Phase 1 Arabic Blog Parallel Text ~102K words of Arabic blog text and its English translation from thirty-three sources
STC-TIMIT 1.0 ~entire TIMIT database recorded through a single telephone channel
OntoNotes Release 2.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text
Penn Discourse Treebank Version 2.0 ~Wall Street Journal text annotated with discourse relations and their arguments
ACE 2005 English SpatialML Annotations ~newswire text annotated for spatial expressions
CSLU: Portland Cellular Telephone Speech Version 1.3 ~cellular telephone speech with orthographic and phonetic transcription
Hungarian-English Parallel Text, Version 1.0 ~approximately two million sentence pairs plus additional resources for Hungarian
2007 Releases
2004 Spring NIST Rich Transcription (RT-04S) Development Data ~development data used for speech-to-text and metadata extraction tasks
Chinese Treebank 6.0 (CTB 6.0) ~780K words with POS-tagging and syntactic bracketing
Arabic Gigaword Third Edition ~comprehensive archive of Arabic news text acquired by the LDC
CSLU: Kids' Speech Version 1.1 ~transcribed read and free response speech
GALE Phase 1 Distillation Training ~English, Chinese and/or Arabic queries and responses for the GALE Distillation task
2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks
MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) ~translated and aligned broadcast news transcripts
CSLU: Apple Words and Phrases ~telephone speech from over 3000 callers
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 23 hours of Chinese broadcast news selected from a variety of sources
Nationwide Speech Project ~read speech representing the primary regional varieties of American English
Chinese Gigaword Third Edition ~comprehensive archive of Chinese news text acquired by the LDC
2003 NIST Rich Transcription Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks
CSLU: Yes/No Version 1.2 ~18,000 speakers saying "yes" or "no" in response to various questions
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 17 hours of Arabic broadcast news selected from a variety of sources
Mandarin Affective Speech ~read speech in five different emotional states
2001 Topic Annotated Enron Email Data Set ~manually indexed email data set
Tagged Chinese Gigaword ~news text annotated with full POS tags
CSLU: Foreign Accented English Release 1.2 ~free response English speech by native speakers of 22 languages
English Gigaword Third Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC
OntoNotes v 1.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text
ISI Chinese-English Automatically Extracted Parallel Text ~over 500K sentence pairs from newswire sources
TRECVID 2003 Keyframes & Transcripts ~keyframes extracted from English language broadcast programming
Fisher Levantine Arabic Conversational Telephone Speech and Transcripts ~279 transcribed telephone conversations totaling 45 hours of speech
TRECVID 2005 Keyframes & Transcripts ~keyframes extracted from Arabic, Chinese and English language broadcast programming
ARL Urdu Speech Database, Training Data ~transcribed read speech from 200 native speakers
ISI Arabic-English Automatically Extracted Parallel Text ~over 1 million sentence pairs from newswire sources
English Chinese Translation Treebank v. 1.0 ~English translation, part-of-speech tagged and treebanked
Levantine Arabic Conversational Telephone Speech, Transcripts ~transcribed conversations from over 900 speakers