New Corpora

Linguistic Resources  

New Corpora Archive


Chronological list of our corpora releases for recent years. Please visit The LDC Corpus Catalog for a complete list of publications that LDC distributes.

2013 Releases

Chinese Proposition Bank 3.0 ~ adds predicate-argument annotation on 187,731 words from Chinese Treebank 7.0

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 ~ 115K tokens of word aligned Arabic and English parallel text with treebank annotations.

GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 ~ 21 source-translation document pairs of Chinese source text and its English translation

Greybeard ~ 590 hours of phone calls from Greybeard and legacy collections

Manually Annotated Sub-Corpus Third Release ~ 500K words of American English written and spoken data annotated for a wide variety of linguistic phenomena

GALE Arabic-English Parallel Aligned Treebank -- Newswire ~ ~267K tokens of word aligned Arabic and English parallel text with treebank annotations

MADCAT Phase 2 Training Set ~ handwritten Arabic documents annotated for physical coordinates and token

GALE Phase 2 Chinese Broadcast Conversation Speech ~ 120 hours of Chinese broadcast conversation speech

GALE Phase 2 Chinese Broadcast Conversation Transcripts ~ 1.5 million transcribed Chinese broadcast conversation data tokens

NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets ~ evaluation sets, DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English progress test sets

1993-2007 United Nations Parallel Text ~ ~673K raw text documents and 520K word alignment documents in the official languages of the UN

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web ~ 158K tokens of word aligned Chinese and English parallel text enriched with linguistic tags

GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 ~ 123 hours of Arabic broadcast conversation speech

GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 ~ 752K transcribed Arabic broadcast conversation data tokens

NIST 2012 Open Machine Translation (OpenMT) Evaluation ~ 222 Chinese newswire and web data documents with corresponding source and reference files

Chinese-English Biology and Chemistry Abstract Parallel Text ~ ~2000 parallel sentences from scientific article abstracts published in Mandarin and translated into English

GALE Phase 2 Arabic Web Parallel Text ~ 42K words of Arabic source text and its English translation

2012 Releases

GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web ~ 154K tokens of word aligned Chinese and English parallel text enriched with linguistic tags

Russian-English Computer Security Parallel Text ~ 6000 parallel sentences from a set of computer security reports published in Russian and translated into English

Annotated English Gigaword ~ adds automatically-generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07)

Chinese-English Semiconductor Parallel Text ~ ~2000 parallel sentences from abstracts of scientific articles on semiconductors published in Mandarin and translated into English

GALE Phase 2 Arabic Newswire Parallel Text ~ ~400 source-translation pairs, comprising 181K tokens of Arabic source text and its English translation

GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire ~ 169K tokens of word aligned Chinese and English parallel text enriched with linguistic tags

GALE Phase 2 Arabic Broadcast News Parallel Text ~ 29K of Arabic source text and its English translation

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web ~ 150K tokens of word aligned Chinese and English parallel text enriched with linguistic tags

MADCAT Phase 1 Training Set ~ handwritten Arabic documents annotated for physical coordinates and token.

English Web Treebank ~ 250K words of English web text manually annotated for syntactic structure; first 50 copies available at no-cost.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 ~ 169K words of Arabic source text and its English translation.

Spanish TimeBank 1.0 ~ stand-off annotations for 210 documents with over 75K tokens; available at no-cost

American English Nickname Collection ~ 331K distinct mappings between nicknames and given names; available at no-cost.

Arabic Treebank - Broadcast News v1.0 ~ 120 transcribed Arabic broadcast news stories with part-of-speech, morphology, gloss and syntactic tree annotation.

Catalan TimeBank 1.0 ~ stand-off annotations for 210 documents with over 75K tokens; available at no-cost.

Arabic-Dialect/English Parallel Text ~ 3.5 million tokens of Arabic dialect sentences and their English translations.

Prague Czech-English Dependency Treebank 2.0 ~ Czech-English parallel resources annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution.

Chinese Dependency Treebank 1.0 ~ 49K Chinese sentences annotated with syntactic dependency structures.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 ~ 169K words of Arabic broadcast conversation source text and corresponding English translations.

Turkish Broadcast News Speech and Transcripts ~ 130 hours of VOA Turkish radio broadcasts and corresponding transcripts.

2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News ~ 60 hours of English broadcast news video data annotated for 2005 VACE tasks.

2009 CoNLL Shared Task Part 1 ~ Catalan, Czech, German and Spanish data used for 2009 CoNLL.

2009 CoNLL Shared Task Part 2 ~ Chinese and English data used for 2009 CoNLL.

USC-SFI MALACH Interviews and Transcripts English ~ 375 hours of interviews from 784 interviewees along with transcripts.

English Translation Treebank: An-Nahar Newswire ~ 599 newswire stories translated from Arabic to English and annotated for POS and syntactic structure.

Digital Archive of Southern Speech ~ 370 hours of English speech data from 30 female speakers and 34 male speakers

ModeS TimeBank 1.0 ~ Modern Spanish text annotated with TimeML and SpatialML mark-up.

2006 NIST Speaker Recognition Evaluation Test Set Part 2 ~ 568 hours of multilingual conversational telephone and microphone speech

TORGO Database of Dysarthric Articulation ~ 23 hours of transcribed English speech data from dysarthric and non-dysarthric speakers



Contact ldc@ldc.upenn.edu
Last modified: Wednesday, 18-Sep-2013 12:16:44 EDT
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.