New Corpora

2013 Data Pack: create a custom collection of eight corpora from among LDC’s 2013 publications, available to not-for-profit and government organizations through September 15, 2015 only

Penn Treebank Revised: English News Text Treebank: Penn Treebank Revised: developed by LDC with funding through a gift from Google Inc., automated and manual revisions (including revised tokenization, part-of-speech and syntactic treebank annotation) to the Penn Treebank Wall Street Journal data, covering 1,203,648 word-level tokens in 49,191 sentences across all 2,312 of the original PTB WSJ files

Turkish Web Text: TS Wikipedia: 1.6 million processed and tokenized Turkish Wikipedia pages, including part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams

Navigational Telephone Speech: The Walking Around Corpus: developed by Stony Brook University, 33 hours of navigational telephone dialogues from 72 speakers in pairs, including the visual materials used to elicit the dialogues and data about the speakers’ relationship, spatial abilities, memory performance and more

Mexican Spanish radio speech: CIEMPIESS: developed by the Speech Processing Laboratory at the National Autonomous University of Mexico, 18 hours of Mexican Spanish radio speech, associated transcripts, pronouncing dictionaries and language models

Signalling rhetorical relations: RST Signalling Corpus: developed at Simon Fraser University, annotations of text signals (although, because, thus), tense, lexical chains or punctuation added to the rhetorical mark-up of Wall Street Journal news articles in RST Discourse Treebank

Spanish and Catalan Lexicons: SenSem Lexicons: developed by GRIAL, feature descriptions for 1,300 Spanish verbs and 1,300 Catalan verbs codified manually – including definition, WordNet synset, Aktionsart, arguments and semantic function – or extracted automatically from the SenSem Databank

Penn Treebank-3 Annotation: Coordination Annotation for the Penn Treebank: developed by researchers at the University of Düsseldorf and Indiana University, stand-off annotation in UTF-8 plain text for the Wall Street Journal portion of Treebank-3, marking all tokens that have a coordinating function

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 3 Chinese Broadcast Conversation Speech Part 2: 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2: the complete set of corresponding transcripts, including 1,388,236 tokens, in tab-delimited plain-text format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources that LDC developed for the DARPA GALE program

GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences: 63,829 tokens of Chinese source text and its English translation