New Corpora

2013 Data Pack: create a custom collection of eight corpora from among LDC’s 2013 publications, available to not-for-profit and government organizations through September 15, 2015 only

Spanish and Catalan Lexicons: SenSem Lexicons: developed by GRIAL, feature descriptions for 1,300 Spanish verbs and 1,300 Catalan verbs codified manually – including definition, WordNet synset, Aktionsart, arguments and semantic function – or extracted automatically from the SenSem Databank

Penn Treebank-3 Annotation: Coordination Annotation for the Penn Treebank: developed by researchers at the University of Düsseldorf and Indiana University, stand-off annotation in UTF-8 plain text for the Wall Street Journal portion of Treebank-3, marking all tokens that have a coordinating function

Phonetic Segmentation and Tone Labels: Mandarin Chinese Phonetic Segmentation and Tone: 7,849 Mandarin Chinese utterances derived from broadcast recordings, segmented, labeled, and separated into training and test sets

Subglottal Resonances: The Subglottal Resonances Database: developed by Washington University and University of California Los Angeles, 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English for use in automatic speech recognition and in studies of speech production, perception and technology

Code Switching Speech: Mandarin-English Code-Switching in South-East Asia: developed by Nanyang Technological University and Universiti Sains Malaysia, 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts 

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 3 Chinese Broadcast Conversation Speech Part 2: 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2: the complete set of corresponding transcripts including 1,388,236 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources LDC developed for the DARPA GALE program

GALE Phase 3 and 4 Arabic Broadcast News Parallel Text: 86 source-translation document pairs comprising 325,538 words of Arabic source text and its English translation

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text: 55 source-translation document pairs comprising 280,535 words of Arabic source text and its English translation

GALE Chinese-English Parallel Aligned Treebank -- Training: 229,249 tokens of word-aligned Chinese and English parallel text with treebank annotations