New Corpora

Chinese Discussion Forum Parallel Text: BOLT Chinese Discussion Forum Parallel Training Data: developed by LDC, 1,876,799 tokens of Chinese discussion forum data with their corresponding English translations

Swahili Speech: IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d: developed by Appen, 200 hours of Swahili conversational and scripted telephone speech collected from 2012 to 2014 from speakers ages 16 to 65 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Noisy English Speech: Noisy TIMIT Speech: developed by the Florida Institute of Technology, approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with additive white, pink, blue, red, violet and babble noise at levels ranging from 5 to 50 dB (decibels) in 5 dB steps

English Legal Briefs: First-Year Law Students' Court Memoranda: 197 English writing samples by law students, annotated with the student writers' accompanying survey responses and created to support natural language processing approaches for determining whether any differences in the briefs' language are attributable to the students' self-reported genders

Haitian Creole Speech: IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b: developed by Appen, 203 hours of Haitian Creole conversational and scripted telephone speech collected in 2012 and 2013 from speakers ages 16 to 75 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Arabic Pronunciation Dictionary: Arabic Speech Recognition Pronunciation Dictionary: developed by the Qatar Computing Research Institute, two million pronunciation entries for 526,000 Modern Standard Arabic words, for an average of 3.84 pronunciations for each grapheme word

Vietnamese Speech: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7: developed by Appen, 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 from speakers ages 16 to 64 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

English Compound Function Words: MWE-Aware English Dependency Corpus: developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory, English compound function words annotated in dependency format, derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19)
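
As a rough illustration of the Noisy TIMIT Speech entry above, the following sketch enumerates the noise conditions implied by six noise types and levels from 5 to 50 dB in 5 dB steps. It assumes that every noise type is paired with every level and that the roughly 322 hours are spread evenly across those pairs; the entry does not state either point, and the variable names are illustrative only.

```python
from itertools import product

# Noise types and level range as listed in the Noisy TIMIT Speech entry.
NOISE_TYPES = ["white", "pink", "blue", "red", "violet", "babble"]
LEVELS_DB = list(range(5, 51, 5))  # 5, 10, ..., 50 dB in 5 dB steps

# Approximate total duration quoted in the entry.
TOTAL_HOURS = 322

# Assumption: each noise type is combined with each level.
conditions = list(product(NOISE_TYPES, LEVELS_DB))

# Assumption: the total duration is split evenly across conditions.
hours_per_condition = TOTAL_HOURS / len(conditions)

print(len(LEVELS_DB))                  # number of distinct levels
print(len(conditions))                 # number of (noise type, level) pairs
print(round(hours_per_condition, 2))   # approximate hours per pair
```

Under these assumptions there are 10 levels and 60 type-level pairs, or a little over five hours of speech per pair.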

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 3 Arabic Broadcast News Speech Part 2: 128 hours of Arabic broadcast news speech collected in 2007

GALE Phase 3 Arabic Broadcast News Transcripts Part 2: the complete set of corresponding transcripts, comprising 721,846 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources developed by LDC for the DARPA GALE program

GALE English-Chinese Parallel Aligned Treebank – Training: 196,123 tokens of word-aligned English and Chinese parallel text with treebank annotations

GALE Phase 3 and 4 Chinese Web Parallel Text: 88 source-translation document pairs, comprising 67,514 tokens of Chinese source text and its English translation