New Corpora

Lao Speech: IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a: developed by Appen, 207 hours of Lao conversational and scripted telephone speech collected in 2013 from speakers ages 16 to 60 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Turkish Telephone Speech: Multi-Language Conversational Telephone Speech 2011 – Turkish: developed by LDC, 18 hours of telephone speech in Turkish labeled for gender, dialect type and noise

English Anaphoric Coreference: Phrase Detectives Corpus: developed by the University of Essex, 19,012 words across 40 documents anaphorically annotated through the Phrase Detectives game, an online interactive "game-with-a-purpose" designed to collect data about English anaphoric coreference

Civil Unrest News Text Annotated with Temporal Tags: The EventStatus Corpus: developed by researchers at Texas A&M University, Stanford University and The University of Utah, 3,000 English and 1,500 Spanish news articles about civil unrest events (protests, demonstrations, marches and strikes) annotated with temporal tags, appropriate for tasks such as event extraction and temporal question answering

SRE 2010 Telephone Speech: 2010 NIST Speaker Recognition Evaluation Test Set: developed by LDC, 2,255 hours of American English telephone speech and interview speech recorded over a microphone channel, including two-channel telephone excerpts of approximately 10 seconds and 5 minutes and microphone excerpts of 3 to 15 minutes in duration

Egyptian Arabic SMS/Chat Text: BOLT Egyptian Arabic SMS/Chat and Transliteration: developed by LDC, 5,691 conversations totaling 1,029,248 words across 262,026 messages natively written in either Arabic orthography or romanized Arabizi, including Arabizi conversations transliterated into standard Arabic orthography

Noisy Speech for Distant-Microphone ASR: Noisy English Speech: CHiME2 Grid: developed as part of the 2nd CHiME Speech Separation and Recognition Challenge, 120 hours of English speech from 34 speakers reading simple 6-word sequences in a noisy living room

Chinese Discussion Forum Parallel Text: BOLT Chinese Discussion Forum Parallel Training Data: developed by LDC, 1,876,799 tokens of Chinese discussion forum data with their corresponding English translations

Swahili Speech: IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d: developed by Appen, 200 hours of Swahili conversational and scripted telephone speech collected from 2012 to 2014 from speakers ages 16 to 65 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Noisy English Speech: Noisy TIMIT Speech: developed by the Florida Institute of Technology, approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with additive noise of several types (white, pink, blue, red, violet and babble) at levels varying in 5 dB (decibel) steps, ranging from 5 to 50 dB

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word aligned and tagged resources LDC developed for the DARPA GALE program

GALE English-Chinese Parallel Aligned Treebank – Training: 196,123 tokens of word aligned English and Chinese parallel text with treebank annotations