New Corpora

South Asian Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- South Asian: developed by LDC, 118 hours of telephone speech in five distinct language varieties of South Asia (i.e. the Indian sub-continent): Bengali, Hindi, Punjabi, Tamil and Urdu, labeled for gender, dialect type and noise

English Discussion Forums: BOLT English Discussion Forums: developed by LDC, 830,440 discussion forum threads harvested from the Internet using a combination of manual and automatic processes

Emotional Arabic Speech: KSUEmotions: developed by King Saud University, five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects reading MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question)

Multi-Issue Bargaining Speech and Transcripts: Metalogue Multi-Issue Bargaining Dialogue: developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development, 2.5 hours of semantically annotated English bargaining dialogue data that includes speech and transcripts

Tamil Speech: IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b: developed by Appen, 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 from speakers ages 16 to 65 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

English Semantic Treebank: Abstract Meaning Representation (AMR) Annotation Release 2.0: developed by USC’s Information Sciences Institute, LDC, SDL/Language Weaver, Inc. and the University of Colorado, a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums

Noisy Utterances for Distant-Microphone ASR: CHiME2 WSJ0: developed as part of the 2nd CHiME Speech Separation and Recognition Challenge’s medium vocabulary track, approximately 166 hours of English utterances of Wall Street Journal news text set in a noisy living room environment

Laryngeal Video and Audio: UCLA High-Speed Laryngeal Video and Audio: developed by the UCLA Speech Processing and Auditory Perception Laboratory, high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects asked to sustain the vowel /i/ for approximately ten seconds while holding voice quality, fundamental frequency, and loudness as steady as possible

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 4 Arabic Broadcast Conversation Speech: 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009

GALE Phase 4 Arabic Broadcast Conversation Transcripts: the complete set of corresponding transcripts, comprising 475,211 tokens in plain-text, tab-delimited format with UTF-8 encoding