New Corpora

Tokenized and Tagged Text for Shallow Discourse Parsing: 2015-2016 CoNLL Shared Task: Chinese and English training, development and test data for the 2015 and 2016 CoNLL Shared Task Evaluation, focused on shallow discourse parsing, containing tokenized, tagged, and parsed text in English and Chinese

Zulu Speech: IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e: developed by Appen, 211 hours of Zulu conversational and scripted telephone speech with transcripts collected in 2012 and 2013 from speakers aged 16 to 70 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Recordings of Speaker Variation in Effort and Style: SRI-FRTIV: developed by SRI International in 2007-2008, 232 hours of English speech from 34 participants at three different levels of effort (low, normal and high) in four different styles (interview, conversation, reading and oration) to study how intrinsic variations -- associated with the speaker rather than the recording environment -- affect text-independent speaker verification

Interviews of Flint, Michigan Residents: Vehicle City Voices Corpus – Part I: developed at the University of Michigan-Flint, 16 hours of speech with corresponding transcripts from interviews of Flint residents conducted between 2012 and 2015, designed to provide high-quality recordings for acoustic analysis and to examine narrative structure and discursive construction of individual and collective identity in urban spaces

South Asian Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- South Asian: developed by LDC, 118 hours of telephone speech in five distinct language varieties of South Asia (i.e. the Indian sub-continent): Bengali, Hindi, Punjabi, Tamil and Urdu, labeled for gender, dialect type and noise

English Discussion Forums: BOLT English Discussion Forums: developed by LDC, 830,440 discussion forum threads harvested from the Internet using a combination of manual and automatic processes

Emotional Arabic Speech: KSUEmotions: developed by King Saud University, five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects reading MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question)

Multi-Issue Bargaining Speech and Transcripts: Metalogue Multi-Issue Bargaining Dialogue: developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development, 2.5 hours of semantically annotated English bargaining dialogue data that includes speech and transcripts

Tamil Speech: IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b: developed by Appen, 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 from speakers aged 16 to 65 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 4 Arabic Broadcast Conversation Speech: 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009

GALE Phase 4 Arabic Broadcast Conversation Transcripts: the complete set of corresponding transcripts, comprising 475,211 tokens in plain-text, tab-delimited format with UTF-8 encoding