New Corpora

ASpIRE Challenge: ASpIRE Development and Development Test Sets: developed for the Automatic Speech Recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA, 226 hours of English speech (from Mixer 6 Speech (LDC2013S03)) with transcripts and scoring files; the challenge focused on innovative systems trained on conversational telephone speech that could work well on far-field microphone data from noisy, reverberant rooms

Mexican Spanish Broadcast Speech: CIEMPIESS Light: an updated version of CIEMPIESS (LDC2015S07), developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico, 18 hours of Mexican Spanish radio and television speech and associated transcripts to create acoustic models for automatic speech recognition; this corpus presents data in a revised directory structure that allows for use with the Kaldi toolkit

Kurdish Speech: IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a: developed by Appen, 203 hours of Kurmanji Kurdish conversational and scripted telephone speech with transcripts collected in 2013 and 2014 from speakers aged 16 to 70 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Cross-lingual Entity Linking with Knowledge Base References: TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training & Evaluation Data 2011-2014: training and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity Linking tasks in 2011, 2012, 2013 and 2014, including queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Chinese newswire, discussion forum and web data

Levantine Arabic and Farsi KWS Resources: RATS Keyword Spotting: 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts, and keywords generated from transcript content, created to provide training, development, and initial test sets for the keyword spotting (KWS) task in the DARPA RATS program

Propbank for English web data: English Web Treebank Propbank: developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research), predicate-argument structure annotation for English Web Treebank including semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses, and all nouns considered to be predicative

Annotated Ancient Chinese Text: Ancient Chinese Corpus: developed at Nanjing Normal University, word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC), including 180,000 Chinese characters and 195,000 segment units with words and punctuation

Dependency Annotation for MWEs and Named Entities: MWE-Aware English Dependency Corpus Version 2.0: developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory, English compound function words annotated in dependency format with added annotation of named entities (persons, locations, organizations), derived from OntoNotes Release 5.0

Tokenized and Tagged Text for Shallow Discourse Parsing: 2015-2016 CoNLL Shared Task: Chinese and English training, development and test data for the 2015 and 2016 CoNLL Shared Task Evaluation, focused on shallow discourse parsing, containing tokenized, tagged, and parsed text in English and Chinese

Zulu Speech: IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e: developed by Appen, 211 hours of Zulu conversational and scripted telephone speech with transcripts collected in 2012 and 2013 from speakers aged 16 to 70 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Recordings of Speaker Variation in Effort and Style: SRI-FRTIV: developed by SRI International in 2007-2008, 232 hours of English speech from 34 participants at three different levels of effort (low, normal and high) in four different styles (interview, conversation, reading and oration) to study how intrinsic variations -- associated with the speaker rather than the recording environment -- affect text-independent speaker verification

Interviews of Flint, Michigan Residents: Vehicle City Voices Corpus – Part I: developed at the University of Michigan-Flint, 16 hours of speech with corresponding transcripts from interviews of Flint residents conducted between 2012 and 2015, designed to provide high-quality recordings for acoustic analysis and to examine narrative structure and discursive construction of individual and collective identity in urban spaces