New Corpora

Transcribed Persian Speech: Corpus of Conversational Persian Transcripts: transcripts of 20 hours of informal telephone and face-to-face conversations in the Tehrani dialect of Iranian Persian, annotated for gender, age, recording method and setting

TAC KBP Evaluation Documents: TAC KBP Evaluation Source Corpora 2016-2017: 180,003 Chinese, English and Spanish discussion forum and newswire texts collected by LDC for the 2016 and 2017 TAC KBP evaluation tracks, with resources for recreating specific test sets

East Asian Telephone Collection: Multi-Language Conversational Telephone Speech 2011 -- East Asian: 19 hours of telephone speech in Thai and Lao, labeled for gender, dialect type and noise, collected by LDC to support automatic language identification

Igbo Speech: IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c: developed by Appen, 207 hours of Igbo conversational and scripted telephone speech with transcripts, collected in 2014 and 2015 from speakers aged 16 to 67 using different telephones in various environments, including the street, a home or office, a public place, and inside a vehicle

Chinese Articulography and Speech: The DKU-JNU-EMA Electromagnetic Articulography Database: developed by Duke Kunshan University and Jinan University, 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka and Teochew Chinese; subjects read complete sentences, short texts and related words sharing a common consonant, vowel or tone while wearing sensors placed in parts of their mouth

English Anaphoric Coreference: Phrase Detectives Corpus Version 2: developed at the University of Essex, 407,000 tokens across 537 documents anaphorically annotated by the online Phrase Detectives Game, updating the first version (LDC2017T08) with more annotated tokens, player judgments and annotation based on the probabilistic aggregation method for anaphoric information

Diarization Challenge Evaluation Data, Annotation, Scoring Tool: First DIHARD Challenge Evaluation - Nine Sources, 18 hours of English and Chinese speech from a wide sampling of domains; First DIHARD Challenge Evaluation – SEEDLingS, two hours of English child language recordings; together comprising the evaluation set audio data, annotation and official scoring tool for the First DIHARD Challenge organized by LDC, Baidu, Laboratoire de Sciences Cognitives et Psycholinguistique, University of Science and Technology of China and Indian Institute of Science

Spanish Text Annotated for Committed Belief: DEFT Spanish Committed Belief Annotation: developed by LDC, 67,000 tokens of Spanish discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text
 
MALACH English for Speech Recognition: USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition: developed by IBM, updates and enhances a subset of LDC2012S05 for use with speech recognition systems, such as the Kaldi toolkit, with new audio and transcript formats, a lexicon and development/test set division covering 168 hours of interviews from 682 Holocaust witnesses

Diarization Challenge Development Data, Annotation, Scoring Tool: First DIHARD Challenge Development - Eight Sources, 17 hours of English and Chinese speech from multispeaker environments; First DIHARD Challenge Development - SEEDLingS, two hours of English child recordings; together comprising the development set audio data, annotation (diarization, segmentation), and official scoring tool for the First DIHARD Challenge organized by LDC, Baidu, Laboratoire de Sciences Cognitives et Psycholinguistique, University of Science and Technology of China and Indian Institute of Science