New Corpora

Spanish Text Annotated for Committed Belief: DEFT Spanish Committed Belief Annotation: developed by LDC, 67,000 tokens of Spanish discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text

MALACH English for Speech Recognition: USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition: developed by IBM, updates and enhances a subset of LDC2012S05 for use with speech recognition systems, such as the Kaldi toolkit, with new audio and transcript formats, a lexicon, and a development/test set division; covers 168 hours of interviews from 682 Holocaust witnesses

Diarization Challenge Development Data, Annotation, Scoring Tool: First DIHARD Challenge Development - Eight Sources, 17 hours of English and Chinese speech from multispeaker environments; First DIHARD Challenge Development - SEEDLingS, two hours of English child recordings; together comprising the development set audio data, annotation (diarization, segmentation), and official scoring tool for the First DIHARD Challenge organized by LDC, Baidu, Laboratoire de Sciences Cognitives et Psycholinguistique, University of Science and Technology of China and Indian Institute of Science

American and South Asian English Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- English Group: 18 hours of telephone speech in American and South Asian English, labeled for gender, dialect type and noise, collected by LDC to support automatic language identification

Mining Entity Information from Chinese Text: TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014: developed by LDC for the TAC KBP Chinese Regular Slot Filling evaluation track in 2014, includes queries, manual runs, final rounds of assessment results, and Chinese source documents

Mexican Spanish Speech for Acoustic Modeling: CIEMPIESS Experimentation: developed at the National Autonomous University of Mexico, 22 hours of Mexican Spanish broadcast and read speech with associated transcripts and tools for creating pronouncing dictionaries

Guarani Speech: IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c: developed by Appen, 198 hours of Guarani conversational and scripted telephone speech with transcripts collected in 2014 and 2015 from speakers aged 16 to 67 years old using different telephones in a variety of environments, including the street, a home or office, a public place, and inside a vehicle

Egyptian Arabic/English Word Alignment: BOLT Egyptian-English Word Alignment – Discussion Forum Training: developed by LDC, 400,448 words of Egyptian Arabic discussion forum data and English parallel text enhanced with linguistic tags to indicate word relations

Chinese AMR: Chinese Abstract Meaning Representation 1.0: developed by Brandeis University and Nanjing Normal University, semantic representations of 10,149 sentences from the weblog and discussion forum portions of Chinese Treebank 8.0 (LDC2013T21)

Amateur Videos for Event Detection Tasks: HAVIC MED Progress Test – Videos, Metadata and Annotation: developed by LDC for the HAVIC (Heterogeneous Audio Visual Internet Collection) project, 3,650 hours of web amateur videos with annotation and metadata to support event detection research