New Corpora

English Discussion Forums Annotated for Committed Belief: DEFT English Committed Belief Annotation: developed by LDC, 950,000 words of English discussion forum text annotated to mark the level of commitment displayed by the author to the truth of the propositions expressed in the text

Enhanced CALLFRIEND American English-Non-Southern Dialect: CALLFRIEND American English-Non-Southern Dialect Second Edition: developed by LDC, 26 hours of unscripted telephone conversations between native speakers of non-Southern dialects of American English, updated with audio files in .wav format, a simplified directory structure and additional documentation and metadata

TAC KBP Cold Start Evaluation Documents: TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017: Chinese, English and Spanish newswire and web text source documents, queries, assessments, manual runs and final assessments developed by LDC for the Cold Start task, which required systems to identify all entities and to build a knowledge base of the relations between them

Amharic Speech: IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b: developed by Appen, 204 hours of Amharic conversational and scripted telephone speech with transcripts collected in 2014 from speakers aged 16 to 67 using different telephones in various environments, including the street, a home or office, a public place, and inside a vehicle

English Discussion Forum Treebank: BOLT English Treebank - Discussion Forum: developed by LDC, 268,907 tokens of English web discussion forum data with part-of-speech and syntactic structure annotations following Penn Treebank II style as revised by updated treebank guidelines

Polish Read Speech and Transcripts: Polish Speech Database: developed by VoiceLab, 280 hours (263,424 utterances) of Polish speech data from 200 speakers with corresponding transcripts; recordings were made by speakers reading from their home computers

SRE16 Development, Evaluation Data: 2016 NIST Speaker Recognition Evaluation Test Set: developed by LDC and NIST, 340 hours of short segments of Tagalog, Cantonese, Cebuano and Mandarin telephone speech constituting SRE16 development and test data with trial lists, associated keys, metadata, and evaluation documentation; source data from LDC’s Call My Net 2015 Corpus, in which native speakers used different handsets in various acoustic settings

Enhanced CALLFRIEND Canadian French: CALLFRIEND Canadian French Second Edition: developed by LDC, 26 hours of unscripted telephone conversations between native speakers of Canadian French, updated with audio files in .wav format, a simplified directory structure and additional documentation and metadata

Chinese/English Annotated SMS/Chat: BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training: developed by LDC, 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations

Sports Domain Data for Reading System Development: Machine Reading Phase 1 NFL Scoring Training Data: developed by LDC, 110 U.S. National Football League scoring source documents and 110 standoff annotation files defined with respect to an ontology, constituting the training data for the NFL Scoring Use Cases evaluation in the DARPA Machine Reading program