New Corpora

English Discussion Forum Treebank: BOLT English Treebank - Discussion Forum: developed by LDC, 268,907 tokens of English web discussion forum data with part-of-speech and syntactic structure annotations following Penn Treebank II style as revised in updated treebank guidelines

Polish Read Speech and Transcripts: Polish Speech Database: developed by VoiceLab, 280 hours (263,424 utterances) of Polish speech data from 200 speakers with corresponding transcripts, recorded by speakers reading from their home computers

SRE16 Development, Evaluation Data: 2016 NIST Speaker Recognition Evaluation Test Set: developed by LDC and NIST, 340 hours of short segments of Tagalog, Cantonese, Cebuano and Mandarin telephone speech constituting SRE16 development and test data with trial lists, associated keys, metadata, and evaluation documentation; source data from LDC’s Call My Net 2015 Corpus, in which native speakers used different handsets in various acoustic settings

Enhanced CALLFRIEND Canadian French: CALLFRIEND Canadian French Second Edition: developed by LDC, 26 hours of unscripted telephone conversations between native speakers of Canadian French, updated with audio files in .wav format, a simplified directory structure and additional documentation and metadata

Chinese/English Annotated SMS/Chat: BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training: developed by LDC, 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations

Sports Domain Data for Reading System Development: Machine Reading Phase 1 NFL Scoring Training Data: developed by LDC, 110 U.S. National Football League scoring source documents and 110 standoff annotation files defined with respect to an ontology, constituting the training data for the NFL Scoring Use Cases evaluation in the DARPA Machine Reading program

Transcribed Persian Speech: Corpus of Conversational Persian Transcripts: transcripts of 20 hours of informal telephone and face-to-face conversations in the Tehrani dialect of Iranian Persian, annotated for gender, age, recording method and setting

TAC KBP Evaluation Documents: TAC KBP Evaluation Source Corpora 2016-2017: 180,003 Chinese, English and Spanish discussion forum and newswire texts collected by LDC for the 2016 and 2017 TAC KBP evaluation tracks, with resources for recreating specific test sets

East Asian Telephone Collection: Multi-Language Conversational Telephone Speech 2011 -- East Asian: 19 hours of telephone speech in Thai and Lao, labeled for gender, dialect type and noise, collected by LDC to support automatic language identification

Igbo Speech: IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c: developed by Appen, 207 hours of Igbo conversational and scripted telephone speech with transcripts collected in 2014 and 2015 from speakers aged 16 to 67 using different telephones in various environments, including the street, a home or office, a public place, and a vehicle