New Corpora

Oromo Text, Annotations and Tools for Rapid Event Response: LORELEI Oromo Incident Language Pack: developed by LDC, contains all text data, annotations, supplemental resources and software tools for the Oromo language used in the DARPA LORELEI / LoReHLT 2017 Evaluation for building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks

Knowledge Base for LORELEI Entity Annotation: LORELEI Entity Detection and Linking Knowledge Base: the Knowledge Base (KB) developed by LDC for all LORELEI language pack entity linking annotation; drawn from GeoNames, the CIA World Leaders List, the CIA World Factbook and supplemented with manually-created KB entries

Translated Discussion Forum Treebank: BOLT English Translation Treebank - Chinese Discussion Forum: developed by LDC, 147,432 tokens of web discussion forum data translated from Chinese to English with part-of-speech and syntactic structure annotations following Penn Treebank II style

Chinese Telephone Collection: Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese: 25 hours of Mandarin Chinese telephone speech, manually labeled for gender, dialect type and noise, collected by LDC to support automatic language identification

SRE18 Development, Test Data: 2018 NIST Speaker Recognition Evaluation Test Set: developed by LDC and NIST, 396 hours of Tunisian Arabic telephone recordings and English web video speech with answer keys, trial and train files and documentation; speech data includes VOIP and audio from video for new training and test conditions   

Translated AMR Sentences: Abstract Meaning Representation 2.0 - Four Translations: developed by the University of Edinburgh School of Informatics; Spanish, German, Italian and Chinese translations of 5,484 sentences (1,371 sentences per language) from AMR 2.0 (LDC2017T10), a semantic treebank of over 39,000 English sentences from broadcast conversations, newswire, and web text

Temporal Analysis of English Text: TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013: queries, manual runs and final assessment results developed by LDC for the task of identifying and capturing temporal information in text  

Egyptian Arabic/English Annotated Telephone Speech: BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training: developed by LDC, 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations; source data consists of translated transcripts from LDC's CALLHOME and CALLFRIEND Egyptian Arabic collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49)

New Mixer Release: Mixer 4 and 5 Speech: developed by LDC, 14,185 hours of cross-channel audio recordings of conversational telephone speech, interviews, elicitation exercises and transcript readings from 616 American English speakers, collected in 2007 and used in the 2008 NIST Speaker Recognition Evaluation

Training and Evaluation Data for Distributional Semantic Models: EVALution: developed by The Hong Kong Polytechnic University, English and Mandarin Chinese data sets -- EVALution 1.0 and EVALution-Man -- containing semantic relations and metadata for training and evaluating distributional semantic models