New Corpora

Concrete Annotation Schema: Concretely Annotated English Gigaword: developed by Johns Hopkins University's Human Language Technology Center of Excellence, adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to English Gigaword Fifth Edition (LDC2011T07)

TAC KBP Evaluation and Training Data: TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014: developed by LDC for the Slot Filling evaluation track, which focused on mining information about entities from text; includes queries, manual runs and assessment results

Arabic-French Parallel Text: TRAD Arabic-French Parallel Text -- Newswire: developed by ELDA for the PEA-TRAD project, French translations of 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21)

Arabic, Chinese & English Web Text for Information Retrieval: BOLT Information Retrieval Comprehensive Training and Evaluation: all data produced by LDC in support of the DARPA BOLT IR task including annotations, source documents and scoring software

Amateur Web Video: HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation: 53 hours of user-generated videos with annotation and metadata developed for the HAVIC project and the NIST-sponsored Multimedia Event Detection task

Spanish Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- Spanish: 23 hours of telephone speech in Spanish collected by LDC to support research and technology evaluation in automatic language identification

Kazakh Speech: IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a: developed by Appen, 203 hours of Kazakh conversational and scripted telephone speech with transcripts, collected in 2013 and 2014 from speakers aged 16 to 64 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

English Informal Text: BOLT English SMS/Chat: 18,429 English SMS and Chat conversations totaling 3,674,802 words across 375,967 messages collected through data donations and live collection for the BOLT program

Mexican Spanish Broadcast Speech: CIEMPIESS Balance: 18 hours of Mexican Spanish broadcast speech and transcripts, intended to form a gender-balanced collection together with CIEMPIESS Light (LDC2017S23)

2011 LRE Speech with the Language Pair Condition: 2011 NIST Language Recognition Evaluation Test Set: collected by LDC, 204 hours of conversational telephone speech and broadcast narrow band audio in 24 languages and dialects; includes selected training data in addition to the test set