New Corpora

Arabic, Chinese & English Web Text for Information Retrieval: BOLT Information Retrieval Comprehensive Training and Evaluation: all data produced by LDC in support of the DARPA BOLT IR task including annotations, source documents and scoring software

Amateur Web Video: HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation: 53 hours of user-generated videos with annotation and metadata developed for the HAVIC project and the NIST-sponsored Multimedia Event Detection task

Spanish Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- Spanish: 23 hours of telephone speech in Spanish collected by LDC to support research and technology evaluation in automatic language identification

Kazakh Speech: IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a: developed by Appen, 203 hours of Kazakh conversational and scripted telephone speech with transcripts collected in 2013 and 2014 from speakers aged 16 to 64 years old using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

English Informal Text: BOLT English SMS/Chat: 18,429 English SMS and Chat conversations totaling 3,674,802 words across 375,967 messages collected through data donations and live collection for the BOLT program

Mexican Spanish Broadcast Speech: CIEMPIESS Balance: 18 hours of Mexican Spanish broadcast speech and transcripts, designed to form a gender-balanced collection when combined with CIEMPIESS Light (LDC2017S23)

2011 LRE Speech with the Language Pair Condition: 2011 NIST Language Recognition Evaluation Test Set: collected by LDC, 204 hours of conversational telephone speech and broadcast narrow band audio in 24 languages and dialects; includes selected training data in addition to the test set

Mandarin Chinese Telephone Speech: CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition: 24 hours of unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in Mainland China, with audio files in .wav format, a simplified directory structure and additional documentation and metadata

Degraded Audio for Language Identification (LID): RATS Language Identification: developed by LDC, 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech recorded over eight transceiver configurations, with speech segments annotated for start time, end time, speech activity detection (SAD) label, SAD provenance, LID and LID provenance; created for the LID task in the DARPA RATS program

Chinese-French Parallel Text: TRAD Chinese-French Parallel Text -- Broadcast News: developed by ELDA as part of the PEA-TRAD project, French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 (LDC2008T18)