New Corpora

Mandarin Chinese Telephone Speech: CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition: 24 hours of unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in Mainland China, with audio files in .wav format, a simplified directory structure and additional documentation and metadata

Degraded Audio for Language Identification (LID): RATS Language Identification: developed by LDC, 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech recorded over eight transceiver configurations with annotation of speech segments including start time, end time, speech activity detection (SAD) label, SAD provenance, LID and LID provenance created for the LID task in the DARPA RATS program

Chinese-French Parallel Text: TRAD Chinese-French Parallel Text – Broadcast News: developed by ELDA as part of the PEA-TRAD project, French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 (LDC2008T18)

Chinese Informal Text: BOLT Chinese SMS/Chat: 14,877 Chinese SMS and Chat conversations totaling 3,005,810 words across 497,543 messages collected by LDC through data donations and live collection for the BOLT program

Central European Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- Central European: 44 hours of telephone speech in two distinct language varieties of Central Europe, Czech and Slovak, collected by LDC primarily to support research and technology evaluation in automatic language identification

TAC KBP Evaluation and Training Data: TAC KBP Entity Linking: training and evaluation data produced by LDC in support of the TAC KBP English Entity Linking tasks from 2009-2013 including queries and gold standard entity type information, Knowledge Base links, equivalence class clusters for NIL entities, source documents for the queries – English newswire, discussion forum, and web data – along with the results of an Entity Linking Inter-Annotator Agreement study conducted in 2010

Cebuano Speech: IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b: developed by Appen, 191 hours of Cebuano conversational and scripted telephone speech with transcripts collected in 2013 and 2014 from speakers aged 16 to 75 years old using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Annotated English Speech: Rhythm and Pitch: 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme, which permits the capture of both intonational and rhythmic aspects of speech. Four labeling tiers (used for annotating speech prosody) carry information about the syllabic organization and orthography of the speech, its rhythmic structure, tonal patterns, and other information

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 4 Arabic Broadcast News Speech: 37 hours of Arabic broadcast news speech collected in 2008 and 2009

GALE Phase 4 Arabic Broadcast News Transcripts: the complete set of corresponding transcripts including 204,735 tokens in plain-text, tab-delimited format with UTF-8 encoding