New Corpora

Central Asian Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- Central Asian: developed by LDC, 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto, collected primarily to support research and technology evaluation in automatic language identification

Amharic Text Resources: LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text: developed by LDC, 25 million words of monolingual Amharic text – 600,000 of which are translated into English, with another 80,000 words translated from English into Amharic – collected from discussion forums, news, reference material, social networks and weblogs for building human language technology for low-resource languages in the context of emergent situations such as natural disasters or disease outbreaks

TAC KBP English Source Material: TAC KBP Comprehensive English Source Corpora 2009-2014: developed by LDC, 3,877,207 English source documents – newswire, broadcast material and web text – used to support TAC KBP tasks from 2009 to 2014

Tok Pisin Speech: IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e: developed by Appen, 200 hours of Tok Pisin conversational and scripted telephone speech with transcripts, collected in 2013 from speakers aged 16 to 65 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Spanish Treebank Annotation: DEFT Spanish Treebank: developed by LDC and the Language and Computation Center at the University of Barcelona, 114 files (54,394 tokens) of international Spanish newswire data and 60 files (55,307 tokens) of Latin American Spanish discussion forum data annotated with constituents and syntactic functions

Apartment Speech with Background Noise: DIRHA English WSJ Audio: developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, 85 hours of real and simulated read speech from WSJ text by six American English speakers collected in a real apartment setting with typical domestic background noise and inter/intra-room reverberation effects

Chinese-French Parallel Text: TRAD Chinese-French Parallel Text – Blog: developed by ELDA as part of the PEA-TRAD project, French translations of a subset of approximately 10,000 Chinese words from LDC’s GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06) collected for the TRAD project to develop speech-to-speech translation technology for multiple languages from a variety of domains

2007 CoNLL Shared Task: dependency treebanks used in the CoNLL 2007 shared task on multilingual dependency parsing and its domain adaptation track; the source data consists principally of various texts, e.g., textbooks, news, literature

Noisy WSJ Read Speech: CHiME3: developed as part of The 3rd CHiME Speech Separation and Recognition Challenge, 342 hours of read speech and transcripts from Wall Street Journal text recorded in noisy environments (on a bus, in a cafe, in a pedestrian area, and at a street junction) and noisy utterances generated by artificially mixing clean speech data with noisy backgrounds; 50 hours of noisy environment audio are also included

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts