New Corpora

Concrete Annotation Schema: Concretely Annotated New York Times: developed by Johns Hopkins University's HLTCOE, 1.8 million articles from the New York Times Annotated Corpus (LDC2008T19) with multiple kinds and instances of automatically-generated syntactic, semantic, and coreference annotations; includes multiple tool outputs producing the same annotation types as different annotation theories under a shared tokenization

German Children’s Handwriting: H2, E2, ERK1 Children's Writing: developed by the Cooperative State University Baden-Württemberg, University of Education, 2,000 texts by 173 German school children aged six through eleven, written over four months in regular class settings, with metadata about the school environment and the student participants

Arabic-French Parallel Text: TRAD Arabic-French Parallel Text -- Newsgroup: developed by ELDA as part of the PEA-TRAD project, French translations of a subset of approximately 10,000 Arabic words from LDC’s GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03); the purpose of the TRAD project was to develop speech-to-speech translation technology for multiple languages from a variety of domains

Egyptian Arabic Informal Web Data: BOLT Arabic Discussion Forums: developed by LDC, 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes

Somali Text Resources: LORELEI Somali Representative Language Pack - Monolingual and Parallel Text: developed by LDC, 13 million words of monolingual Somali text – 800,000 of which are translated into English and another 100,000 words translated from English into Somali – collected from discussion forums, news, reference, social network and weblog for building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks

Annotated Parse Trees and Alignment: SPADE (Syntactic Phrase Alignment Dataset for Evaluation): annotated parse trees and alignment on English sentential paraphrases extracted from LDC data sets used in NIST’s OpenMT evaluation series and separated into development and test sets; contains 20,276 phrases extracted from 201 sentential paraphrases and 15,721 paraphrase alignments

Central Asian Telephone Speech: Multi-Language Conversational Telephone Speech 2011 -- Central Asian: developed by LDC, 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto, collected primarily to support research and technology evaluation in automatic language identification

Amharic Text Resources: LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text: developed by LDC, 25 million words of monolingual Amharic text – 600,000 of which are translated into English and another 80,000 words translated from English into Amharic – collected from discussion forums, news, reference, social network and weblog for building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks

TAC KBP English Source Material: TAC KBP Comprehensive English Source Corpora 2009-2014: developed by LDC, 3,877,207 English source documents – newswire, broadcast material and web text – used to support TAC KBP tasks from 2009-2014

Tok Pisin Speech: IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e: developed by Appen, 200 hours of Tok Pisin conversational and scripted telephone speech with transcripts, collected in 2013 from speakers aged 16 to 65 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle