New Corpora

Annotated English Speech: Rhythm and Pitch: 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme, which permits the capture of both intonational and rhythmic aspects of speech. Four labeling tiers (used for annotating speech prosody) carry information about the syllabic organization and orthography of the speech, its rhythmic structure, tonal patterns, and other information

Concrete Annotation Schema: Concretely Annotated New York Times: developed by Johns Hopkins University's HLTCOE, 1.8 million articles from the New York Times Annotated Corpus (LDC2008T19) with multiple kinds and instances of automatically-generated syntactic, semantic, and coreference annotations; where multiple tools produce the same annotation type, their outputs are included as different annotation theories under a shared tokenization

German Children’s Handwriting: H2, E2, ERK1 Children's Writing: developed by the Cooperative State University Baden-Württemberg and the University of Education, 2,000 texts by 173 German school children aged six through eleven, written over four months in regular class settings, with metadata about the school environment and the student participants

Arabic-French Parallel Text: TRAD Arabic-French Parallel Text -- Newsgroup: developed by ELDA as part of the PEA-TRAD project, French translations of a subset of approximately 10,000 Arabic words from LDC’s GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03); the purpose of the TRAD project was to develop speech-to-speech translation technology for multiple languages from a variety of domains

Egyptian Arabic Informal Web Data: BOLT Arabic Discussion Forums: developed by LDC, 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes

Somali Text Resources: LORELEI Somali Representative Language Pack - Monolingual and Parallel Text: developed by LDC, 13 million words of monolingual Somali text, 800,000 of which are translated into English, plus another 100,000 words translated from English into Somali; collected from discussion forum, news, reference, social network and weblog sources for building human language technology for low-resource languages in the context of emergent situations like natural disasters or disease outbreaks

Annotated Parse Trees and Alignment: SPADE (Syntactic Phrase Alignment Dataset for Evaluation): annotated parse trees and alignment on English sentential paraphrases extracted from LDC data sets used in NIST’s OpenMT evaluation series and separated into development and test sets; contains 20,276 phrases extracted from 201 sentential paraphrases and 15,721 paraphrase alignments

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 4 Arabic Broadcast News Speech: 37 hours of Arabic broadcast news speech collected in 2008 and 2009

GALE Phase 4 Arabic Broadcast News Transcripts: the complete set of corresponding transcripts, comprising 204,735 tokens in plain-text, tab-delimited format with UTF-8 encoding