New Corpora

German Student Handwriting: H1 Children's Writing: developed by the Cooperative State University Baden-Württemberg, University of Education; 996 texts, with metadata, written over three months by 88 German elementary school children aged seven to eleven

User-generated Videos with Transcriptions: HAVIC Pilot Transcription: developed by LDC and NIST; approximately 72 hours of user-generated videos, with transcripts based on the English speech audio extracted from the videos

English Proxy Reports: DEFT Narrative Text: developed by LDC; proxy reports intended to mimic the format and other features of some types of government analyst reports, together with their corresponding English newswire source documents

GALE Broadcast Collection: Arabic/Chinese broadcast speech, with associated transcripts, collected by LDC for the DARPA GALE program

GALE Phase 3 Arabic Broadcast Conversation Speech Part 2: 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2: the complete set of corresponding transcripts, including 845,791 tokens, in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources developed by LDC for the DARPA GALE program

GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences: 170 source-translation document pairs, comprising 44,064 Arabic source words of translated data

GALE Phase 3 and 4 Arabic Web Parallel Text: 124 source-translation document pairs, comprising 61,662 tokens of Arabic source text and its English translation

GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text: 63 source-translation document pairs, comprising 487,466 tokens of Chinese source text and its English translation