New Corpora

New Chinese Treebank release: Chinese Treebank 9.0: 2,084,387 words annotated and parsed from various Chinese text sources including chat messages, transcribed telephone speech, newswire, government documents, magazine articles, weblogs, discussion forums and more, provided in four formats: raw text, word segmented, POS-tagged, and syntactically bracketed

Mexican Spanish Speech: CHM150: developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico, 1.63 hours of Mexican Spanish microphone speech, associated transcripts and speaker metadata

Semantic Dependency Parsing: SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing: data, tools, system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP) for Chinese, Czech and English, conducted in conjunction with the International Workshop on Semantic Evaluation (SemEval) and developed by the SDP task organizers

German Student Handwriting: H1 Children's Writing: developed by the Cooperative State University Baden-Württemberg, University of Education, 996 texts with metadata, written over three months by 88 German elementary school children ages seven to eleven

User-generated Videos with Transcriptions: HAVIC Pilot Transcription: developed by LDC and NIST, approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 4 Chinese Broadcast Conversation Speech: 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008

GALE Phase 4 Chinese Broadcast Conversation Transcripts: the complete set of corresponding transcripts, comprising 2,259,952 tokens in tab-delimited plain-text format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources developed by LDC for the DARPA GALE program

GALE Phase 4 Arabic Weblog Parallel Sentences: 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data

GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences: 170 source-translation document pairs, comprising 44,064 words (Arabic source) of translated data