New Corpora

English Conversation Transcripts: English Speed Networking Conversational Transcripts: developed by the University of the West of England, 388 transcripts of English face-to-face and instant messaging conversations about business ideas, collected in 2014 and 2015 from undergraduate students playing different power roles in order to examine the ways in which an individual's linguistic style is affected by social power and personality

Southern English Speech: Digital Archive of Southern Speech - NLP Version (DASS-NLP): developed by LDC as an alternate version of the Digital Archive of Southern Speech (DASS) (LDC2012S03), 366 hours of English speech data and metadata collected as part of the Linguistic Atlas Project, converted into 16 kHz, 16-bit FLAC-compressed WAV with normalized file names to facilitate automatic processing for human language technologies and natural language processing applications

Cantonese Speech: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c: developed by Appen, 215 hours of Cantonese conversational and scripted telephone speech and the corresponding transcripts collected in 2011 from speakers aged 16 to 67 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

New Chinese Treebank release: Chinese Treebank 9.0: 2,084,387 words annotated and parsed from various Chinese text sources including chat messages, transcribed telephone speech, newswire, government documents, magazine articles, weblogs, discussion forums and more, provided in four formats: raw text, word-segmented, POS-tagged, and syntactically bracketed

Mexican Spanish Speech: CHM150: developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico, 1.63 hours of Mexican Spanish microphone speech, associated transcripts and speaker metadata

Semantic Dependency Parsing: SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing: data, tools, system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP) for Chinese, Czech and English, conducted in conjunction with the International Workshop on Semantic Evaluation (SemEval) and developed by the SDP task organizers

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 4 Chinese Broadcast Conversation Speech: 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008

GALE Phase 4 Chinese Broadcast Conversation Transcripts: the complete set of corresponding transcripts including 2,259,952 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources LDC developed for the DARPA GALE program

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text: 76 source-translation document pairs, comprising 614,608 tokens of Chinese source text and its English translation

GALE Phase 4 Arabic Weblog Parallel Sentences: 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data