New Corpora

Assamese Speech: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a: developed by Appen, 205 hours of Assamese conversational and scripted telephone speech and the corresponding transcripts collected in 2012 and 2013 from speakers ages 16 to 66 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

Bengali Speech: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b: developed by Appen, 215 hours of Bengali conversational and scripted telephone speech and the corresponding transcripts collected in 2011 and 2012 from speakers ages 16 to 65 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

English Conversation Transcripts: English Speed Networking Conversational Transcripts: developed by the University of the West of England, 388 transcripts of English face-to-face and instant messaging conversations about business ideas, collected in 2014 and 2015 from undergraduate students playing different power roles in order to examine the ways in which an individual's linguistic style is affected by social power and personality

American Southern English Speech: Digital Archive of Southern Speech - NLP Version (DASS-NLP): developed by LDC as an alternate version of Digital Archive of Southern Speech (DASS) (LDC2012S03), 366 hours of English speech data and metadata collected as part of the Linguistic Atlas Project, converted to 16 kHz, 16-bit FLAC-compressed WAV with normalized file names to facilitate automatic processing for human language technologies and natural language processing applications

Cantonese Speech: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c: developed by Appen, 215 hours of Cantonese conversational and scripted telephone speech and the corresponding transcripts collected in 2011 from speakers ages 16 to 67 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

New Chinese Treebank release: Chinese Treebank 9.0: 2,084,387 words annotated and parsed from various Chinese text sources including chat messages, transcribed telephone speech, newswire, government documents, magazine articles, weblogs, discussion forums and more, provided in four formats: raw text, word-segmented, POS-tagged, and syntactically bracketed

Mexican Spanish Speech: CHM150: developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico, 1.63 hours of Mexican Spanish microphone speech, associated transcripts and speaker metadata

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 3 Arabic Broadcast News Speech Part 1: 132 hours of Arabic broadcast news speech collected in 2007

GALE Phase 3 Arabic Broadcast News Transcripts Part 1: the complete set of corresponding transcripts, comprising 741,689 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources LDC developed for the DARPA GALE program

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text: 76 source-translation document pairs, comprising 614,608 tokens of Chinese source text and its English translation

GALE Phase 4 Arabic Weblog Parallel Sentences: 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data