New Corpora

Arabic Dependency Treebank: ARL Arabic Dependency Treebank: derived from LDC's Arabic treebank series using constituency-to-dependency software developed by the US Army Research Laboratory

Annotated Chinese discussion forum data: BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training: ~450,000 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations

Pashto Speech: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY: developed by Appen, 214 hours of Pashto conversational and scripted telephone speech and the corresponding transcripts collected in 2011 and 2012 from speakers ages 17 to 70 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

Assamese Speech: IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a: developed by Appen, 205 hours of Assamese conversational and scripted telephone speech and the corresponding transcripts collected in 2012 and 2013 from speakers ages 16 to 66 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

Bengali Speech: IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b: developed by Appen, 215 hours of Bengali conversational and scripted telephone speech and the corresponding transcripts collected in 2011 and 2012 from speakers ages 16 to 65 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

English Conversation Transcripts: English Speed Networking Conversational Transcripts: developed by the University of the West of England, 388 transcripts of English face-to-face and instant messaging conversations about business ideas, collected in 2014 and 2015 from undergraduate students playing different power roles to examine how an individual's linguistic style is affected by social power and personality

American Southern English Speech: Digital Archive of Southern Speech - NLP Version (DASS-NLP): developed by LDC as an alternate version of Digital Archive of Southern Speech (DASS) (LDC2012S03), 366 hours of English speech data and metadata collected as part of the Linguistic Atlas Project, converted into 16 kHz, 16-bit FLAC-compressed WAV with normalized file names to facilitate automatic processing for human language technologies and natural language processing applications

Cantonese Speech: IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c: developed by Appen, 215 hours of Cantonese conversational and scripted telephone speech and the corresponding transcripts collected in 2011 from speakers ages 16 to 67 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 3 Arabic Broadcast News Speech Part 1: 132 hours of Arabic broadcast news speech collected in 2007

GALE Phase 3 Arabic Broadcast News Transcripts Part 1: the complete set of corresponding transcripts, comprising 741,689 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned and tagged resources developed by LDC for the DARPA GALE program

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text: 76 source-translation document pairs, comprising 614,608 tokens of Chinese source text and its English translation

GALE Phase 4 Arabic Broadcast News Parallel Sentences: 106 source-translation document pairs, comprising 114,251 words of Arabic source text and its translation