New Corpora

Egyptian Arabic dialogue act annotations: JANA: A Human-Human Dialogues Corpus for Egyptian Dialect: 82 transcribed call center dialogues (over 20k words) annotated for dialogue acts

Telephone speech in related languages: Multilanguage Conversational Telephone Speech – Slavic Group: 60 hours of Polish, Ukrainian and Russian telephone speech labeled for gender, dialect type and noise

Georgian Speech: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a: 190 hours of Georgian conversational and scripted telephone speech and corresponding transcripts collected in 2014-2015

Chinese-English parallel patent data: Chinese-English Parallel Sentences Extracted from Patents: 500k sentence pairs from over 300k patents from diverse fields; a special LDC release available under separate terms

English coreference and event annotations: Richer Event Description: coreference, bridging and event-event relations over 95 newswire, discussion forum and narrative text documents

Arabic text recognition data: KAFD: Arabic Font Database: over two million scanned Arabic texts from many sources in a variety of fonts, sizes and resolutions

Turkish Speech: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5: developed by Appen, 213 hours of Turkish conversational and scripted telephone speech and the corresponding transcripts collected in 2012 from speakers ages 16 to 70 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

Arabic Dependency Treebank: ARL Arabic Dependency Treebank: derived from LDC's Arabic treebank series using constituency-to-dependency software developed by the US Army Research Laboratory

Annotated Chinese discussion forum data: BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training: ~450,000 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations

Pashto Speech: IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY: developed by Appen, 214 hours of Pashto conversational and scripted telephone speech and the corresponding transcripts collected in 2011 and 2012 from speakers ages 17 to 70 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word aligned and tagged resources LDC developed for the DARPA GALE program

GALE Phase 3 and 4 Chinese Newswire Parallel Text: 367 source-translation document pairs, comprising 210,048 tokens (Chinese source) of translated data

GALE Phase 4 Arabic Broadcast News Parallel Sentences: 106 source-translation document pairs, comprising 114,251 words (Arabic source) of translated data