New Corpora

Bamanankan Lexicon: Bamanankan Lexicon: 5,978 entries of the Bamanankan language, an Eastern Manding language in the Mande Group of the Niger-Congo language family, presented as a Bamanankan-English lexicon and a Bamanankan-French lexicon using a Latin-based transcription system

Tagalog Speech: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g: developed by Appen, 213 hours of Tagalog conversational and scripted telephone speech and the corresponding transcripts collected in 2012 from speakers ages 16 to 65 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

TAC KBP Spanish Cross-lingual Entity Linking: TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014: training and evaluation data which includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Spanish newswire, discussion forum and web data

Egyptian Arabic dialogue act annotations: JANA: A Human-Human Dialogues Corpus for Egyptian Dialect: 82 transcribed call center dialogues (over 20k words) annotated for dialogue acts

Telephone speech in related languages: Multilanguage Conversational Telephone Speech – Slavic Group: 60 hours of Polish, Ukrainian and Russian telephone speech labeled for gender, dialect type and noise

Georgian Speech: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a: 190 hours of Georgian conversational and scripted telephone speech and corresponding transcripts collected in 2014-2015

Chinese-English parallel patent data: Chinese-English Parallel Sentences Extracted from Patents: 500k sentence pairs from over 300k patents from diverse fields; a special LDC release available under separate terms

English coreference and event annotations: Richer Event Description: coreference, bridging and event-event relations over 95 newswire, discussion forum and narrative text documents

Arabic text recognition data: KAFD: Arabic Font Database: over two million scanned Arabic texts from many sources in a variety of fonts, sizes and resolutions

Turkish Speech: IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5: developed by Appen, 213 hours of Turkish conversational and scripted telephone speech and the corresponding transcripts collected in 2012 from speakers ages 16 to 70 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word aligned and tagged resources LDC developed for the DARPA GALE program

GALE Phase 4 Arabic Newswire Parallel Sentences: 393 source-translation document pairs drawn from six distinct newswire sources, comprising 62,669 tokens of Arabic source text and its English translation