Arabic Pronunciation Dictionary: Arabic Speech Recognition Pronunciation Dictionary : developed by the Qatar Computing Research Institute, two million pronunciation entries for 526,000 Modern Standard Arabic words, for an average of 3.84 pronunciations for each grapheme word
Vietnamese Speech: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 : developed by Appen, 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 from speakers ages 16 to 64 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle
English Compound Function Words: MWE-Aware English Dependency Corpus : developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory, English compound function words annotated in dependency format, derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19).
Bamanankan Lexicon: Bamanankan Lexicon : 5,978 entries of the Bamanankan language, an Eastern Manding language in the Mande Group of the Niger-Congo language family, presented as a Bamanankan-English lexicon and a Bamanankan-French lexicon using a Latin-based transcription system
Tagalog Speech: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g : developed by Appen, 213 hours of Tagalog conversational and scripted telephone speech and the corresponding transcripts collected in 2012 from speakers ages 16 to 65 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle
TAC KBP Spanish Cross-lingual Entity Linking: TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014 : training and evaluation data that includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities, along with the source documents for the queries: English and Spanish newswire, discussion forum, and web data
Egyptian Arabic dialogue act annotations: JANA: A Human-Human Dialogues Corpus for Egyptian Dialect : 82 transcribed call center dialogues (over 20k words) annotated for dialogue acts
Telephone speech in related languages: Multilanguage Conversational Telephone Speech – Slavic Group : 60 hours of Polish, Ukrainian and Russian telephone speech labeled for gender, dialect type and noise
Georgian Speech: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a : 190 hours of Georgian conversational and scripted telephone speech and corresponding transcripts collected in 2014-2015
Chinese-English parallel patent data: Chinese-English Parallel Sentences Extracted from Patents : 500k sentence pairs from over 300k patents from diverse fields; a special LDC release available under separate terms
GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word aligned and tagged resources LDC developed for the DARPA GALE program
GALE Phase 3 and 4 Chinese Web Parallel Text : 88 source-translation document pairs, comprising 67,514 tokens of Chinese source text and its English translation.
GALE Phase 4 Arabic Newswire Parallel Sentences: 393 source-translation document pairs drawn from six distinct newswire sources, comprising 62,669 tokens of Arabic source text and its English translation.