New Corpora

English Legal Briefs: First-Year Law Students' Court Memoranda: 197 English writing samples by law students, annotated with survey responses from the student writers, created for applying natural language processing approaches to determine whether any differences in the briefs' language are attributable to the students' self-reported genders

Haitian Creole Speech: IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b: developed by Appen, 203 hours of Haitian Creole conversational and scripted telephone speech collected in 2012 and 2013 from speakers ages 16 to 75 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

Arabic Pronunciation Dictionary: Arabic Speech Recognition Pronunciation Dictionary: developed by the Qatar Computing Research Institute, two million pronunciation entries for 526,000 Modern Standard Arabic words, an average of 3.84 pronunciations per grapheme word

Vietnamese Speech: IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7: developed by Appen, 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 from speakers ages 16 to 64 using a variety of telephones in different environments, including the street, a home or office, a public place, and inside a vehicle

English Compound Function Words: MWE-Aware English Dependency Corpus: developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory, English compound function words annotated in dependency format, derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19)

Bamanankan Lexicon: Bamanankan Lexicon: 5,978 entries of the Bamanankan language, an Eastern Manding language in the Mande Group of the Niger-Congo language family, presented as a Bamanankan-English lexicon and a Bamanankan-French lexicon using a Latin-based transcription system

Tagalog Speech: IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g: developed by Appen, 213 hours of Tagalog conversational and scripted telephone speech and the corresponding transcripts collected in 2012 from speakers ages 16 to 65 using a variety of devices in different environments, including the street, a home or office, a public place, and inside a vehicle

TAC KBP Spanish Cross-lingual Entity Linking: TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014: training and evaluation data that include queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities, along with the source documents for the queries (English and Spanish newswire, discussion forum, and web data)

GALE Broadcast Collection: Arabic/Chinese broadcast speech collected by LDC for the DARPA GALE program with associated transcripts

GALE Phase 3 Arabic Broadcast News Speech Part 2: 128 hours of Arabic broadcast news speech collected in 2007

GALE Phase 3 Arabic Broadcast News Transcripts Part 2: the complete set of corresponding transcripts, containing 721,846 tokens in plain-text, tab-delimited format with UTF-8 encoding

GALE Parallel, Word Aligned and Tagged Text: Arabic/Chinese and English parallel, word-aligned, and tagged resources developed by LDC for the DARPA GALE program

GALE Phase 3 and 4 Chinese Web Parallel Text: 88 source-translation document pairs, comprising 67,514 tokens of Chinese source text and its English translation