New Corpora

CTS and BNBS in 20 languages for language recognition: 2015 NIST Language Recognition Evaluation Test Set: 867 hours of telephone speech and broadcast narrowband speech in 20 languages representing 6 clusters of related languages (Arabic, Spanish, English, Chinese, Slavic, and French), developed by LDC and NIST

Argumentative essays by students studying second languages: The Xi'an Multi-Language Learner Corpus: developed by Xi'an International Studies University, 526 argumentative essays in 15 languages by Chinese L1 university undergraduate students studying second languages, collected in 2023 and 2024

English, Russian, and Spanish multimodal data with annotations interpreting events, situations, and trends: AIDA Scenario 3 Practice Topic Source Data and Annotation: developed by LDC for the DARPA AIDA program, 1417 root files (text, image, video) from English, Russian, and Spanish web sources for a COVID-19 scenario annotated for relations, events, entities, and claim frames

Georgian speech and annotations for cross-language information retrieval: MATERIAL Georgian-English Language Pack: 79 hours of Georgian conversational telephone speech, transcripts, English translations, annotations, and queries designed to support cross-language information retrieval, developed by Appen for the IARPA MATERIAL program

Iraqi Arabic lexical resource: Iraqi Arabic - English Lexical Database: 67k+ Iraqi Arabic words in Arabic script and IPA notation and 120k+ English tokens, developed by LDC in collaboration with Georgetown University Press to update and enhance 1960s dictionaries

Hungarian language resources for HLT development: LORELEI Hungarian Representative Language Pack: monolingual and parallel text with annotations and software tools, developed by LDC for the DARPA LORELEI program