Disaster and Refugee Relief Research

LDC is releasing Ukrainian data it developed in the DARPA AIDA program, the NIST Language Recognition Evaluation (LRE) series and the DARPA LORELEI program under a special no-cost, limited license for disaster and refugee relief research.

The AIDA (Active Interpretation of Disparate Alternatives) program aims to develop a multi-hypothesis semantic engine that generates explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supports AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

The goal of NIST’s LRE series is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. 

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low-resource languages in the context of emergent situations such as natural disasters or disease outbreaks. Linguistic resources developed by LDC for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low-resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

The following corpora are available under the LDC Disaster and Refugee Relief License Agreement: 

LDC2022E06 AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
This corpus contains approximately 156 hours of Ukrainian conversational telephone speech and broadcast news with corresponding orthographic transcripts. The broadcast material was collected to support the AIDA program. The telephone recordings were used in the NIST 2011 LRE, which focused on pair discrimination for 24 languages/dialects, and are also contained in LDC2016S11 Multi-Language Conversational Telephone Speech 2011 – Slavic Group.

LDC2020T24 LORELEI Ukrainian Representative Language Pack
Source data was collected from discussion forum, news, reference, social network and weblog sources. The corpus contains 111 million words of monolingual text (of which 700,000 words were translated into English); 86,000 Ukrainian words translated from English data; 174,000 words of parallel text; and more than 2,000,000 words of comparable text. Around 75,000 words were annotated for named entities, and up to 50,000 words contain additional annotation, including situation frames and entity detection and linking.

LDC2020T10 LORELEI Entity Detection and Linking Knowledge Base
This data set contains the full Entity Detection and Linking Knowledge Base used for entity linking annotation in all LORELEI language packs. The knowledge base content was drawn from GeoNames, the CIA World Leaders List and the CIA World Factbook and was supplemented with manually created entries developed specifically for LORELEI data. 

The LDC Disaster and Refugee Relief License Agreement expires on December 31, 2022.

To access the corpora described above, complete the LDC Disaster and Refugee Relief License Agreement and email a signed, scanned copy to LDC at ldc@ldc.upenn.edu. Once the agreement is received and processed, LDC will provide instructions for accessing the data.