COVID-19 Research


The COVID-19 pandemic has highlighted the importance of data-driven solutions to facilitate rapid response and humanitarian relief, and its global nature demonstrates the need for multilingual resources. To aid in this effort, LDC is releasing data it developed in the DARPA LORELEI program under a special no-cost license for COVID-19 research.

The LORELEI (Low Resource Languages for Emergent Incidents) Program aimed to build human language technology for low-resource languages in the context of emergent situations such as natural disasters or disease outbreaks. Linguistic resources developed by LDC for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low-resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

These resources are available in a single corpus:

LDC2020E21 LORELEI Language Packs for COVID-19 Research

The corpus is divided into four sets, each available as its own zip file:

Incident Languages: Uighur (IL3), Tigrinya (IL5), Oromo (IL6), Kinyarwanda (IL9), Sinhala (IL10), Odia (IL11), and Ilocano (IL12); LORELEI Entity Detection and Linking Knowledge Base

Representative Languages 1: Mandarin, Amharic, Arabic, Somali, Farsi, Vietnamese, Yoruba, Bengali, Hindi, Swahili, Indonesian, Tagalog, Tamil, Thai, Zulu, Akan, Wolof, Ukrainian, Uzbek and Hausa; LORELEI Entity Detection and Linking Knowledge Base

Representative Languages 2: Spanish, Hungarian, and Turkish; LORELEI Entity Detection and Linking Knowledge Base

Representative Languages 3: Russian; LORELEI Entity Detection and Linking Knowledge Base

The LDC COVID-19 License Agreement expires on June 30, 2021. Because most of the data has not undergone the level of quality control LDC normally applies to publications, all licensees must agree not to mention errors in the data in any articles, published papers, reports, presentations or other documents describing the results of work performed under the LDC COVID-19 License Agreement. Instead, licensees must agree to report any errors in the data directly to LDC.

To access this corpus, complete the LDC COVID-19 License Agreement and email a signed, scanned copy to LDC’s membership office at ldc@ldc.upenn.edu. Once the agreement is received and processed, instructions for accessing the data will be provided.