New Corpora

Tunisian Arabic CTS for speaker identification: 2019 NIST Speaker Recognition Evaluation Test Set – CTS Challenge: a turnkey corpus with 635 hours of Tunisian Arabic telephone recordings for development and test, answer keys, enrollment, trial files, and documentation

Zulu language resources for HLT development: LORELEI Zulu Representative Language Pack: monolingual and parallel text with entity linking and detection annotation and situation frame analysis, developed by LDC for the DARPA LORELEI program

Dependency annotation for Korean newswire data: Penn Korean Universal Dependency Treebank: 5,010 sentences and 132,041 tokens from Korean Press Agency newswire stories annotated in universal dependency format, converted from the constituency treebank Korean Treebank Annotations Version 2.0 (LDC2006T09)

Various English text annotated for entities, relations, events: DEFT English Light and Rich ERE Annotation: 1,190 English discussion forum, newswire, and proxy documents with expanded ERE mark-up under light and rich annotation guidelines, developed by LDC for the DARPA DEFT program

Multi-language telephone speech for speaker and language ID: Mixer 3 Speech: 3,200 hours of conversational telephone speech from 3,875 speakers in 19,595 recordings and 26 distinct languages, collected by LDC from 2005 to 2007 as part of the Mixer project, with portions used in NIST SRE and LRE corpora

Tamil language resources for HLT development: LORELEI Tamil Representative Language Pack: monolingual and parallel text with entity linking and detection annotation and situation frame analysis, developed by LDC for the DARPA LORELEI program