New Corpora
Yoruba language resources for HLT development: LORELEI Yoruba Representative Language Pack: monolingual and parallel text with annotations and software tools, developed by LDC for the DARPA LORELEI program
Synthetic Icelandic speech: Samrómur Synthetic: developed by Reykjavik University, 72 hours of synthetic speech from 44 voices (22 male, 22 female) at four speed rates, totaling 220 speakers and 62,700 utterances (285 sentences per speaker)
Continuity annotations for Penn Treebank WSJ text: RST Continuity Corpus: builds on the Rhetorical Structure Theory (RST) framework, annotating PTB WSJ articles for seven continuity dimensions as well as polarity, order of segments, nuclearity, and context; developed at Åbo Akademi University and Humboldt-Universität zu Berlin
TACRED machine translation with projected annotations: MultiTACRED: developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab, English newswire and web text translated into 12 languages with projected entity annotations converted into XML-style markers; source data was developed by LDC for NIST TAC KBP 2009-2014 English slot filling tasks
Arabic read speech: L2-KSU Native and Non-Native Arabic Speech: 6 hours of speech from 80 subjects reading 10 sentences and repeating each sentence multiple times, including transcripts and speaker metadata, developed by King Saud University
Somali speech and annotations for cross-language information retrieval: MATERIAL Somali-English Language Pack: developed by Appen for the IARPA MATERIAL program, 80 hours of Somali conversational telephone speech with transcripts, English translations, annotations, and queries designed to support cross-language information retrieval