New Corpora

Ukrainian speech and transcripts: AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news audio with 1.2 million words of corresponding orthographic transcripts, developed by LDC for the DARPA AIDA program

Swahili language resources for HLT development: LORELEI Swahili Representative Language Pack: monolingual and parallel text with entity linking and detection annotation and situation frame analysis, developed by LDC for the DARPA LORELEI program

Multi-language images and annotations for OCR: CAMIO Transcription Languages: 70K images of machine-printed text with annotations and transcripts in 13 languages, developed by LDC for OCR and related technology development

Thai speech and transcripts: Global TIMIT Thai: 12 hours of read speech and time-aligned transcripts in Standard Thai from 50 speakers reading 120 selected sentences (6000 total utterances), modeled after the classic English TIMIT study

Diarization for challenging speech data: Third DIHARD Challenge Development and Third DIHARD Challenge Evaluation: 34 and 33 hours, respectively, of English and Chinese speech data and annotations (diarization, segmentation) developed by LDC for the Third DIHARD Challenge; source data includes monologues, interviews, meeting speech, clinical recordings and amateur web videos, among others

English translation SMS/Chat treebank: BOLT English Translation Treebank – Egyptian Arabic SMS/Chat: developed by LDC for the DARPA BOLT program, 98,206 tokens of translated Egyptian Arabic text annotated for part-of-speech and syntactic structure in Penn Treebank II style

Icelandic kids prompted speech: Samrómur Children Icelandic Speech 1.0: 131 hours of Icelandic prompted speech from 3,175 speakers (children aged 4-17 years) representing 137,597 utterances, developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology