New Corpora

Diarization for challenging speech data: Second DIHARD Challenge Evaluation - Eleven Sources, 20 hours of English and Chinese speech data and annotations (diarization, segmentation) developed by LDC for the Second DIHARD Challenge; source data includes monologues, interviews, meeting speech and clinical recordings, among others

English read speech from specially constructed stories: NUBUC (NyU-BU contextually controlled stories Corpus), developed by New York University, Max Planck Institute for Empirical Aesthetics and Boston University, eight stories with keywords for linguistic analysis, each read by two speakers (one female, one male), with transcripts, syntactic annotations and corpus metadata

Icelandic prompted speech: Samrómur Icelandic Speech 1.0, 145 hours of Icelandic prompted speech (100,000 utterances from 8,392 speakers), developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology

Wolof language resources for HLT development: LORELEI Wolof Representative Language Pack, monolingual and parallel text with entity linking and detection annotation and situation frame analysis, developed by LDC for the DARPA LORELEI program