New Corpora

Guanzhong Mandarin speech citizen science collection: Xi’an Guanzhong Object Naming: 15 hours of recordings from native speakers in Shaanxi Province (China) naming images from the MultiPic dataset, collected from a closed volunteer community using LanguageArc, a citizen science portal developed by LDC

Synthesized Maltese speech: MASRI Synthetic: developed by the University of Malta, 99 hours of synthesized Maltese speech based on text from various genres, created with 210 voices (105 female, 105 male)

Amateur web video for multimodal event detection: HAVIC MED Novel 2 Test – Videos, Metadata and Annotation: 6,200 hours annotated with event properties and topic and genre categories, developed by LDC for the 2015 NIST-sponsored MED (Multimedia Event Detection) task

Arabic argumentative essays: Qatari Corpus of Argumentative Writing, developed by Qatar University, University of Exeter and Hamad Bin Khalifa University, 200,000 tokens of Arabic and English writing by undergraduate students (159 female, 36 male) responding to argumentative prompts, with annotations and related metadata

English child language recordings with diarization annotation: Second DIHARD Challenge Evaluation - SEEDLingS, two hours of English child language recordings developed by Duke University and annotated by LDC for the first and second DIHARD challenges