New Corpora | Linguistic Data Consortium

New Corpora

Iraqi Arabic lexical resource: Iraqi Arabic - English Lexical Database: 67k+ Iraqi Arabic words in Arabic script and IPA notation and 120k+ English tokens, developed by LDC in collaboration with Georgetown University Press to update and enhance 1960s dictionaries

Hungarian language resources for HLT development: LORELEI Hungarian Representative Language Pack: monolingual and parallel text with annotations and software tools, developed by LDC for the DARPA LORELEI program

Farsi speech and annotations for cross-language information retrieval: MATERIAL Farsi-English Language Pack: developed by Appen for the IARPA MATERIAL program, 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross-language information retrieval

AMR 3.0 automatic translations: Abstract Meaning Representation 3.0 - Machine Translations: AMR 3.0 training, development and test splits translated into Spanish, Irish Gaelic, and Dutch using Google Translate, developed at KU Leuven

Yoruba language resources for HLT development: LORELEI Yoruba Representative Language Pack: monolingual and parallel text with annotations and software tools, developed by LDC for the DARPA LORELEI program

Synthetic Icelandic speech: Samrómur Synthetic: developed by Reykjavik University, 72 hours of synthetic speech, 44 voices (22 male, 22 female) at four speed rates, totaling 220 speakers and 62,700 utterances (285 sentences/speaker)