Linguistic Data Consortium
Published on Linguistic Data Consortium (https://www.ldc.upenn.edu)


New Corpora

English read speech from specially constructed stories: NUBUC [1] (NyU-BU contextually controlled stories Corpus), developed by New York University [2], Max Planck Institute for Empirical Aesthetics [3] and Boston University [4]; eight stories with keywords for linguistic analysis, each read by two speakers (one female, one male), with transcripts, syntactic annotations and corpus metadata

Icelandic prompted speech: Samrómur Icelandic Speech 1.0 [5], 145 hours of prompted speech (100,000 utterances from 8,392 speakers), developed by the Language and Voice Lab, Reykjavik University [6] in cooperation with Almannarómur, Center for Language Technology [7]

Wolof language resources for HLT development: LORELEI Wolof Representative Language Pack [8] – monolingual and parallel text with entity linking and detection annotation and situation frame analysis, developed by LDC for the DARPA LORELEI program

Arabic newswire annotated for attribution: AttImam [9], developed by Al-Imam Mohammad Ibn Saud Islamic University [10], 2,000 attribution relations manually applied to Agence France Presse text from Arabic Treebank: Part 1 v 4.1 (LDC2010T13) [11]

Amateur web video for multimodal event detection: HAVIC MED Novel 1 Test – Videos, Metadata and Annotation [12], 3,800 hours of video annotated with event properties and with topic and genre categories, developed by LDC for the 2015 NIST-sponsored MED (Multimedia Event Detection) task


Source URL: https://www.ldc.upenn.edu/new-corpora

Links
[1] https://catalog.ldc.upenn.edu/LDC2022S04
[2] https://www.nyu.edu/
[3] https://www.mpg.de/6971390/empirical-aesthetics
[4] https://www.bu.edu/
[5] https://catalog.ldc.upenn.edu/LDC2022S05
[6] https://lvl.ru.is/
[7] https://almannaromur.is/
[8] https://catalog.ldc.upenn.edu/LDC2022T03
[9] https://catalog.ldc.upenn.edu/LDC2022T02
[10] https://imamu.edu.sa/en/Pages/default.aspx
[11] https://catalog.ldc.upenn.edu/LDC2010T13
[12] https://catalog.ldc.upenn.edu/LDC2022V01