October 2023 Newsletter | Linguistic Data Consortium

New Corpora

AIDA Scenario 1 Practice Topic Source Data

AIDA Scenario 1 and 2 Reference Knowledge Base

Announcements

Membership Year 2024 Publication Preview
The 2024 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational telephone speech and web broadcasts, 65 hours transcribed
AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, Ukrainian, English, Spanish) for information and entity extraction
RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, Persian, Pushto, and Urdu audio files selected from RATS speech activity detection and keyword spotting data sets, also including communications systems sounds and silence
Call My Net 1: 364 hours of conversational telephone speech recordings in Tagalog, Cebuano, Cantonese and Mandarin from speakers in the Philippines and China using various handsets under diverse noise conditions
Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 433 native speakers with transcripts
Diaspora Tibetan Speech: elicited, read and spontaneous speech from 73 native Tibetan speakers in Katmandu’s diaspora Tibetan community, some recordings transcribed
IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Bulgarian, Somali, Georgian)
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)

Check your inbox in the coming weeks for more information about membership renewal. 

Fall 2023 Data Scholarship Recipients
Congratulations to the recipients of LDC's Fall 2023 data scholarships:

Nessma Diab: Ain-Shams University (Egypt): Pre-PhD student, Linguistics. Nessma is awarded copies of CALLHOME Egyptian Arabic Speech LDC97S45 and CALLHOME Egyptian Arabic Transcripts LDC97T19 for her work in machine translation.

Soheir Elssakkout: Ain-Shams University (Egypt): PhD candidate. Soheir is awarded copies of Turkish Broadcast News and Transcripts LDC2012S06 and Middle East Technical University Turkish Microphone Speech v 1.0 LDC2006S33 for her work in speech recognition.

Metheus Franco: Witten/Herdecke University (Germany): Post-doctoral scholar, Faculty of Management, Economics and Society. Metheus is awarded a copy of Avocado Research Email Collection LDC2015T03 for his work on the emotional foundations of dynamic capabilities.

Kamal Jarrar: Birzeit University (Palestine): Master’s student, Applied Statistics and Data Science Program. Kamal is awarded copies of Arabic Gigaword Fifth Edition LDC2011T11 and BOLT Arabic Discussion Forums LDC2018T10 for his work in part-of-speech tagging for dialectal Arabic.

Minkyoung Kim: Yonsei University (Korea): PhD candidate, Graduate School of Information. Minkyoung is awarded a copy of The New York Times Annotated Corpus LDC2008T19 for her work in event extraction and semantic event annotation.

Humaira Mehmood: Fatima Jinnah Women University (Pakistan): Master’s student, Computer Sciences. Humaira is awarded a copy of ARL Urdu Speech Database, Training Data LDC2007S03 for her work in machine translation.

Diyam Mousa: Birzeit University (Palestine): PhD candidate, Computer Science Department. Diyam is awarded copies of Arabic Treebank: Part 3 v. 3.2 LDC2010T08 and BOLT Egyptian Arabic Treebank – Discussion Forum LDC2018T23 for her work in morphological tagging for dialectal Arabic.

For information about the program, visit the Data Scholarships page.