October 2021 Newsletter

Thursday, October 14, 2021

New Publications

BOLT Egyptian Arabic Treebank – SMS/Chat 


Fall 2021 data scholarship recipients 
Congratulations to the recipients of LDC's Fall 2021 data scholarships: 

  • Sophia Minnillo: University of California, Davis (USA); PhD, Linguistics. Sophia is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for her research on the use of transition markers by Chinese L1 speakers. 
  • Jagabandhu Mishra: Indian Institute of Technology Dharwad (India); Research Scholar, Electrical Engineering. Jagabandhu is awarded a copy of Mandarin-English Code-Switching in South-East Asia LDC2015S04 for his work in spoken language diarization.  
  • Kashyap Patel: University of Texas at Dallas (USA); PhD, Electrical Engineering. Kashyap is awarded copies of CSR-I (WSJ0) Sennheiser LDC93S6B and CSR-II (WSJ1) Sennheiser LDC94S13B for his research in audio, acoustic and speech signal processing.
  • Yoshani Ranaweera, D. Dissanayaka, S. Sudasinghe: University of Moratuwa (Sri Lanka); Bachelor's, Computer Science and Engineering. This group is awarded a copy of CALLHOME American English Speech LDC97S4 for their work in speaker diarization.
  • Winie Wong: University of Illinois at Chicago (USA); PhD, Electrical and Computer Engineering. Winie is awarded copies of ISI Chinese-English Automatically Extracted Parallel Text LDC2007T09 and GALE Phase 3 and 4 Chinese Broadcast News Parallel Text LDC2016T15 for her research in machine translation.  

For information about the program, visit the Data Scholarships page.

Membership Year 2022 publication preview 
The 2022 Membership Year is approaching and plans for next year’s publications are in progress. Among the expected releases are: 

  • 2017 NIST OpenSAT Pilot – SSSF: real-world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation

  • AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 LDC2010T13 

  • Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names 

  • MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts  

  • HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task 

  • DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data 

  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof) 

Check your inbox in the coming weeks for more information about membership renewal.  

LDC data and commercial technology development 

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.