October 2020 Newsletter

Thursday, October 15, 2020

New Publications

Global TIMIT Learner Treebank English 

Corpus of Law, Academic, and News 

IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b  


Fall 2020 Data Scholarship Recipients 
Congratulations to the recipients of LDC's Fall 2020 data scholarships:  

Nicole Dodd: University of California, Davis (USA); MA, Linguistics. Nicole is awarded a copy of Arabic Treebank Part 3 v. 3.2 LDC2010T08 for her research in relative clause processing in Standard Arabic. 
Satwik Dutta: University of Texas at Dallas (USA); PhD, Electrical Engineering. Satwik is awarded copies of The CMU Kids Corpus LDC97S63 and CSLU: Kids’ Speech Version 1.1 LDC2007S18 for his work in speech activity detection.  
Pedram Hosseini: George Washington University (USA); PhD, Computer Science. Pedram is awarded copies of Penn Discourse Treebank Version 3.0 LDC2019T05 and The New York Times Annotated Corpus LDC2008T19 for his research in automatic detection of causal relations in text.  
Mariano Maisonnave: Universidad Nacional del Sur (Argentina); PhD, Computer Science. Mariano is awarded a copy of ACE 2005 Multilingual Training Corpus LDC2006T06 for his work in event extraction.  
Mark Sullivan: California State University, Los Angeles (USA); Masters, Applied and Advanced Studies in Education. Mark is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for his research in sentence boundary problems of Chinese L1 speakers in English compositions.  
For information about the program, visit the Data Scholarships page.

Membership Year 2021 Publication Preview 
The 2021 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are: 

  • Global TIMIT Mandarin Chinese: 6,000 linguistically rich utterances featuring time-aligned lexical and phonetic transcription 
  • Columbia Games Corpus: 12 spontaneous task-oriented dyadic conversations elicited from native Standard American English speakers playing computer games, transcribed and annotated for discourse/pragmatic phenomena 
  • My Science Tutor Children’s Conversational Speech: 400+ hours of speech from 1,371 US third, fourth, and fifth grade students participating in sessions with a virtual science tutor, transcripts included 
  • The SSNCE Database of Tamil Dysarthric Speech: Tamil speech from 20 dysarthric speakers aged 12-40 years and a control group (10 speakers) with time-aligned phonetic transcripts 
  • Icelandic Parliamentary Speech: 6,493 Icelandic Parliament recordings from 2005-2016 with 196 speakers, aligned and segmented and divided into training, development, and evaluation sets for ASR development 
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools (Akan, Kinyarwanda, and Wolof) 
  • BOLT: co-reference, treebank, propbank, and translation resources for discussion forum, SMS/Chat, and conversational telephone speech data in all languages (Chinese, Egyptian Arabic, and English) 
  • TAC KBP: training and evaluation data for English surprise slot filling (2010) and English sentiment slot filling (2013-2014) tasks  

Check your inbox in the coming weeks for more information about membership renewal.

LDC Data and Commercial Technology Development 
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.