November 2021 Newsletter

Monday, November 15, 2021

New Publications

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Second DIHARD Challenge Development – Eleven Sources

Second DIHARD Challenge Development – SEEDLingS


Join LDC for Membership Year 2022 
Membership Year 2022 (MY2022) is open, and discounts are available for those who keep their membership current and join early. Current MY2021 members who renew their LDC membership before March 1, 2022, will receive a 10% discount on the membership fee. New or returning organizations will receive a 5% discount when joining by March 1, 2022.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data from our Catalog of 900 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for MY2022 publications are in progress. Among the expected releases are:

  • 2017 NIST OpenSAT Pilot – SSSF: real-world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation
  • AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 (LDC2010T13)
  • Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names
  • MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts
  • HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task
  • DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

It’s not too late to join LDC for MY2020 (through December 31, 2021) and MY2021 (through December 31, 2022). Data sets from those years include 2018 NIST Speaker Recognition Evaluation Test Set, Mixer 4 and 5 Speech, AMR Annotation Release 3.0, Penn Parsed Corpora of Historical English, RATS Speaker Identification, BOLT Egyptian Arabic and Chinese resources (treebanks, propbanks, co-reference), Global TIMIT Mandarin Chinese, and MyST Children’s Conversational Speech.

For full descriptions of all LDC data sets, browse our Catalog.  

Visit Join LDC for details on membership, user accounts and payment.

Spring 2022 Data Scholarship Application Deadline
Applications are now being accepted through January 15, 2022, for the Spring 2022 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.