November 2018 Newsletter

Thursday, November 15, 2018

New Publications


Avatar Education Portuguese

BOLT Egyptian Arabic Treebank - Discussion Forum

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a


Join LDC for Membership Year 2019

Membership Year 2019 (MY2019) is open, and discounts are available for those who keep their membership current and join early in the year. Now through March 1, 2019, current MY2018 members who renew their LDC membership will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 750 holdings. Current-year for-profit members may use most data for commercial applications. 

Plans for MY2019 publications are in progress. Among the expected releases are:

  • SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks; includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
  • Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University, semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
  • TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
  • CALLFRIEND Second Edition: updated releases with .wav format audio, simplified directory structure and enhanced documentation and metadata (English, Egyptian Arabic, Mandarin Chinese-Taiwan)
  • HAVIC Med Progress Test data: English web video, metadata, and annotations for developing multimedia systems
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
  • BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

It’s also not too late to join for MY2017 (through December 31, 2018) and MY2018 (through December 31, 2019). Data sets from those years include 2010 NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting and Language Identification releases, CHiME, Noisy TIMIT Speech, Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ Audio, LORELEI Amharic and Somali Language Packs and DEFT Spanish Treebank. For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2019 Data Scholarship Program

Applications for the Spring 2019 LDC Data Scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2019. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.