October 2017 Newsletter

Wednesday, October 18, 2017

New Publications

RATS Keyword Spotting

English Web Treebank Propbank

Ancient Chinese Corpus

MWE-Aware English Dependency Corpus Version 2.0

Announcements

LDC Awards Fall Data Scholarships
LDC is pleased to award fifteen data scholarships to students this fall. Recipients are from eight countries and a variety of academic disciplines. Twenty unique data sets have been awarded to the students for their work in diverse applications, including machine translation, abstractive text summarization using recurrent neural networks, speech recognition for multiple languages, semantic role labeling for social data, text summarization, speaker recognition for forensic applications, and more. Please look to LDC’s social media pages for upcoming announcements highlighting each recipient and their intended research. Congratulations to all of our recipients!

Membership Year 2018 Publication Preview
The 2018 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:

  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forums, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting (longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns)

Check your inbox in the coming weeks for more information about membership renewal.