February 2014 Newsletter

Monday, February 17, 2014

New Corpora

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2

King Saud University Arabic Speech Database

NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source

Announcements

Spring 2014 LDC Data Scholarship Recipients
LDC is pleased to announce the student recipients of the Spring 2014 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen two proposals to support. The following students will receive no-cost copies of LDC data:

  • Skye Anderson ~ Tulane University (USA), BA candidate, Linguistics.  Skye has been awarded a copy of LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 for her work in author profiling.
  • Hao Liu ~ University College London (UK), PhD candidate, Speech, Hearing and Phonetic Sciences.  Hao has been awarded a copy of Switchboard-1 Release 2 and NXT Switchboard Annotations for his work in prosody modeling.

Membership Fee Savings and Publications Pipeline
Members can still save on 2014 membership fees, but time is running out. Any organization which joins or renews membership for 2014 through Monday, March 3, 2014, is entitled to a 5% discount. Organizations which held membership for MY2013 can receive a 10% discount on fees provided they renew prior to March 3, 2014.

Planned publications for this year include:

  • 2009 NIST Language Recognition Evaluation ~  development data from VOA broadcast and CTS telephone speech in target and non-target languages.
  • ETS Corpus of Non-Native Written English ~ contains 1100 essays written for a college-entrance test sampled from eight prompts (i.e., topics) with score levels (low/medium/high) for each essay.
  • GALE data ~ including Word Alignment, Broadcast Speech & Transcripts, Parallel Text, Parallel Aligned Treebanks in Arabic, Chinese, and English.
  • Hispanic Accented English ~ contains approximately 30 hours of spontaneous speech and read utterances from non-native speakers of English with corresponding transcripts.
  • Multi-Channel Wall Street Journal Audio-Visual Corpus (MC-WSJ-AV) ~  re-recording of parts of the WSJCAM0 using a number of microphones as well as three recording conditions resulting in 18-20 channels of audio per recording.
  • TAC KBP Reference Knowledge Base ~  TAC KBP aims to develop and evaluate technologies for building and populating knowledge bases (KBs) about named entities from unstructured text.  KBP systems must either populate an existing reference KB or build a KB from scratch. The reference KB is based on a snapshot of English Wikipedia from October 2008 and contains a set of entities, each with a canonical name and title for the Wikipedia page, an entity type, an automatically parsed version of the data from the infobox in the entity's Wikipedia article, and a stripped version of the text of the Wikipedia article.
  • USC-SFI MALACH Interviews and Transcripts Czech ~ developed by The University of Southern California's Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 143 hours of interviews from 420 interviewees along with transcripts and other documentation.

New LDC Website Enhancements Coming Soon
Look for LDC’s new website enhancements in the coming weeks. We've revamped our membership services to make it easier than ever for you to manage your membership and access data more quickly.