February 2013 Newsletter

New Corpora

GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 [1]

GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 [2]

NIST 2012 Open Machine Translation (OpenMT) Evaluation [3]

Announcements

Spring 2013 LDC Data Scholarship Recipients
LDC is pleased to announce the student recipients of the Spring 2013 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen three proposals to support. The following students will receive no-cost copies of LDC data:

Salima Harrat - Ecole Supérieure d’informatique (ESI) (Algeria). Salima has been awarded a copy of Arabic Treebank: Part 3 for her work in diacritic restoration.

Maulik C. Madhavi - Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar (India). Maulik has been awarded a copy of Switchboard Cellular Part 1 Transcribed Audio and Transcripts and 1997 HUB4 English Evaluation Speech and Transcripts for his work in spoken term detection.

Shereen M. Oraby - Arab Academy for Science, Technology, and Maritime Transport (Egypt). Shereen has been awarded a copy of Arabic Treebank: Part 1 for her work in subjectivity and sentiment analysis.

Please join us in congratulating our student recipients! The next LDC Data Scholarship program is scheduled for the Fall 2013 semester.

Membership Fee Savings and Publications Pipeline
Time is quickly running out to save on membership fees for Membership Year 2013 (MY2013)! Any organization that joins or renews membership for 2013 through Friday, March 1, 2013, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2012 can receive a 10% discount on fees, provided they renew prior to March 1, 2013.

Many publications for MY2013 are still in development. The planned publications for the upcoming months include:

GALE data ~ continuing releases of all languages (Arabic, Chinese, English), genres (Broadcast News, Broadcast Conversation, Newswire and Web Data) and tasks (Parallel Text, Word Alignment, Parallel Aligned Treebanks, Parallel Sentences, Audio and Transcripts).

Hispanic Accented English Database ~ 30 hours of conversational speech data from non-native speakers of English, with approximately 24 hours (80%) of the data closely transcribed. The speech in this release was collected from 22 non-native, Hispanic speakers of English and consists of spontaneous speech and read utterances. The read speech is divided equally between English and Spanish.

NIST 2012 Open Machine Translation Progress Tests ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT12 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. This set is based on a subset of the Arabic-to-English and Chinese-to-English Progress tests from the NIST Open Machine Translation 2008, 2009, and 2012 evaluations, with new source data created from the English human reference translations. The original data consists of newswire and web data.

NIST Open Machine Translation 2008 to 2012 Progress Test Sets ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English Progress tests of the NIST Open Machine Translation 2008, 2009, and 2012 Evaluations. The test sets consist of newswire and web data.

OntoNotes 5.0 ~ multiple genres of English, Chinese, and Arabic text annotated for syntax, predicate argument structure and shallow semantics.

UN Parallel Text ~ contains the text of United Nations parliamentary documents in Arabic, Chinese, English, French, Russian, and Spanish from 1993 through 2007. The data is provided in two formats: (1) raw text, which is very close to what was extracted from the word processing documents and has been converted to UTF-8 encoding, and (2) word-aligned text, which has been normalized, tokenized, aligned at the sentence level, further broken into sub-sentential "chunk-pairs", and then aligned at the word level.

New LDC Podcast: LDC Executive Director Christopher Cieri
The LDC blog has a new podcast in LDC’s 20th Anniversary series. This edition features LDC’s Executive Director, Christopher Cieri. In this podcast, Chris reflects on the road that took him to LDC, some of his early responsibilities, and recent consortium activities. Other podcasts will be published via the LDC blog, so stay tuned to that space.