October 2016 Newsletter

Wednesday, October 19, 2016

New Corpora

IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5

KAFD: Arabic Font Database

Richer Event Description


Fall 2016 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Fall 2016 data scholarships:

Tiba Zaki Abdulhameed: Western Michigan University (USA); PhD Candidate, Computer Science. Tiba is awarded copies of GALE Phase 2 Arabic Broadcast Conversation Speech and Transcripts for her research in dialectal ASR.

Abhishek Abhishek: Indian Institute of Technology Guwahati (India); PhD Candidate, Computer Science and Engineering. Abhishek is awarded copies of ACE 2004 Multilingual Training Corpus and The New York Times Annotated Corpus for his research in coreference resolution and relation extraction.

Sara Ebrahim: Ain Shams University (Egypt); MSc, Computer Science. Sara is awarded copies of LDC Standard Arabic Morphological Analyzer and NIST OpenMT 2008 Evaluation Selected References and System Translations for her work in machine translation.

Katherine Metcalf: Indiana University (USA); PhD Candidate, Computer Science. Katherine is awarded a copy of Emotional Prosody Speech and Transcripts for her research in acoustic/prosodic approaches to classifying emotional states.

Mousmita Sarma: Gauhati University (India); Post-Masters Research, Electronics and Communications Technology. Mousmita is awarded copies of Switchboard-1 Release 2 and IARPA Babel Assamese Language Pack for her research in Assamese dialect identification.

For program information visit the Data Scholarship page.


Chilin HK and LDC partner on distribution of parallel patent data

Chilin HK Limited (Chilin) and LDC are pleased to announce that the parallel data resource developed by Chilin, Chinese-English Parallel Sentences Extracted from Patents, is now available through the LDC Catalog. This is a special release in addition to the LDC scheduled corpora for membership year 2016, available under separate terms.

The Chilin Corpus derives primarily from the training corpus and test sets developed specifically for the Tokyo-based NTCIR 2009 & 2010 competitions on Patent MT (machine translation), which drew more than 30 international teams:

NTCIR-9: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-PATENTMT-GotoI.pdf

NTCIR-10: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings10/pdf/NTCIR/OVERVIEW/01-NTCIR10-PATENTMT-GotoI.pdf

The training corpus is drawn from a much larger curated corpus of parallel Chinese-English sentences and sentence fragments. These were in turn winnowed from an even larger corpus of more than 300k parallel Chinese-English patents in different fields, initially at the Research Centre on Language Information Sciences, City University of Hong Kong (authors: Benjamin Tsou, Bin Lu, and Kapo Chow). This data set is available from LDC under the following reference:

LDC2016T22   Chinese-English Parallel Sentences Extracted from Patents

Not-for-profit organizations may license this data set for US$25.00 under the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement for Non-Members for use in linguistic research, education and non-commercial technology development. For-profit organizations may license this data for US$5000, discounted to US$4000 for LDC for-profit members, under a commercial license.