October 2010 Newsletter

Friday, October 15, 2010

New Corpora

ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0

Korean Newswire Second Edition

NIST 2006 Open Machine Translation (OpenMT) Evaluation


Fall 2010 LDC Data Scholarship Winners!

LDC is pleased to announce the winners in our first-ever LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. Data scholarships are offered twice a year to correspond to the Fall and Spring semesters. Students are asked to complete an application which consists of a data use proposal and a letter of support from their academic adviser.

LDC received many strong applications from both undergraduate and graduate students attending universities across the globe.  The decision process was difficult, and after much deliberation, we have selected eight winners!   These students will receive no-cost copies of LDC data valued at over US$10,000:

    Aby Abraham - Ohio University (USA), graduate student, Electrical Engineering.  Aby has been awarded a copy of 2003 NIST Speaker Recognition Evaluation (LDC2010S03) for his work in using long-term memory cells for continuous speech recognition.

    Ripandy Adha - Bandung Institute of Technology (Indonesia), undergraduate student, Computer Science.  Ripandy has been awarded a copy of American English Spoken Lexicon (LDC99L23) to assist in the development of a voice-command internet browser.

    Basawaraj - Ohio University (USA), PhD candidate, Electrical Engineering and Computer Science.  Basawaraj has been awarded a copy of NIST 2002 Open Machine Translation (OpenMT) Evaluation (LDC2010T10) to assist in fine-tuning his machine translation system and to provide a benchmark dataset.

    Zachary Brooks - University of Arizona (USA), PhD candidate, Second Language Acquisition and Teaching.  Zachary and his research group have been awarded a copy of ECI Multilingual Text (LDC94T5) for research in eye movement tracking by native and non-native readers.

    Marco Carmosino - Hampshire College (USA), undergraduate student, Computer Science.  Marco has been awarded a copy of English Gigaword Fourth Edition (LDC2009T13) for his work in narrative chain extraction.

    Xiaohui Huang - Harbin Institute of Technology (China), Shenzhen Graduate School.  Xiaohui has been awarded a copy of TDT5 Topics and Annotations (LDC2006T19) for his work in topic detection and tracking for large-scale web data.

    Yuhuan Zhou - PLA University of Science and Technology (China), postgraduate student, Institute of Communications Engineering.  Yuhuan has been awarded a copy of 2002 NIST Speaker Recognition Evaluation (LDC2004S04) to assist in the development of a speaker recognition system which fuses support vector data description (SVDD) and Gaussian mixture model (GMM).

    Speaker Recognition Group (GEDA) with members Matias Fineschi, Gonzalo Lavigna, Jorge Prendes, and Pablo Vacatello - Buenos Aires Institute of Technology (Argentina), Department of Electrical Engineering.  GEDA has been awarded a copy of 2004 NIST Speaker Recognition Evaluation (LDC2006S44) to assist in the development of a flexible platform for speaker verification capable of implementing different feature extraction methods, normalizations, stochastic models and outputs.

Please join us in congratulating our student winners!   The next LDC Data Scholarship program is scheduled for the Spring 2011 semester. Stay tuned for further announcements.

LDC at Interspeech 2010

We would like to thank all of the Interspeech 2010 attendees who stopped by the LDC display in Makuhari, Japan. We had the chance to interact with a great mix of speech researchers from around the globe, and we hope that we were able to answer your questions about the Consortium. The exhibition hall also provided LDC with an opportunity to showcase the new additions to our Data Sheet collection, which continue to be printed on FSC-certified 30% recycled paper.

The most frequently asked questions from conference attendees concerned which Asian languages and corresponding data types are represented in the LDC Catalog. Over the past 18 years, we have produced nearly 200 corpora in over 20 Asian languages, primarily in Chinese, Arabic, Japanese and Korean. These data sets consist of telephone speech, broadcast news and broadcast conversation, video keyframes, newswire, web collections and transcribed speech. To date, half of LDC’s 2010 publications are partly or primarily Asian language datasets, and we expect to release additional Chinese, Korean and Arabic corpora in the coming year.

LDC anticipated these queries, and in the months leading up to Interspeech we prepared an Asian Spoken Language Sampler, LDC2010S07, which showcases some of these releases. The sampler is freely available for download.

Commercial Use of LDC Data

LDC members and licensees are reminded that LDC data cannot be used for commercial purposes, with one exception: commercial organizations may conduct research and commercial technology development with LDC data received while the organization was an LDC for-profit member, unless use of that data is otherwise restricted by a corpus-specific user license. Not-for-profit members and non-members, including non-member commercial organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose.

To help clarify commercial use of LDC data, consider the following two cases in which a commercial organization licenses LDC data. In the first case, a company has joined LDC as a For-Profit Member for the current year. As a member, the company gains commercial rights to data from the year in which it joined, unless otherwise restricted by a corpus-specific user license. Furthermore, while a member for the current year, the company can license data for commercial use from closed Membership Years at the Reduced Licensing Fee. If the company does not renew its membership for the following year, it still retains ongoing commercial rights to data it licensed as a For-Profit Member and to any data from its Membership Year. The company will not have a commercial license to any new data obtained after its Membership Year has ended.

In the second case, a company licenses data as a non-member. At this point, the company is not an LDC member and cannot use LDC data for any commercial purpose. If that company joins LDC in the future, it will gain commercial rights to any data already licensed, unless those rights are otherwise restricted. In other words, a commercial organization can first license data as a non-member for research purposes and then join LDC to gain commercial rights to that data.

LDC data users are also reminded to consult corpus-specific license agreements for limitations, including commercial restrictions, on the use of certain corpora. In the case of a small group of corpora that includes American National Corpus (ANC) Second Release (LDC2005T35), Buckwalter Arabic Morphological Analyzer Version 2.0 (LDC2004L02) and all CSLU corpora, commercial licenses must be obtained separately from the owners of the data. A full list of corpus-specific user licenses can be found on our License Agreements page.

Position Openings at LDC

The Linguistic Data Consortium at the University of Pennsylvania has a number of immediate openings for full-time positions to support our corpus development projects:

        PROGRAMMER ANALYST (#100528459 and #100929195)

    Support linguistic data collection and annotation projects by providing software development, system integration, technical and research support, annotation tool development and/or data collection system management.

        SENIOR PROJECT MANAGER (#100728923 and #100728924)

    Provide complete oversight for multiple, concurrent corpus creation projects, including collection, annotation and distribution of speech, text and/or video data in a variety of languages. Create project roadmaps and direct teams of programmers, linguists and managers to execute deliverables; represent corpus creation efforts to external researchers and sponsors.

        LEAD ANNOTATOR (#100728920)

    Perform linguistic annotation on English text, speech and video data; recruit, train and supervise teams of annotators for multiple tasks and languages; define, test and document procedural approaches to linguistic annotation; perform quality control on annotated data.

For further information on the duties and qualifications for these positions, or to apply online, please visit https://jobs.hr.upenn.edu/ and search postings for the reference numbers indicated above.

Penn offers an excellent benefits package including medical/dental, retirement plans, tuition assistance and a minimum of three weeks of paid vacation per year. The University of Pennsylvania is an affirmative action/equal opportunity employer. All positions are contingent upon grant funding.