February 2011 Newsletter
New Corpora
Indian Language Part-of-Speech Tagset: Sanskrit
OntoNotes 4.0
Announcements
Publications Pipeline for MY2011
LDC is pleased to provide the following information on our planned releases for Membership Year 2011 (MY2011). We would also like to remind our data users that there is still time to save on membership fees for MY2011, but time is running out! Any organization that joins or renews membership for 2011 through Tuesday, March 1, 2011, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2010 can receive a 10% discount on fees provided they renew prior to March 1, 2011.
Many publications for MY2011 are still in development, but we plan to release updates to some of our popular Gigaword corpora as well as new speech corpora. Please note that the list is tentative and subject to modifications. Our planned publications for this year include:
- 2005 NIST Speaker Recognition Evaluation ~ the 2005 data from the ongoing series of yearly evaluations conducted by NIST (National Institute of Standards and Technology). These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition.
- Arabic Gigaword Fifth Edition ~ LDC’s Arabic newswire collection from 2009 and 2010 as well as the contents of Arabic Gigaword Fourth Edition (LDC2009T30). The news sources represented include Agence France Presse, An Nahar, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, Assabah, Al-Ahram, Ummah Press and Xinhua News Agency.
- Chinese Gigaword Fifth Edition ~ LDC’s Chinese newswire collection from 2009 and 2010 as well as the contents of Chinese Gigaword Fourth Edition (LDC2009T27). The news sources represented include Agence France Presse, Central News Agency (Taiwan), Xinhua News Agency, Zaobao, People’s Liberation Army Daily, People’s Daily, Guangming Daily and China News Service.
- Digital Archive of Southern Speech ~ a geographical sampling of colloquial speech in the Southern United States. Samples of speech were collected through interviews of single subjects speaking on a variety of common topics like family, the weather, household articles and activities, agriculture, and social connections. Speakers range in age from 15 to 90, with an average age of 61.
- English Gigaword Fifth Edition ~ LDC’s English newswire collection from 2009 and 2010 as well as the contents of English Gigaword Fourth Edition (LDC2009T13). The news sources represented include Agence France Presse, Associated Press, Central News Agency (Taiwan), NY Times, Washington Post, Los Angeles Times and Xinhua News Agency.
- MALACH English ~ over 300 hours of English audio recordings of interviews conducted under the auspices of the USC Shoah Foundation Institute for Visual History and Education, along with associated transcripts produced as part of the Multilingual Access to Large Spoken ArCHives (MALACH) project. The data was collected using table microphones. Recordings are 2-channel, 128 kbps, 44.1 kHz mp2 files, with a different speaker generally predominant in each channel.
2011 Subscription Members are automatically sent all MY2011 data as it is released. 2011 Standard Members are entitled to request 16 corpora for free from MY2011. Non-members may license most data for research use.
LDC Data Scholarship Update
LDC received many solid applications for the second installment of the LDC Data Scholarship Program! The program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. Data use proposals covered a range of research interests, from entity tagging to parsing to automatic speech recognition, which made for a competitive selection process.
We are reviewing applications and will announce our winners soon.