September 2010 Newsletter

Friday, September 17, 2010

New Corpora

Indian Language Part-of-Speech Tagset: Bengali

Message Understanding Conference 7 Timed (MUC7_T)


Free Copies of OntoNotes Available

LDC is pleased to announce that the OntoNotes data sets are now available at no cost. The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology and coreference).

OntoNotes builds on and extends two time-tested resources, the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include coreference as well as word sense disambiguation for verbs and some nouns, with many of the word senses connected to an ontology. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic, over five years.

LDC currently offers three versions of OntoNotes:

LDC2007T21 OntoNotes Release 1.0:  contains 400k words of Chinese newswire data and 300k words of English newswire data

LDC2008T04 OntoNotes Release 2.0:  adds 274k words of Chinese broadcast news data and 200k words of English broadcast news data to Release 1.0

LDC2009T24 OntoNotes Release 3.0:  adds English and Chinese broadcast conversation data to Release 2.0. This release includes 250k words of English newswire data, 200k words of English broadcast news data, 200k words of English broadcast conversation material, 250k words of Chinese newswire data, 250k words of Chinese broadcast news material, 150k words of Chinese broadcast conversation data, and 200k words of Arabic newswire material.

All OntoNotes releases are distributed on one DVD and are subject to shipping and handling fees.  In addition to OntoNotes, LDC distributes a wide range of free databases.  These include version 1.0 of the Buckwalter Arabic Morphological Analyzer, TimeBank, FactBank, and data sponsored by the TalkBank project.  For further information, please visit our What's New! What's Free! Archive.

LDC Data Scholarship Program Update

LDC is excited to announce that we've received many strong applications for our Fall 2010 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. LDC will provide information on our scholarship winners in our October newsletter. The next program cycle is scheduled for the Spring 2011 semester.

LDC at Interspeech 2010, Makuhari, Japan, September 27-30, 2010

LDC will soon be traveling to exhibit at Interspeech 2010 in Makuhari, Japan. We are very enthusiastic about this opportunity to meet members of the speech research community there. Please stop by booth #27 to say hi and to try your luck at scoring an exciting giveaway! We hope to see you there!

Interspeech 2010’s central theme is ‘Spoken Language Processing for All’.