Introduction
The OntoNotes project is a collaborative effort between BBN Technologies, the
University of Colorado, the University of Pennsylvania, and the University of
Southern California's Information Sciences Institute. The goal of the project
is to annotate a large corpus comprising various genres of text (news, conversational
telephone speech, weblogs, use net, broadcast, talk shows) in three languages
(English, Chinese, and Arabic) with structural information (syntax and predicate
argument structure) and shallow semantics (word sense linked to an ontology
and coreference). OntoNotes Release 2.0 is a continuation of the OntoNotes project
and is supported by the Defense Advanced Research Projects Agency, GALE Program
Contract No. HR0011-06-C-0022.
OntoNotes
Release 1.0 (LDC2007T21) contains 400k words of Chinese newswire data (from
Xinhua News Agency and Sinorama Magazine) and 300k words of English newswire
data (from the Wall Street Journal). OntoNotes Release 2.0 adds the following
to the corpus: 274k words of Chinese broadcast news data (from China Broadcating
System, China Central TV, China National Radio, China Television System and
Voice of America); and 200k words of English broadcast news data (from ABC,
CNN, NBC, Public Radio International and Voice of America).
Natural language applications like machine translation, question answering,
and summarization currently are forced to depend on impoverished text models
like bags of words or n-grams, while the decisions that they are making ought
to be based on the meanings of those words in context. That lack of semantics
causes problems throughout the applications. Misinterpreting the meaning of
an ambiguous word results in failing to extract data, incorrect alignments for
translation, and ambiguous language models. Incorrect coreference resolution
results in missed information (because a connection is not made) or incorrectly
conflated information (due to false connections). OntoNotes builds on two time-tested
resources, following the Penn Treebank for syntax and the Penn PropBank for
predicate-argument structure. Its semantic representation will include word
sense disambiguation for nouns and verbs, with each word sense connected to
an ontology, and coreference. The current goals call for annotation of over
a million words each of English and Chinese, and half a million words of Arabic
over five years.
The authors wish to make this resource available to the natural language research
community so that decoders for these phenomena can be trained to generate the
same structure in new documents. Lessons learned over the years have shown that
the quality of annotation is crucial if it is going to be used for training
machine learning algorithms. Taking this cue, each layer of annotation in OntoNotes
will have at least 90% inter-annotator agreement. Pilot studies have shown that
predicate structure, word sense, ontology linking, and coreference can all be
annotated rapidly and with better than 90% consistency.
Samples
For an example of the data in this corpus, please examine the following samples
Sponsorship
This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
Content Copyright
Portions © 2000-2001 American Broadcasting Company, © 2000-2001 Cable
News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001
China Central TV, © 2000-2001 China National Radio, © 2000-2001 China
Television System, © 1989 Dow Jones & Company, Inc., © 2000-2001
National Broadcasting, Company, Inc., © 2000-2001 Public Radio International,
© 1996-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, ©
1995, 2005, 2006, 2007, 2008 Trustees of the University of Pennsylvania
The World is a co-production of Public Radio International and the British
Broadcasting Corporation and is produced at WGBH Boston. |