Authors: Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Linnea Micciulla, Martha Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Olga Babko-Malaya, Eduard Hovy, Robert Belvin, Ann Houston
Introduction
Natural language applications like machine translation, question answering, and
summarization currently are forced to depend on impoverished text models like bags of
words or n-grams, while the decisions that they are making ought to be based on the
meanings of those words in context. That lack of semantics causes problems throughout
the applications. Misinterpreting the meaning of an ambiguous word results in failing to
extract data, incorrect alignments for translation, and ambiguous language models.
Incorrect coreference resolution results in missed information (because a connection is
not made) or incorrectly conflated information (due to false connections). Some richer
semantic representation is badly needed.
The OntoNotes project is a collaborative effort between BBN Technologies, the
University of Colorado, the University of Pennsylvania, and the University of Southern
California's Information Sciences Institute to produce such a resource. It aims to annotate
a large corpus comprising various genres of text (news, conversational telephone speech,
weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and
Arabic) with structural information (syntax and predicate argument structure) and
shallow semantics (word sense linked to an ontology and coreference). OntoNotes builds
on two time-tested resources, following the Penn Treebank for syntax and the Penn
PropBank for predicate-argument structure. Its semantic representation will include word
sense disambiguation for nouns and verbs, with each word sense connected to an
ontology, and coreference. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years.
The authors wish to make this resource available to the natural language research community so
that decoders for these phenomena can be trained to generate the same structure in new
documents. Lessons learned over the years have shown that the quality of annotation is
crucial if it is going to be used for training machine learning algorithms. Taking this cue,
we ensure that each layer of annotation in OntoNotes will have at least 90% inter-
annotator agreement. Our pilot studies have shown that predicate structure, word sense,
ontology linking, and coreference can all be annotated rapidly and with better than 90%
consistency.
Samples
The following screen captures provide examples of the data contained in this corpus.
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-C-0022. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
Content Copyright
Portions © 1989 Dow Jones & Company, Inc., © 1996-2001 Sinorama
Magazine, © 1994-1998 Xinhua News Agency, © 1995, 2005, 2006, 2007
Trustees of the University of Pennsylvania