Introduction
OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number LDC2011T03
and ISBN 1-58563-574-X, was developed as part of the OntoNotes project, a collaborative
effort between BBN Technologies, the University of Colorado, the University
of Pennsylvania and the University of Southern California's Information Sciences Institute.
The goal of the project is to annotate a large corpus comprising various genres
of text (news, conversational telephone speech, weblogs, usenet newsgroups,
broadcast, talk shows) in three languages (English, Chinese, and Arabic) with
structural information (syntax and predicate argument structure) and shallow
semantics (word sense linked to an ontology and coreference). OntoNotes Release
4.0 is supported by the Defense Advanced Research Projects Agency, GALE Program
Contract No. HR0011-06-C-0022.
OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes
Release 1.0 LDC2007T21,
OntoNotes Release 2.0 LDC2008T04 and OntoNotes
Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast conversation
and web data in English and Chinese and newswire data in Arabic. This cumulative
publication consists of 2.4 million words as follows: 300k words of Arabic
newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news,
150k words of Chinese broadcast conversation and 150k words of Chinese web text;
and 600k words of English newswire, 200k words of English broadcast news, 200k
words of English broadcast conversation and 300k words of English web text.
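The breakdown above can be checked with a short tally. This is only an illustrative sketch: the figures are copied from the paragraph above, and the names `WORD_COUNTS_K` and `totals_by_language` are made up for this example and are not part of the release.

```python
# Word counts in thousands, copied from the breakdown in the text above.
# The dictionary layout and all names here are illustrative assumptions.
WORD_COUNTS_K = {
    "Arabic": {"newswire": 300},
    "Chinese": {
        "newswire": 250,
        "broadcast news": 250,
        "broadcast conversation": 150,
        "web text": 150,
    },
    "English": {
        "newswire": 600,
        "broadcast news": 200,
        "broadcast conversation": 200,
        "web text": 300,
    },
}


def totals_by_language(counts):
    """Sum the per-genre counts for each language."""
    return {language: sum(genres.values()) for language, genres in counts.items()}


per_language = totals_by_language(WORD_COUNTS_K)
grand_total_k = sum(per_language.values())

for language, total_k in per_language.items():
    print(f"{language}: {total_k}k words")
print(f"Total: {grand_total_k}k words")
```

The per-language sums come to 300k (Arabic), 800k (Chinese) and 1,300k (English), which together give the stated 2.4 million words.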
The OntoNotes project builds on two time-tested resources, following the
Penn Treebank for syntax and the Penn
PropBank for predicate-argument structure. Its semantic representation will
include word sense disambiguation for nouns and verbs, with each word sense
connected to an ontology, and coreference. The current goals call for annotation
of over a million words each of English and Chinese, and half a million words
of Arabic over five years.
Data
Documents describing the annotation guidelines
and the routines for deriving various views of the data from the database are
included in the documentation directory of this release. The annotation is provided
both in separate text files for each annotation layer (Treebank, PropBank, word
sense, etc.) and in the form of an integrated relational database (ontonotes-v4.0.sql.gz)
with a Python API to provide convenient cross-layer access.
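Before loading the database, it can be useful to see which tables the dump defines. The following is a minimal sketch using only the Python standard library, not the OntoNotes DB Tool API. It assumes that ontonotes-v4.0.sql.gz is a gzip-compressed plain-text SQL dump containing `CREATE TABLE` statements; the function name `list_tables` is hypothetical, and the actual table names are not documented here.

```python
import gzip
import re

# Matches the start of a CREATE TABLE statement and captures the table name.
# Assumption: the dump is plain SQL text with one CREATE TABLE per line start.
CREATE_TABLE = re.compile(
    r'^\s*CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[`"]?(\w+)',
    re.IGNORECASE,
)


def list_tables(dump_path):
    """Return the table names declared in a gzip-compressed SQL dump."""
    tables = []
    # Stream the file line by line so the whole dump is never held in memory.
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            match = CREATE_TABLE.match(line)
            if match:
                tables.append(match.group(1))
    return tables
```

Calling `list_tables("ontonotes-v4.0.sql.gz")` from the directory that holds the dump would return the declared table names in file order, assuming the dump matches the format described above.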
Tools
This release includes OntoNotes DB Tool v0.999 beta, the tool used to assemble
the database from the original annotation files. It can be found in the directory
ontonotes-db-tool-v0.999b. This tool can be used to derive various views of the
data from the database, and it provides an API for implementing new queries
or views. Licensing information for the OntoNotes DB Tool package is included
in its source directory.
Updates
Additional information, updates, and bug fixes may be available in the LDC catalog
entry for this corpus at LDC2011T03.
Sponsorship
This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-003. The content
of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
Content Copyright
Portions © 2006 Abu Dhabi TV, © 2006 Agence France Presse, ©
2006 Al-Ahram, © 2006 Al Alam News Channel, © 2006 Al Arabiya, ©
2006 Al Hayat, © 2006 Al Iraqiyah, © 2006 Al Quds-Al Arabi, ©
2006 Anhui TV, © 2002, 2006 An Nahar, © 2006 Asharq-al-Awsat, ©
2005 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System,
© 2000-2001, 2005-2006 China Central TV, © 2006 China Military Online,
© 2000-2001 China National Radio, © 2006 Chinanews.com, © 2000-2001
China Television System, © 1989 Dow Jones & Company, Inc., © 2006
Dubai TV, © 2006 Guangming Daily, © 2006 Kuwait TV, © 2005-2006
National Broadcasting Company, Inc., © 2006 New Tang Dynasty TV, ©
2006 Nile TV, © 2006 Oman TV, © 2006 PAC Ltd, © 2006 People's
Daily Online, © 2005-2006 Phoenix TV, © 2000-2001 Sinorama Magazine,
© 2006 Syria TV, © 1996-1998, 2006 Xinhua News Agency, © 2007,
2008, 2009, 2011 Trustees of the University of Pennsylvania
Contact: ldc@ldc.upenn.edu
© 2011 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.