Introduction
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple
Languages, Linguistic Data Consortium (LDC) catalog number LDC2011T01 and ISBN
1-58563-572-3, is a subset of OntoNotes
Release 2.0 (LDC2008T04) used in SemEval-2010
Task 1, Coreference Resolution in Multiple Languages. OntoNotes Release
2.0 consists of roughly 500,000 words of English broadcast and newswire data
annotated with structural information (syntax and predicate argument structure)
and shallow semantics (word sense linked to an ontology and coreference). This
SemEval-2010 Task 1 release contains approximately 120,000 words extracted from
the OntoNotes corpus and formatted for the SemEval task.
SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational
semantic analysis systems. The goal of SemEval-2010 Task 1 was to evaluate and
compare automatic coreference resolution systems for six languages (Catalan,
Dutch, English, German, Italian and Spanish) in four evaluation settings using
four metrics. Further information about Task 1 can be found on the task
description website. The task organizers included researchers from Universitat
de Barcelona (Spain), Universitat Politècnica de Catalunya (Spain), University
of Essex (United Kingdom), Università di Trento (Italy), Hogeschool Gent (Belgium),
University of Tübingen (Germany) and Stanford University (USA).
Data
The data is divided into three sets: the development
set (*/data/en.devel.txt) which contains 39 documents, 741 sentences and 17,044
tokens; the training set (*/data/en.train.txt) which contains 229 documents, 3,648
sentences and 79,060 tokens; and the test set (*/data/en.test.txt) which contains
85 documents, 1,141 sentences and 24,206 tokens. The complete material for training
systems is the sum of the development and training sets. Details of the SemEval
task formatting applied to the data can be found in the documentation file, en.info.txt.
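As a quick check of the figures above, the short sketch below sums the stated per-set counts to give the size of the complete training material (development plus training) and of the whole release. It works only from the numbers quoted in this section and does not read the data files; the names SPLITS and combine are illustrative and not part of the release.

```python
# Counts as stated in the Data section: documents, sentences, tokens per set.
SPLITS = {
    "devel": {"documents": 39, "sentences": 741, "tokens": 17044},
    "train": {"documents": 229, "sentences": 3648, "tokens": 79060},
    "test": {"documents": 85, "sentences": 1141, "tokens": 24206},
}


def combine(*names):
    """Sum the per-set counts for the named sets (hypothetical helper)."""
    totals = {"documents": 0, "sentences": 0, "tokens": 0}
    for name in names:
        for key, value in SPLITS[name].items():
            totals[key] += value
    return totals


# Complete training material = development set + training set.
full_training = combine("devel", "train")
# All three sets together, for comparison with the "approximately
# 120,000 words" figure given in the Introduction.
whole_release = combine("devel", "train", "test")

print(full_training)
print(whole_release)
```

Under these counts the complete training material comes to 268 documents, 4,389 sentences and 96,104 tokens, and the three sets together total 120,310 tokens, consistent with the approximate size given in the Introduction.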
Scorer
The official scorer is available from the task
download page.
Updates
An update was issued for this corpus on March 30, 2012. It fixed a bug that caused one annotation error in every document. All data downloaded after this date is the corrected release. Contact ldc@ldc.upenn.edu if you have any
questions.
Samples
For an example of the data in this publication, please review this text file excerpt.
Content Copyright
Portions © 2000-2001 American Broadcasting Company, © 2000-2001
Cable News Network, LP, LLP, © 1989 Dow Jones & Company, Inc., ©
2000-2001 National Broadcasting Company, Inc., © 2000-2001 Public Radio
International, © 1995, 2005, 2006, 2007, 2008, 2011 Trustees of the University
of Pennsylvania
The World is a co-production of Public Radio International and the British
Broadcasting Corporation and is produced at WGBH Boston.