Introduction
Datasets for Generic Relation Extraction (reACE) was developed at The
University of Edinburgh, Edinburgh, Scotland. It consists of English broadcast
news and newswire data originally annotated for the ACE
(Automatic Content Extraction) program to which the Edinburgh Regularized
ACE (reACE) mark-up has been applied.
The Edinburgh relation extraction (RE) task aims to identify useful information
in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and
to recode it in a format such as a relational database or RDF
triple store (a database for the storage and retrieval of Resource Description
Framework (RDF) metadata) that can be more effectively used for querying and
automated reasoning. A number of resources have been developed for training
and evaluation of automatic systems for RE in different domains. However, comparative
evaluation is impeded by the fact that these corpora use different markup formats
and different notions of what constitutes a relation.
reACE addresses this problem by converting the
data to a common document type using token standoff and adding detailed linguistic
markup, while preserving all information in the original annotation. The subsequent
reannotation process normalises the two data sets so that they comply with a
notion of relation that is intuitive, simple and informed by the semantic web.
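As a rough illustration of the recoding step described above, the two example relations from the text can be written as subject-predicate-object triples and queried by pattern. This is a minimal sketch in plain Python, not the reACE format or an RDF library: the `Triple` type, the `match` helper and the predicate names `worksFor` and `encodes` are all hypothetical names chosen for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted relation as a subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str


# Hypothetical predicate names; the text gives these relations only in prose.
triples = {
    Triple("PersonW", "worksFor", "OrganisationX"),
    Triple("GeneY", "encodes", "ProteinZ"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


# Query: which organisations does PersonW work for?
employers = [t.obj for t in match(triples, subject="PersonW", predicate="worksFor")]
print(employers)
```

Once relations are in this uniform shape, the same query works regardless of which corpus or domain a relation came from, which is the motivation for a shared notion of relation.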
The data in this corpus consists of newswire and broadcast news material from
ACE 2004 Multilingual Training Corpus (LDC2005T09) and ACE 2005 Multilingual
Training Corpus (LDC2006T06). This material has been standardised
for evaluation of multi-type RE across domains.
Complete documentation for this corpus is available at the publication provider's
website, Datasets for Generic Relation Extraction.
Data
Annotation includes (1) a version of the original data refactored to a common
XML document type, (2) linguistic information from
LT-TTT (a system for tokenizing text and adding markup) and MINIPAR
(an English parser), and (3) a normalised version of the original RE markup
that complies with a shared notion of what constitutes a relation across domains.
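The idea of token standoff, in which annotation layers point at token identifiers instead of being embedded in the text, can be sketched as follows. This is only an illustration under stated assumptions: the class names, identifier scheme (`t1`, `e1`) and type labels (`PER`, `ORG`, `works-for`) are hypothetical and do not reproduce the actual reACE XML document type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    id: str
    text: str


@dataclass(frozen=True)
class Entity:
    id: str
    type: str
    token_ids: tuple  # standoff references into the token layer


@dataclass(frozen=True)
class Relation:
    type: str
    arg1: str  # entity id of the first argument
    arg2: str  # entity id of the second argument


# Hypothetical layers for the sentence "PersonW works for OrganisationX".
tokens = [
    Token("t1", "PersonW"),
    Token("t2", "works"),
    Token("t3", "for"),
    Token("t4", "OrganisationX"),
]
entities = [
    Entity("e1", "PER", ("t1",)),
    Entity("e2", "ORG", ("t4",)),
]
relations = [Relation("works-for", "e1", "e2")]

# Index each layer by id so that references can be followed.
token_index = {t.id: t for t in tokens}
entity_index = {e.id: e for e in entities}


def entity_text(entity):
    """Recover an entity's surface string by following its token references."""
    return " ".join(token_index[tid].text for tid in entity.token_ids)


def resolve(relation):
    """Turn a relation over entity ids into (arg1 text, type, arg2 text)."""
    return (
        entity_text(entity_index[relation.arg1]),
        relation.type,
        entity_text(entity_index[relation.arg2]),
    )


resolved = [resolve(r) for r in relations]
print(resolved)
```

Because each layer only refers to identifiers in the layer below it, further annotation such as parser output can be added without altering the tokens or the existing markup.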
The data sources represented in the corpus were collected by LDC in 2000 and
2003 and consist of the following: ABC, Agence France Presse, Associated Press,
Cable News Network, MSNBC/NBC, New York Times, Public Radio International, Voice
of America and Xinhua News Agency.
Samples
For an example of the data contained in this corpus, please examine this sample file.
Content Copyright
Portions © 2000 American Broadcasting Corporation, © 2000, 2003 Cable
News Network, LP, LLP, © 2000 National Broadcasting Company, © 2000
New York Times, © 2000 Public Radio International, © 2000 The Associated
Press, © 2005, 2006, 2011 Trustees of the University of Pennsylvania