Introduction
ACE 2005 English SpatialML Annotations Version 2, Linguistic Data Consortium
(LDC) catalog number LDC2011T02 and ISBN 1-58563-573-1, was developed by researchers
at The MITRE Corporation and applies SpatialML
tags to the English newswire and broadcast training data annotated for entities,
relations and events in
ACE 2005 Multilingual Training Corpus LDC2006T06. This second version eliminates
a number of annotation inconsistencies and errors identified in ACE
2005 English SpatialML Annotations LDC2008T03. In addition, the SpatialML
annotation schema has been updated from version 2.0 to version 3.0.1. The revised
annotation guidelines are included in this release.
The ACE (Automatic Content Extraction) program focused on developing automatic
content extraction technology to support automatic processing of human language
in text form, specifically entities, values, temporal expressions, relations
and events. SpatialML is a mark-up language for representing spatial expressions
in natural language documents. It is intended to emulate earlier progress on
time expressions, such as TIMEX2, TimeML, and
the 2005 ACE guidelines.
SpatialML includes syntax for marking up PLACEs mentioned in text and for linking
them to data from gazetteers and other databases. LINKs are used to express
relations between places, and RLINKs to capture trajectories for relative locations.
To the extent possible, SpatialML leverages ISO and other standards with the
goal of making the scheme compatible with existing and future corpora. SpatialML
goes beyond these schemes, however, in terms of providing a richer markup for
natural language that includes semantic features and relationships that allow
mapping to existing resources such as gazetteers. Such markup can be useful
for disambiguation, integration with mapping services and spatial reasoning.
Data
This corpus contains 210065 total words and 17821 unique words. Counts of
unique words can be found in doc/ldc_wordcount.csv, which includes all words
that are not part of XML markup (e.g., without tag names, attribute names or
values). Unique words are counted by comparing case-insensitive transformations
with preceding and trailing punctuation stripped off. Words consisting
solely of punctuation are discarded.
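The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not the script used to produce doc/ldc_wordcount.csv: it assumes the XML markup has already been removed, that words are separated by whitespace, and that "punctuation" means the ASCII set in `string.punctuation`. The function names and the sample sentence are made up for the example.

```python
import string
from collections import Counter


def normalize(token):
    """Strip leading/trailing punctuation, then lowercase (assumed ASCII punctuation)."""
    return token.strip(string.punctuation).lower()


def count_words(text):
    """Count normalized words; tokens made only of punctuation are discarded."""
    counts = Counter()
    for token in text.split():  # assumption: whitespace tokenization
        word = normalize(token)
        if word:  # empty string means the token was punctuation only
            counts[word] += 1
    return counts


# Invented sample text, not taken from the corpus.
text = 'The city of Boston, "the Hub", is in Massachusetts -- the state.'
counts = count_words(text)
total_words = sum(counts.values())
unique_words = len(counts)
print(total_words, unique_words)
```

Under these assumptions, "The", "the and the all collapse to one entry, and the stand-alone "--" is dropped from both the total and the unique count.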
The principal change in the annotation schema is that PATH has
been generalized to RLINK for relative link. At the top level, there
is now a version attribute on the root SpatialML tag to capture which version
of SpatialML was used. A number of smaller changes have been made to the annotation
specification; these are listed in Section 2 of the updated guidelines.
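As a rough illustration of how the in-line markup might be inspected, the following Python sketch reads the version attribute from the root SpatialML tag and tallies PLACE, LINK and RLINK elements. The sample document is invented for this example and is not taken from the corpus; the `id` attributes and their values are hypothetical, and the annotation guidelines remain the authority on the actual attribute inventory.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; the "id" attributes and values are hypothetical.
SAMPLE = """<SpatialML version="3.0.1">
A rally was held in <PLACE id="p1">Boston</PLACE>, a city in
<PLACE id="p2">Massachusetts</PLACE>.
<LINK id="l1"/>
</SpatialML>"""

SPATIAL_TAGS = {"PLACE", "LINK", "RLINK"}


def summarize(xml_text):
    """Return (schema version, tag counts, list of (id, text) for each PLACE)."""
    root = ET.fromstring(xml_text)
    version = root.get("version")  # None if the attribute is absent
    counts = Counter(el.tag for el in root.iter() if el.tag in SPATIAL_TAGS)
    places = [
        (el.get("id"), "".join(el.itertext()).strip())
        for el in root.iter("PLACE")
    ]
    return version, counts, places


version, counts, places = summarize(SAMPLE)
print(version, dict(counts), places)
```

A file annotated under the earlier schema would presumably lack the version attribute, in which case `version` is `None` in this sketch.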
The files are provided in both in-line XML format and AIF format.
The gaz-deref files contain multiple gazetteer references when they exist
for a single location; these different gazrefs sometimes correspond to slightly
different latlongs. The sgm.dtd validated files do not contain document structure
tags (such as , ) that would prevent them from being validated
with the SpatialML DTD. These files total 22624650 bytes uncompressed.
Updates
Additional information, updates, and bug fixes may be available in the
LDC catalog entry for this corpus at
LDC2011T02.
Content Copyright
Portions © 2003 Agence France Presse, © 2003 The Associated Press,
© 2003 Cable News Network, LP, LLLP, © 2007, 2010 The MITRE Corporation,
© 2003 New York Times, © 2003 Xinhua News Agency, © 2003, 2005,
2006, 2008, 2011 Trustees of the University of Pennsylvania