Introduction
The ACE (Automatic Content Extraction) program focuses on developing automatic
content extraction technology to support automatic processing of human language
in text form. The kind of information recognized and extracted from text includes
entities, values, temporal expressions, relations and events. SpatialML is a mark-up
language for representing spatial expressions in natural language documents. SpatialML's
focus is primarily on geography and culturally-relevant landmarks, rather than
biology, cosmology, geology, or other regions of the spatial language domain.
The goal is to allow for potentially better integration of text collections with
resources such as databases that provide spatial information about a domain, including
gazetteers, physical feature databases and mapping services. In ACE 2005 English
SpatialML Annotations, the authors applied SpatialML tags to the English training
data (originally annotated for entities, relations and events) in ACE
2005 Multilingual Training Corpus, LDC2006T06. (NOTE: The 2005 ACE training and evaluation data were distributed as e-corpora (LDC2005E18, LDC2005E23) to participants in the 2005 ACE evaluation. Some of the files in ACE 2005 English SpatialML Annotations may originate from one of those e-corpora rather than from LDC2006T06.)
The SpatialML annotation scheme is intended to emulate the progress made earlier
on the annotation of time expressions, as in TIMEX2, TimeML and the 2005 ACE
guidelines.
The main SpatialML tag is the PLACE tag. The central goal of SpatialML is to
map PLACE information in text to data from gazetteers and other databases to
the extent possible. Therefore, semantic attributes such as country abbreviations,
country subdivision and dependent area abbreviations (e.g., US states), and
geo-coordinates are used to help establish such a mapping. LINK and PATH tags
express relations between places, such as inclusion relations and trajectories
of various kinds. Information in the tag along with the tagged location string
should be sufficient to uniquely determine the mapping, when such a mapping
is possible. This also means that redundant information is not included in the
tag.
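As a rough illustration of how PLACE annotations might be read and turned into gazetteer lookup keys, the following Python sketch parses a small inline-tagged snippet. The snippet itself, the attribute names (id, type, country, state) and their values, and the helper functions extract_places and gazetteer_key are all assumptions made for this example; the authoritative attribute inventory is given in the SpatialML guidelines distributed with the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: the attribute names and values below are illustrative
# assumptions, not text copied from the corpus.
SAMPLE = (
    '<DOC>She flew from '
    '<PLACE id="p1" type="PPL" country="US" state="MA">Boston</PLACE> '
    'to <PLACE id="p2" type="COUNTRY" country="MX">Mexico</PLACE>.</DOC>'
)


def extract_places(xml_text):
    """Return one dict per PLACE element: the tagged string plus its attributes."""
    root = ET.fromstring(xml_text)
    places = []
    for elem in root.iter("PLACE"):
        record = {"text": (elem.text or "").strip()}
        record.update(elem.attrib)
        places.append(record)
    return places


def gazetteer_key(place):
    """Combine the tagged string with whatever disambiguating attributes exist.

    Attributes that are absent come back as None, reflecting the idea that the
    tag carries only the information needed to determine the mapping.
    """
    return (place["text"].lower(), place.get("country"), place.get("state"))


if __name__ == "__main__":
    for place in extract_places(SAMPLE):
        print(place["id"], gazetteer_key(place))
```

A key built this way could then be matched against a gazetteer table; how a real gazetteer is indexed is outside the scope of this sketch.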
To the extent possible, SpatialML leverages ISO and other standards towards
the goal of making the scheme compatible with existing and future corpora. The
SpatialML guidelines are compatible with existing guidelines for spatial annotation
and existing corpora within the ACE research program. In particular, the English
Annotation Guidelines for Entities (Version 5.6.6 2006.08.01) were exploited,
specifically the GPE, Location, and Facility entity tags, and the Physical relation
tags, all of which are mapped to SpatialML tags. Ideas were also borrowed from
the Toponym Resolution Markup Language of Leidner (2006), the research of Schilder
et al. (2004) and the annotation scheme in Garbin and Mani (2005). Information
recorded in the annotation is compatible with the feature types in the Alexandria
Digital Library. This corpus also leverages the integrated gazetteer database
(IGDB) of Mardis and Burger (2005). Last but not least, this annotation scheme
can be related to the Geography Markup Language (GML) defined by the Open Geospatial
Consortium (OGC), as well as Google Earth's Keyhole Markup Language (KML), to
express geographical features.
SpatialML goes beyond these schemes, however, in terms of providing a richer
markup for natural language that includes semantic features and relationships
that allow mapping to existing resources such as gazetteers. Such a markup can
be useful for (i) disambiguation, (ii) integration with mapping services, and
(iii) spatial reasoning. In relation to (iii), it is possible to use spatial
reasoning not only for integration with applications, but for better information
extraction, e.g., for disambiguating a place name based on the locations of
other place names in the document. SpatialML goes to some length to represent
topological relationships among places, derived from the RCC8 Calculus (Randell
et al. 1992, Cohn et al. 1997).
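To make the RCC8 relations concrete, the Python sketch below classifies the relation between two regions under a strong simplifying assumption: each region is a closed disc given as (x, y, radius) with a positive radius. The function name rcc8_discs, the disc model and the tolerance parameter are choices made for this example only. The labels are the standard RCC8 names (DC, EC, PO, EQ, TPP, NTPP, TPPi, NTPPi); the relation labels actually used in SpatialML tags are derived from RCC8 and may differ, so the guidelines should be consulted for those.

```python
import math


def rcc8_discs(a, b, tol=1e-9):
    """Return the RCC8 relation of disc a to disc b.

    Each disc is a tuple (x, y, radius) with radius > 0. This is a toy model:
    real place regions are not discs.
    """
    (ax, ay, ar), (bx, by, br) = a, b
    d = math.hypot(ax - bx, ay - by)  # distance between the centres

    if d > ar + br + tol:
        return "DC"      # disconnected
    if abs(d - (ar + br)) <= tol:
        return "EC"      # externally connected (touching from outside)
    if d <= tol and abs(ar - br) <= tol:
        return "EQ"      # same region
    if ar < br:
        if abs(d - (br - ar)) <= tol:
            return "TPP"     # a inside b, touching b's boundary
        if d < br - ar:
            return "NTPP"    # a strictly inside b
    elif ar > br:
        if abs(d - (ar - br)) <= tol:
            return "TPPi"    # b inside a, touching a's boundary
        if d < ar - br:
            return "NTPPi"   # b strictly inside a
    return "PO"          # partial overlap


if __name__ == "__main__":
    city = (0.0, 0.0, 1.0)
    state = (0.0, 0.0, 3.0)
    print(rcc8_discs(city, state))
```

In a reasoning component, relations of this kind could be checked for consistency across a document, for example to reject a candidate referent for a place name that would have to lie inside two disconnected regions.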
Additional information about SpatialML is contained in the paper "SpatialML:
Annotation Scheme for Marking Spatial Expressions in Natural Language,"
which is included in the online documentation for this corpus.
Please direct all questions about this corpus to Janet Hitzeman (hitz@mitre.org).
Samples
For an example of the data in the corpus, please examine this sample.
Content Copyright
Portions © 2003 Agence France-Presse, © 2003 The Associated Press,
© 2003 Cable News Network, LP, LLLP, © 2007 The MITRE Corporation,
© 2003 New York Times, © 2003 Xinhua News Agency, © 2003, 2005, 2006,
2008 Trustees of the University of Pennsylvania