Introduction
REFLEX Entity Translation Training/DevTest, Linguistic Data Consortium (LDC)
catalog number LDC2009T11 and ISBN 1-58563-514-6, was developed by the LDC for
the Automatic Content Extraction
(ACE) program. This release constitutes the complete set of training data
and development test data for the 2007
REFLEX Entity Translation evaluation sponsored by the National Institute
of Standards and Technology (NIST) and consists of approximately 67.5k words
of newswire and weblog text for each of three languages: English, Chinese and
Arabic. The data set is made up of 22.5k words of English data, 22.5k words
of Chinese data, and 22.5k words of Arabic data, each translated into the
other two languages and annotated for entities and for TIMEX2 extents and normalization.
Entity Annotation. The annotations identify seven types of entities:
Person, Organization, Location, Facility, Weapon, Vehicle and GeoPolitical Entity.
Each type is further divided into subtypes (for instance, Person subtypes
include Individual, Group and Indefinite). Annotators tagged all mentions of
each entity within a document, whether named, nominal or pronominal. For every
mention, the annotator identified the maximal extent of the string that represents
the entity and labeled the head of each mention. Nested mentions were also captured.
Each entity was classified according to its type and subtype. Each entity mention
was further tagged according to its class, such as specific, generic, attributive,
negatively quantified or underspecified. Annotators also reviewed the entire
document to group mentions of the same entity together, and they labeled cases
of metonymy, where the name of one entity is used to refer to another entity
(or entities) related to it.
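As a rough illustration of the annotation layers described above, the following Python sketch models an entity with its mentions. It is not the release's file format: the class names, field names, helper functions, character-offset convention and sample sentence are all hypothetical, and the "Nation" subtype label is an assumption not taken from this description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed convention)

MENTION_LEVELS = {"named", "nominal", "pronominal"}


@dataclass
class Mention:
    extent: Span                      # maximal extent of the string representing the entity
    head: Span                        # head of the mention
    level: str                        # named, nominal or pronominal
    mention_class: str = "specific"   # e.g. specific, generic, attributive

    def __post_init__(self) -> None:
        if self.level not in MENTION_LEVELS:
            raise ValueError(f"unknown mention level: {self.level}")
        # The head must lie inside the maximal extent.
        if not (self.extent[0] <= self.head[0] <= self.head[1] <= self.extent[1]):
            raise ValueError("head must fall within extent")


@dataclass
class Entity:
    entity_type: str
    subtype: str
    mentions: List[Mention] = field(default_factory=list)


def span_of(text: str, phrase: str) -> Span:
    """Return the offsets of the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def cut(text: str, span: Span) -> str:
    """Return the substring covered by span."""
    return text[span[0]:span[1]]


text = "The president of France said she would visit."

# One Person entity with two coreferent mentions: a nominal and a pronoun.
leader = Entity("Person", "Individual", [
    Mention(span_of(text, "The president of France"), span_of(text, "president"), "nominal"),
    Mention(span_of(text, "she"), span_of(text, "she"), "pronominal"),
])

# A nested mention: "France" sits inside the extent of the first mention above.
# The subtype label "Nation" is an illustrative assumption.
country = Entity("GeoPolitical Entity", "Nation", [
    Mention(span_of(text, "France"), span_of(text, "France"), "named"),
])

for m in leader.mentions:
    print(m.level, repr(cut(text, m.extent)), "head:", repr(cut(text, m.head)))
```

Grouping mentions under a single Entity object mirrors the document-level coreference step, while keeping extent and head as separate spans allows a nested mention to be recorded independently of the mention that contains it.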
TIMEX2 Annotation. TIMEX2 annotation
of events and temporal relations fulfills two objectives. The first is the interpretation
of expressions that refer to time. Such expressions tell when something happened,
how long something lasted, or how often something occurs, and they
often require knowledge of the temporal context to be understood.
The second objective is the normalization
of temporal expressions, which facilitates interoperability between systems.
Problems occur, for example, when a programmer in France encodes "October
sixteenth 1962" as "1962.10.16" and one in the U.S. encodes
it as "10/16/1962". It will appear as if two different dates are
being referenced. The standards presented here require
that the same meaning always be encoded in the same way.
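The interoperability problem in the example above can be sketched in a few lines of Python. This is only an illustration of the idea of normalization, not the TIMEX2 procedure itself: the function name, the list of accepted layouts and the choice of a YYYY-MM-DD target form are assumptions made for the sketch.

```python
from datetime import datetime

# Input layouts taken from the two encodings quoted in the text.
# The list is illustrative, not exhaustive.
KNOWN_LAYOUTS = ("%Y.%m.%d", "%m/%d/%Y")


def normalize_date(raw: str) -> str:
    """Map a date written in any known layout to one canonical string."""
    for layout in KNOWN_LAYOUTS:
        try:
            return datetime.strptime(raw, layout).date().isoformat()
        except ValueError:
            continue  # this layout did not match; try the next one
    raise ValueError(f"unrecognized date layout: {raw!r}")


first = normalize_date("1962.10.16")
second = normalize_date("10/16/1962")
print(first, second, first == second)
```

Once both encodings are mapped to the same canonical string, a plain equality check is enough to see that they refer to the same date.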
Content Copyright
Portions © 2000, 2003 Agence France-Presse, © 2003 The Associated Press,
© 2000 Al Hayat, © 2000, 2002 An Nahar, © 1994-1998, 2000, 2003 Xinhua
News Agency, © 1994-2009 Trustees of University of Pennsylvania