Introduction
Message Understanding Conference 7 Timed (MUC7_T), Linguistic Data
Consortium (LDC) catalog number LDC2010T15 and ISBN 1-58563-560-X,
was developed by researchers at the Jena University Language &
Information Engineering (JULIE) Lab,
Friedrich-Schiller-Universität Jena, Germany. It is a
re-annotation of a portion of the
MUC7
corpus (Linguistic Data Consortium, LDC2001T02), which consists of
New York Times news stories annotated for use in the Message
Understanding Conference 7 (MUC7) evaluation. The series of MUC
evaluations in the 1990s focused on emerging information extraction
technologies. Further information about NIST's MUC7 evaluation can be
found on the MUC project website.
MUC7_T consists of 100 articles from the MUC7 corpus training set
reannotated for named entities (persons, locations and organizations)
with a time stamp indicating the time measured for the linguistic
decision making process. The corpus was developed for two principal
purposes: for use in evaluations of selective sampling strategies,
such as Active Learning; and to create predictive models for
annotation costs. The annotation was performed by two advanced
students of linguistics with good English language skills who
followed the original guidelines of the MUC7 named entity task
(which can be found in the online documentation for the MUC7 corpus).
Data
The data is stored in XML format. There is an element anno_example
for each annotation example that has the original MUC7 document as
text context. The MUC7 document was tokenized using the Stanford
Tokenizer, with white space marking token boundaries. The tokenizer
is part of the Stanford Parser package, which can be obtained from
the Stanford Natural Language Processing Group. The following attributes
are used for the element anno_example:
Attribute | Explanation
anno_time | The time it took to annotate the annotation unit of this annotation example (time in milliseconds).
anno_unit_tokens | All tokens of the annotation unit.
anno_unit_offset | Offsets for the tokens of the annotation unit relative to all tokens in the annotation example.
anno_unit_labels | Labels for the tokens of the annotation unit (these labels are taken from MUC7).
doc_id | ID of the document of the annotation example.
sent_id | ID of the sentence of the annotation example.
anno_unit_id | ID of the unit of the annotation example.
muc7_org_filename | The name of the original MUC7 document from which this annotation example is taken.
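For users who do not work with the bundled Java API, the attributes above can also be read with a generic XML parser. The following is a minimal Python sketch, not part of the corpus tools. Several details are assumptions made for illustration only: the wrapper element name (examples), every attribute value in SAMPLE, the label string, the encoding of list-valued attributes as whitespace-separated strings, and the use of 0-based offsets. Check these against the DTD and the detailed documentation shipped with the corpus before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: the <examples> wrapper, all attribute values, and
# the label "ORGANIZATION" are invented for illustration. Only the element
# name anno_example and the attribute names come from the corpus description.
SAMPLE = """<examples>
  <anno_example anno_time="2500"
                anno_unit_tokens="New York Times"
                anno_unit_offset="4 5 6"
                anno_unit_labels="ORGANIZATION ORGANIZATION ORGANIZATION"
                doc_id="d1" sent_id="s2" anno_unit_id="u1"
                muc7_org_filename="example.sgm">The paper , the New York Times , said .</anno_example>
</examples>"""


def read_examples(xml_text):
    """Return one dict per anno_example element found in xml_text."""
    root = ET.fromstring(xml_text)
    records = []
    for el in root.iter("anno_example"):
        records.append({
            # anno_time is documented as milliseconds.
            "anno_time_ms": int(el.get("anno_time")),
            # Assumption: list-valued attributes are whitespace-separated.
            "tokens": el.get("anno_unit_tokens").split(),
            "offsets": [int(o) for o in el.get("anno_unit_offset").split()],
            "labels": el.get("anno_unit_labels").split(),
            "doc_id": el.get("doc_id"),
            "sent_id": el.get("sent_id"),
            "anno_unit_id": el.get("anno_unit_id"),
            "muc7_org_filename": el.get("muc7_org_filename"),
            # The element text is the tokenized context; white space
            # marks token boundaries.
            "context": (el.text or "").split(),
        })
    return records


for rec in read_examples(SAMPLE):
    seconds = rec["anno_time_ms"] / 1000
    print(rec["doc_id"], rec["anno_unit_id"], rec["tokens"], f"{seconds:.1f} s")
```

Keeping each annotation example as a flat record makes it straightforward to aggregate anno_time by document, sentence, or annotation unit when studying annotation costs.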
Directory Structure
The directory structure of the corpus is as follows:
data: This subdirectory contains the MUC7_T
data; the data for annotator A and B are in separate folders. For each
annotator, there is a version of MUC7_T with CNP-level and with
sentence-level annotations.
docs: This subdirectory contains detailed
documentation as well as publications describing applications of
MUC7_T. There is also a small JavaDoc for the Java tools (see the
tools subdirectory below).
dtd: This subdirectory contains the
Document Type Definition (DTD) for the data files.
tools: This subdirectory contains a small Java API
which allows users to read the MUC7_T XML data so that each annotation
example is represented by a Java object. The API includes the source
code and a jar package. The source code has been tested with Java 1.5
and Java 1.6.
Updates
Additional information, updates, and bug fixes may be available in the
LDC catalog entry for this corpus, LDC2010T15.
Samples
The following XML excerpts are representative of the data in this corpus:
Content Copyright
Portions © 1996 New York Times, © 2001, 2010 Trustees of
the University of Pennsylvania