Introduction:
AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data
Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed
by LDC for the National Institute of Standards and Technology (NIST) AQUAINT
2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of
English news text from six distinct sources collected by LDC (Agence France
Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington
Post, New York Times and Xinhua News Agency) covering the period from October
2004 through March 2006. The AQUAINT-2 collection is the second part of a series
intended to provide data useful for developing, evaluating and testing information
extraction and retrieval systems. It follows the publication of The
AQUAINT Corpus of English News Text (LDC2002T31).
The AQUAINT (Advanced Question-Answering for Intelligence) program addresses interactivity with
scenarios or tasks. The scenario provides a context in which questions will be
asked and answered, and the task reflects the overall assignment. The program
is committed to solving a single problem: how to find topically relevant, semantically
related, timely information in massive amounts of data in diverse languages, formats,
and genres.
AQUAINT technology is advancing the development of components and functions
that allow users to pose a series of intertwined, complex questions and obtain
comprehensive answers in the context of broad information-gathering tasks. In
addition, while most information retrieval systems present only links to documents,
AQUAINT is producing technology that will present answers to the user's questions.
This question-answering technology is being developed with features for managing
semantic similarity, co-reference, event characterization, opinions, linguistic
and social and world inferencing, redundancy, deception, and missing or contradictory
information. In order to allow the analyst to guide the exploration in concert
with the machine, AQUAINT technology employs interactive question-answering,
the automatic suggestion of additional paths of exploration, and the inferencing
of the social context of the information search.
Data
AQUAINT-2 Information-Retrieval Text Research Collection is a subset of LDC's English
Gigaword Third Edition (LDC2007T07). The collection comprises approximately
2.5 GB of text (about 907K documents) spanning the time period October 2004 -
March 2006.
For each source, all of the usable data collected by LDC was processed into
a consistent XML format in which the stories for a given month are concatenated
in chronological order into a single "DOCSTREAM" element; each story
is a single "DOC" element within that stream and has a globally unique
"id" attribute.
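As a rough illustration of how such a file might be read, the sketch below streams through a "DOCSTREAM" and yields each "DOC" with its "id". Only the element names and the "id" attribute come from the description above; the sample id values, the story text, and the helper name iter_docs are hypothetical, and the real files may contain further markup inside each "DOC" that this sketch simply flattens to text.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature file. DOCSTREAM, DOC and the "id" attribute are taken
# from the format description; the id values and story text are invented.
SAMPLE = b"""<DOCSTREAM>
<DOC id="EXAMPLE_ENG_20041001.0001">First story text.</DOC>
<DOC id="EXAMPLE_ENG_20041001.0002">Second story text.</DOC>
</DOCSTREAM>"""


def iter_docs(source):
    """Yield (id, text) for each DOC element, one story at a time."""
    # iterparse reads incrementally, so a month-long stream need not be
    # held in memory as a single tree.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "DOC":
            yield elem.get("id"), "".join(elem.itertext()).strip()
            elem.clear()  # drop the story's content once it has been read


docs = list(iter_docs(io.BytesIO(SAMPLE)))
ids = [doc_id for doc_id, _ in docs]
assert len(ids) == len(set(ids))  # ids are described as globally unique
```

Passing a file path instead of the in-memory buffer should work the same way, since iterparse accepts either a filename or a file object.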
The collection consists of newswire data in English drawn from six
distinct sources, listed below in terms of their file name
designations and full names:
afp_eng | Agence France Presse
apw_eng | Associated Press
cna_eng | Central News Agency (Taiwan) English Service
ltw_eng | Los Angeles Times - Washington Post News Service
nyt_eng | New York Times
xin_eng | Xinhua News Agency (Beijing) English Service
Samples
For an example of the data in this publication, please examine this image of the XML.
Content Copyright:
Portions © 2004-2006 Agence France Presse, The Associated Press, Central
News Agency (Taiwan), Los Angeles Times-Washington Post News Service, Inc.,
New York Times, Xinhua News Agency, © 2004-2006, 2007, 2008 Trustees of
the University of Pennsylvania