


AQUAINT-2 Information-Retrieval Text Research Collection

Item Name: AQUAINT-2 Information-Retrieval Text Research Collection
Authors: Ellen Vorhees and David Graff
LDC Catalog No.: LDC2008T25
ISBN: 1-58563-494-8
Release Date: Dec 19, 2008
Data Type: text
Data Source(s): newswire
Project(s): AQUAINT
Application(s): information retrieval
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member fee: $0 for 2008 members
Non-member Fee: US $500.00
Reduced-License Fee: US $250.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Ellen Vorhees and David Graff. 2008. AQUAINT-2 Information-Retrieval Text Research Collection. Linguistic Data Consortium, Philadelphia.


Introduction:

AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for the AQUAINT 2007 Question-Answer (QA) track run by NIST (National Institute of Standards and Technology). It consists of approximately 2.5 GB of English news text collected by LDC from six distinct sources (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency), covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31).

The AQUAINT (Advanced Question-Answering for Intelligence) program addresses interactivity through scenarios and tasks. The scenario provides the context in which questions are asked and answered, and the task reflects the overall assignment. The program is committed to solving a single problem: how to find topically relevant, semantically related, timely information in massive amounts of data in diverse languages, formats, and genres.

AQUAINT technology is advancing the development of components and functions that allow users to pose a series of intertwined, complex questions and obtain comprehensive answers in the context of broad information-gathering tasks. In addition, while most information retrieval systems present only links to documents, AQUAINT is producing technology that presents answers to the user's questions. This question-answering technology is being developed with features for managing semantic similarity, co-reference, event characterization, opinions, linguistic, social, and world inferencing, redundancy, deception, and missing or contradictory information. To allow the analyst to guide the exploration in concert with the machine, AQUAINT technology employs interactive question-answering, the automatic suggestion of additional paths of exploration, and inference of the social context of the information search.

Data

AQUAINT-2 Information-Retrieval Text Research Collection is a subset of LDC's English Gigaword Third Edition (LDC2007T07). The collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period October 2004 - March 2006.

For each source, all of the usable data collected by LDC was processed into a consistent XML format. The stories for a given month are concatenated in chronological order into a single "DOCSTREAM" element. Each story is a single "DOC" element within that stream and has a globally unique "id" attribute.
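
As a rough illustration of the layout described above, the following Python sketch reads a miniature document stream and lists each story's id. Only the "DOCSTREAM" element, the "DOC" element, and the "id" attribute come from the description; the id values, the story text, and the helper name iter_docs are hypothetical, and the actual files may nest further elements inside each DOC.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature stream: the id values and story text are invented
# for illustration and do not come from the collection itself.
SAMPLE = """<DOCSTREAM>
<DOC id="EXAMPLE_ENG_20041001.0001">First story text.</DOC>
<DOC id="EXAMPLE_ENG_20041001.0002">Second story text.</DOC>
</DOCSTREAM>"""


def iter_docs(xml_text):
    """Yield (id, text) for each DOC element inside a DOCSTREAM."""
    root = ET.fromstring(xml_text)
    if root.tag != "DOCSTREAM":
        raise ValueError(f"expected DOCSTREAM root, got {root.tag!r}")
    for doc in root.iter("DOC"):
        # itertext() also gathers text from any nested child elements.
        text = "".join(doc.itertext()).strip()
        yield doc.get("id"), text


docs = list(iter_docs(SAMPLE))
for doc_id, text in docs:
    print(doc_id, text)
```

Because a real monthly file is likely to be large, a streaming parser such as xml.etree.ElementTree.iterparse may be a better fit than loading the whole stream at once.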

The collection consists of newswire data in English drawn from six distinct sources, listed below in terms of their file name designations and full names:
afp_eng Agence France Presse
apw_eng Associated Press
cna_eng Central News Agency (Taiwan) English Service
ltw_eng Los Angeles Times - Washington Post News Service
nyt_eng New York Times
xin_eng Xinhua News Agency (Beijing) English Service
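
The designations above lend themselves to a simple lookup table. The sketch below assumes that a file name begins with one of the six designations; the function name source_name and the example file name nyt_eng_200510.xml are hypothetical and are not taken from the collection's documentation.

```python
# File name designations and full names, as listed above.
SOURCES = {
    "afp_eng": "Agence France Presse",
    "apw_eng": "Associated Press",
    "cna_eng": "Central News Agency (Taiwan) English Service",
    "ltw_eng": "Los Angeles Times - Washington Post News Service",
    "nyt_eng": "New York Times",
    "xin_eng": "Xinhua News Agency (Beijing) English Service",
}


def source_name(file_name):
    """Return the full source name for a file name that starts with a designation."""
    lowered = file_name.lower()
    for designation, full_name in SOURCES.items():
        if lowered.startswith(designation):
            return full_name
    raise KeyError(f"no known source designation in {file_name!r}")


# Assumed file name pattern, used only for illustration.
print(source_name("nyt_eng_200510.xml"))
```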

Samples

For an example of the data in this publication, please examine this image of the XML.

Content Copyright:

Portions © 2004-2006 Agence France Presse, The Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post News Service, Inc., New York Times, Xinhua News Agency, © 2004-2006, 2007, 2008 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.