Introduction
The HARD 2004 Text Corpus was produced by the Linguistic Data Consortium
(LDC), with catalog number LDC2005T28 and ISBN 1-58563-372-0.
This corpus contains source data for the 2004 TREC HARD (High Accuracy
Retrieval from Documents) Evaluation. HARD 2004 was a track within the
NIST Text REtrieval Conference (TREC), with the objective of achieving
high accuracy retrieval from documents by leveraging additional
information about the searcher and/or the search context, through
techniques like passage retrieval and the use of targeted interaction with
the searcher. This corpus was previously distributed to HARD
participants as LDC2004E30. The topics and annotations that correspond to
this release are distributed as LDC2005T29, HARD 2004 Topics and
Annotations. This corpus was created with support from the DARPA TIDES
Program and LDC.
Data
The corpus comprises eight English newswire and web text sources
covering January through December 2003. The sources are:
AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English
Volume of data for each source appears in the table below:
Source      Stories    Total Tokens    Avg. Tokens/Story
----------------------------------------------------------
AFE:        226,515      71,829,978                  317
APE:        237,067      93,294,584                  393
CNE:          3,674         797,194                  217
LAT:         18,287      12,576,721                  687
NYT:         28,190      16,673,028                  591
SLN:          3,321       4,710,500                1,418
UME:          2,607         782,064                  299
XIE:        117,854      24,016,670                  203
Total:      637,515     224,680,739
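As a quick sanity check on the figures above, the following Python sketch re-adds the per-source counts and recomputes tokens per story. The numbers are copied from the table; the names VOLUMES, check_totals, and average_gaps are illustrative and not part of the corpus distribution. How the published averages were rounded is not stated, so the sketch only measures how far each one lies from the plain quotient.

```python
# Figures copied from the volume table:
# source -> (stories, total tokens, published average tokens per story)
VOLUMES = {
    "AFE": (226_515, 71_829_978, 317),
    "APE": (237_067, 93_294_584, 393),
    "CNE": (3_674, 797_194, 217),
    "LAT": (18_287, 12_576_721, 687),
    "NYT": (28_190, 16_673_028, 591),
    "SLN": (3_321, 4_710_500, 1_418),
    "UME": (2_607, 782_064, 299),
    "XIE": (117_854, 24_016_670, 203),
}

# Totals as printed in the last row of the table.
PUBLISHED_TOTAL = (637_515, 224_680_739)


def check_totals(volumes):
    """Return (total stories, total tokens) summed over all sources."""
    stories = sum(s for s, _, _ in volumes.values())
    tokens = sum(t for _, t, _ in volumes.values())
    return stories, tokens


def average_gaps(volumes):
    """Return |tokens/stories - published average| for each source."""
    return {src: abs(t / s - avg) for src, (s, t, avg) in volumes.items()}


if __name__ == "__main__":
    print("totals match:", check_totals(VOLUMES) == PUBLISHED_TOTAL)
    for src, gap in average_gaps(VOLUMES).items():
        print(f"{src}: gap from published average = {gap:.2f}")
```

The per-source counts sum exactly to the published totals, and each published average lies within one token of the recomputed quotient.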
Files are organized by source on a daily basis. Each file contains
multiple documents identified by unique document IDs, in the form
"SRCyyyymmdd.nnnn", where 'nnnn' is a sequential number starting from
"0001" for each source/day. In addition, each document has some or all
of the following components:
- Keyword (optional), surrounded by tags
- Date/time (optional), surrounded by tags
- Headline, surrounded by tags
- Main part, surrounded by tags. Tags are used within this part to
  identify paragraph boundaries.
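The document ID format described above ("SRCyyyymmdd.nnnn") can be split into its parts with a short Python sketch. This is an illustration, not a tool shipped with the corpus: the names parse_doc_id and DocId are hypothetical, the sample ID below is made up, and the sketch assumes that "SRC" is one of the eight three-letter source codes listed earlier and that the sequence number is exactly four digits.

```python
import re
from datetime import date
from typing import NamedTuple

# Source codes as listed in the Data section.
SOURCES = {"AFE", "APE", "CNE", "LAT", "NYT", "SLN", "UME", "XIE"}

# Assumed layout: three capital letters, yyyymmdd, a dot, four digits.
DOC_ID = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


class DocId(NamedTuple):
    source: str  # three-letter source code, e.g. "NYT"
    day: date    # publication day encoded in the ID
    seq: int     # sequential number within that source/day, from 1


def parse_doc_id(doc_id: str) -> DocId:
    """Split an ID of the form SRCyyyymmdd.nnnn into its components."""
    match = DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not in SRCyyyymmdd.nnnn form: {doc_id!r}")
    source, year, month, day, seq = match.groups()
    if source not in SOURCES:
        raise ValueError(f"unknown source code: {source!r}")
    # date() raises ValueError for impossible dates such as month 13.
    return DocId(source, date(int(year), int(month), int(day)), int(seq))


if __name__ == "__main__":
    # Made-up ID used only to exercise the parser.
    print(parse_doc_id("NYT20030115.0042"))
```

Returning a named tuple keeps the source, day, and sequence number available as separate fields, which makes it straightforward to group documents by source or by day.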
For more information please visit the HARD Project
website.
Samples
For an example of the data in this corpus, please review this sample.
Content Copyright
Portions © 2003 Agence France Presse, © 2003 The Associated Press,
© 2003 Central News Agency Taiwan, © 2003 Los Angeles Times-Washington Post
News Service, Inc., © 2003 The New York Times, © 2003 Salon.com,
© 2003 Ummah Press Service, © 2003 Xinhua News Agency,
© 2005 Trustees of the University of Pennsylvania