


HARD 2004 Text

Item Name: HARD 2004 Text
Authors: Junbo Kong, David Graff, Kazuaki Maeda, and Stephanie Strassel
LDC Catalog No.: LDC2005T28
ISBN: 1-58563-372-0
Release Date: Dec 20, 2005
Data Type: text
Data Source(s): newswire
Application(s): automatic content extraction, information retrieval
Language(s): English
Language ID(s): ENG
Distribution: 1 DVD
Member Fee: $0 for 2005 members
Non-member Fee: US $1200.00
Reduced-License Fee: US $600.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Junbo Kong, et al. 2005. HARD 2004 Text. Linguistic Data Consortium, Philadelphia.

Introduction

The HARD 2004 Text Corpus was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2005T28 and its ISBN is 1-58563-372-0.

This corpus contains source data for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC). Its objective was to achieve high-accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques such as passage retrieval and targeted interaction with the searcher. This corpus was previously distributed to HARD participants as LDC2004E30. The topics and annotations that correspond to this release are distributed as LDC2005T29, HARD 2004 Topics and Annotations. This corpus was created with support from the DARPA TIDES Program and LDC.

Data

The corpus comprises eight English newswire and web text sources covering January through December 2003. The sources are:

AFE: Agence France Presse - English
APE: Associated Press Newswire
CNE: Central News Agency Taiwan - English
LAT: Los Angeles Times/Washington Post
NYT: New York Times
SLN: Salon.com
UME: Ummah Press - English
XIE: Xinhua News Agency - English

The volume of data for each source appears in the table below:

Source    Stories  Total Tokens  Avg. Tokens/Story
--------------------------------------------------
AFE:      226,515    71,829,978                317
APE:      237,067    93,294,584                393
CNE:        3,674       797,194                217
LAT:       18,287    12,576,721                687
NYT:       28,190    16,673,028                591
SLN:        3,321     4,710,500              1,418
UME:        2,607       782,064                299
XIE:      117,854    24,016,670                203

Total:    637,515   224,680,739
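
As a quick consistency check, the per-source figures in the table can be summed and compared with the Total row. The sketch below is only an illustration: the numbers are copied by hand from the table, and the variable names are arbitrary. It also prints tokens per story for each source; small differences from the table's last column may come from whatever rounding the original figures used.

```python
# (source, stories, total_tokens), copied by hand from the table above.
ROWS = [
    ("AFE", 226_515, 71_829_978),
    ("APE", 237_067, 93_294_584),
    ("CNE", 3_674, 797_194),
    ("LAT", 18_287, 12_576_721),
    ("NYT", 28_190, 16_673_028),
    ("SLN", 3_321, 4_710_500),
    ("UME", 2_607, 782_064),
    ("XIE", 117_854, 24_016_670),
]

# Column sums, to compare against the Total row.
total_stories = sum(stories for _, stories, _ in ROWS)
total_tokens = sum(tokens for _, _, tokens in ROWS)
print(total_stories, total_tokens)  # 637515 224680739

# Tokens per story for each source, to one decimal place.
for source, stories, tokens in ROWS:
    print(f"{source}: {tokens / stories:.1f} tokens/story")
```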

Files are organized by source on a daily basis. Each file contains multiple documents identified by unique document IDs, in the form "SRCyyyymmdd.nnnn", where 'nnnn' is a sequential number starting from "0001" for each source/day. In addition, each document has some or all of the following components:

- Keyword (optional), surrounded by tags
- Date/time (optional), surrounded by tags
- Headline, surrounded by tags
- Main part, surrounded by tags. Tags are used within this part to identify paragraph boundaries.
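
The document ID format "SRCyyyymmdd.nnnn" described above can be split into its parts with a short script. This is a minimal sketch, not part of the corpus tooling: the regular expression, the `DocId` class, and the `parse_doc_id` function are illustrative assumptions, and the example ID is hypothetical rather than taken from the data.

```python
import re
from dataclasses import dataclass
from datetime import date

# Three-letter source codes listed in the corpus description.
SOURCES = {"AFE", "APE", "CNE", "LAT", "NYT", "SLN", "UME", "XIE"}

# Assumed reading of "SRCyyyymmdd.nnnn": source code, date, 4-digit sequence.
DOC_ID_RE = re.compile(r"^([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})$")


@dataclass(frozen=True)
class DocId:
    source: str  # e.g. "AFE"
    day: date    # publication day encoded in the ID
    seq: int     # sequence number within that source and day


def parse_doc_id(doc_id: str) -> DocId:
    """Split a document ID into source, date, and sequence number."""
    match = DOC_ID_RE.match(doc_id)
    if match is None:
        raise ValueError(f"not in SRCyyyymmdd.nnnn form: {doc_id!r}")
    source, year, month, day, seq = match.groups()
    if source not in SOURCES:
        raise ValueError(f"unknown source code: {source!r}")
    if int(seq) < 1:
        # Sequence numbers start from "0001" for each source/day.
        raise ValueError(f"sequence number must be at least 0001: {doc_id!r}")
    # date() raises ValueError for impossible dates such as month 13.
    return DocId(source, date(int(year), int(month), int(day)), int(seq))


# Hypothetical ID, used only to show the call.
print(parse_doc_id("AFE20030101.0001"))
```

Converting the date portion to a `date` object catches malformed IDs early and makes it straightforward to group documents by day, which matches how the files are organized.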

For more information please visit the HARD Project website.

Samples

For an example of the data in this corpus, please review this sample.

Content Copyright

Portions © 2003 Agence France Presse, © 2003 The Associated Press, © 2003 Central News Agency Taiwan, © 2003 Los Angeles Times-Washington Post News Service, Inc., © 2003 The New York Times, © 2003 Salon.com, © 2003 Ummah Press Service, © 2003 Xinhua News Agency, © 2005 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.