


HARD 2004 Topics and Annotations

Item Name: HARD 2004 Topics and Annotations
Authors: Stephanie Strassel and Meghan Glenn
LDC Catalog No.: LDC2005T29
ISBN: 1-58563-373-9
Release Date: Dec 20, 2005
Data Source(s): newswire
Application(s): automatic content extraction, information detection, information extraction, information retrieval, topic detection and tracking
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2005 members
Non-member Fee: US $350.00
Reduced-License Fee: US $175.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Stephanie Strassel and Meghan Glenn. 2005. HARD 2004 Topics and Annotations. Linguistic Data Consortium, Philadelphia.

Introduction

The HARD 2004 Topics and Annotations corpus was produced by the Linguistic Data Consortium (LDC) and is catalogued as LDC2005T29, ISBN 1-58563-373-9. It contains topics and annotations (clarification forms, responses, and relevance assessments) for the 2004 TREC HARD (High Accuracy Retrieval from Documents) evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC). Its objective was high-accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques such as passage retrieval and targeted interaction with the searcher.

The current corpus was previously distributed to HARD participants as LDC2004E42 and LDC2005E17. The corresponding source data is distributed as LDC2005T28, HARD 2004 Text. This corpus was created with support from the DARPA TIDES Program and LDC.

Data

Three major annotation tasks are represented in this release: Topic Creation, Clarification Form Responses, and Relevance Assessment. Topics include a short title, a query plus context, and a number of limiting parameters known as "metadata", which include the targeted geographical region, the target data domain or genre, and the level of searcher expertise. Clarification Forms are brief HTML questionnaires that system developers submitted to LDC searchers to gather additional details about information needs directly from the topic creators. Relevance assessment consisted of adjudicating pooled system responses; it included document-level judgments for all topics and passage-level relevance judgments for a subset of topics.
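As an informal illustration of the topic structure described above, a topic could be modeled as a small record type. This is only a sketch: the class names, field names, and the `topic_id` identifier are hypothetical and are not taken from the corpus files, whose actual format is described in the online documentation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TopicMetadata:
    """Limiting parameters ("metadata") of a topic. Field names are hypothetical."""
    geography: Optional[str] = None   # targeted geographical region
    genre: Optional[str] = None       # target data domain or genre
    expertise: Optional[str] = None   # level of searcher expertise


@dataclass
class Topic:
    """One topic: short title, query plus context, and metadata."""
    topic_id: str   # hypothetical identifier, not a real corpus field name
    title: str
    query: str
    context: str
    metadata: TopicMetadata = field(default_factory=TopicMetadata)

    def without_metadata(self) -> "Topic":
        """Return a copy with the metadata left empty."""
        return Topic(self.topic_id, self.title, self.query, self.context)
```

The `without_metadata` helper reflects the fact that a topic description can be handed out with or without its limiting parameters; any values placed in these fields for experimentation would be made up, not corpus data.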

The release is divided into training and evaluation resources. The training set comprises twenty-one topics and 100 document-level relevance judgments per topic. The evaluation set contains fifty topics, clarification forms and responses, document-level relevance assessments for all topics, and passage-level judgments for half of the topics. HARD participants received the reference data in stages over the course of the evaluation cycle: (0) training topics, (1) evaluation topic descriptions without metadata, (2) clarification form responses, (3) topic descriptions with metadata, and (4) relevance assessments.

For more information please consult the HARD Project website.


Content Copyright

© 2005 Trustees of the University of Pennsylvania.



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.