Introduction
This file contains documentation for TDT5 Topics and Annotations, Linguistic
Data Consortium (LDC) catalog number LDC2006T19 and ISBN 1-58563-418-2.
This release includes topic relevance judgments
and associated information for the TDT5 2004 evaluation topics. This
release contains complete relevance judgments, including the results of
adjudication, in which discrepancies between system submissions and LDC
annotations are reviewed and relevance judgments updated. This release also
contains answer keys for the link detection task.
The TDT5 corpora were created by the Linguistic Data Consortium with support
from the DARPA TIDES (Translingual Information Detection, Extraction and
Summarization) Program. The multilingual news text corresponding to this
publication can be found in LDC Publication LDC2006T18,
TDT5 Multilingual News Text.
Data
A total of 250 topics, numbered 55001-55250, were annotated by LDC
using a search-guided annotation technique. Details of the annotation
process are described in the annotation task definition.
Approximately 25% of the topics are monolingual English (ENG), 25% are
monolingual Mandarin Chinese (MAN), 25% are monolingual Arabic (ARB),
and 25% are multilingual:
| 63 | ENG |
| 62 | MAN |
| 62 | ARB |
| 35 | ARB ENG MAN |
| 21 | ENG MAN |
| 7 | ARB ENG |
| 250 | total |
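Broken down by language, counting both monolingual and multilingual
topics (a multilingual topic is counted once for each of its languages):
| 126 | ENG |
| 118 | MAN |
| 104 | ARB |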
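The per-language counts follow from the topic table above by adding each row's count to every language the row lists. The following is a minimal sketch of that arithmetic in Python. The names `TOPIC_COUNTS` and `topics_per_language` are illustrative only and are not part of the release; the numbers are copied from the table.

```python
from collections import Counter

# Topic counts keyed by language combination, copied from the table above.
# The dictionary name and layout are illustrative, not part of the release.
TOPIC_COUNTS = {
    ("ENG",): 63,
    ("MAN",): 62,
    ("ARB",): 62,
    ("ARB", "ENG", "MAN"): 35,
    ("ENG", "MAN"): 21,
    ("ARB", "ENG"): 7,
}


def topics_per_language(counts):
    """Return, for each language, the number of topics that include it."""
    totals = Counter()
    for languages, n in counts.items():
        for lang in languages:
            totals[lang] += n
    return totals


totals = topics_per_language(TOPIC_COUNTS)
total_topics = sum(TOPIC_COUNTS.values())
multilingual = sum(n for langs, n in TOPIC_COUNTS.items() if len(langs) > 1)

for lang in ("ENG", "MAN", "ARB"):
    print(lang, totals[lang])
print("total topics:", total_topics)
print("multilingual topics:", multilingual)
```

Because each multilingual topic is counted under every language it covers, the per-language counts sum to more than the 250 distinct topics.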
Samples
For an example of the data in this corpus, please review this sample from the link detection files.
Content Copyright
Portions © 2004, 2006 Trustees of the University of Pennsylvania