Introduction
The TDT Pilot Study corpus was created to support an initiative in "topic
detection and tracking." This initiative is directed toward computer
processing of language data, both text and speech. The objective is
to explore techniques for detecting the appearance of new and
unexpected topics and for tracking their reappearance and evolution.
Data
The TDT corpus comprises a set of stories that includes both newswire
(text) and broadcast news (speech). Each story is represented as a
stream of text, in which the text is either taken directly from the
newswire (Reuters) or is a manual transcription of the broadcast news
speech (CNN). The corpus spans the period from July 1, 1994 to June
30, 1995. It contains approximately 16,000 stories, with about half
taken from Reuters newswire and half from CNN broadcast news
transcripts.
A key part of the corpus is its annotation in terms of the events
discussed in the stories. Twenty-five events were defined that span a
variety of event types and cover a subset of the events discussed in
the corpus stories. Annotation data
for these events are included in the corpus and provide a basis for
training TDT systems.
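As a rough illustration of how the stories and event annotations described above might be held in memory, the following Python sketch may be useful. It is not the corpus's file format or an official reader: the `Story` fields, the `stories_by_event` helper, the identifiers such as `s1` and `event_01`, and the toy records are all hypothetical and chosen only for this example. Only the two sources (Reuters, CNN) and the date range come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

# Corpus period as stated in the description above.
CORPUS_START = date(1994, 7, 1)
CORPUS_END = date(1995, 6, 30)


@dataclass(frozen=True)
class Story:
    """One story as a stream of text (all field names are hypothetical)."""
    story_id: str   # made-up identifier, not the corpus's own numbering
    source: str     # "Reuters" (newswire) or "CNN" (broadcast transcript)
    day: date
    text: str


def in_corpus_period(story):
    """Return True if the story's date falls inside the stated corpus period."""
    return CORPUS_START <= story.day <= CORPUS_END


def stories_by_event(judgments):
    """Group story ids by event id from (story_id, event_id) pairs."""
    index = defaultdict(list)
    for story_id, event_id in judgments:
        index[event_id].append(story_id)
    return dict(index)


# Toy records, invented for illustration only.
stories = [
    Story("s1", "Reuters", date(1994, 7, 1), "placeholder newswire text"),
    Story("s2", "CNN", date(1994, 7, 2), "placeholder transcript text"),
    Story("s3", "CNN", date(1995, 6, 30), "placeholder transcript text"),
]
judgments = [("s1", "event_01"), ("s3", "event_01"), ("s2", "event_02")]

index = stories_by_event(judgments)
all_in_period = all(in_corpus_period(s) for s in stories)
```

Under these assumptions, a tracking system could draw its training stories for one event by looking up that event's entry in `index`.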
Updates
There are no updates at this time.
Copyright
Portions © 1994-1995 Cable News Network, LP, LLLP, © 1994-1995 Reuters America, Inc., © 1998 Trustees of the University of Pennsylvania