The TDT corpus includes approximately 16,000 stories collected during the period July 1, 1994 to June 30, 1995, about half from Reuters newswire and half from CNN broadcast news transcripts. A key part of the corpus is its annotation of the news events discussed in the stories. Twenty-five events were defined; they span a variety of event types and cover a subset of the events discussed in the corpus stories. Annotation data for these events are included in the corpus and provide a basis for training TDT systems.
For the Pilot Study, both the Corpus Documentation and the Evaluation Plan are available in PostScript format.
LDC distributes the TDT Pilot Corpus. For ordering information, see the TDT Pilot Corpus entry in our catalog.