Introduction
Topic Detection and Tracking (TDT) refers to automatic techniques for
finding topically related material in streams of data such as newswire and
broadcast news. The TDT2 corpus was created to support three TDT2 tasks:
finding topically homogeneous sections (segmentation), detecting the
occurrence of new events (detection), and tracking the reoccurrence of old
or new events (tracking). For further information on TDT2 please visit our TDT2 Information Pages.
Data
TDT2 Multilanguage Text Corpus Version 4.0 contains news data collected
daily from nine news sources in two languages (American English and Mandarin
Chinese) over a period of six months (January-June 1998). Both
manually created reference text and automatically generated text (ASR and/or
machine translation) are provided for all broadcast and all Mandarin data.
This version has been prepared to complement the first general release
of the TDT3 Multilanguage Text Corpus, providing new enhancements to
make the data content more accessible to a broader research community.
The news sources and approximate number of stories per source
(in thousands) are as follows:
English sources                              Thousands of stories
-----------------------------------------------------------------
New York Times Newswire Service                              11.8
Associated Press Worldstream Service                         12.8
Cable News Network, "Headline News"                          15.8
American Broadcasting Co., "World News Tonight"               2.1
Public Radio International, "The World"                       2.9
Voice of America, English news programs                       8.2
Total English stories: 53.6 thousand

Mandarin sources                             Thousands of stories
-----------------------------------------------------------------
Xinhua News Agency                                           11.3
Zaobao News Agency                                            5.2
Voice of America, Mandarin Chinese news programs              2.3
Total Mandarin stories: 18.8 thousand
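The per-source figures above can be checked against the stated totals with a short script. This is only a sketch: the dictionary names and the total() helper are illustrative and are not part of the corpus distribution, and the numbers are the approximate counts (in thousands) copied from the table.

```python
# Approximate story counts in thousands, copied from the table above.
# The variable and function names are illustrative, not part of the corpus.
ENGLISH_SOURCES = {
    "New York Times Newswire Service": 11.8,
    "Associated Press Worldstream Service": 12.8,
    'Cable News Network, "Headline News"': 15.8,
    'American Broadcasting Co., "World News Tonight"': 2.1,
    'Public Radio International, "The World"': 2.9,
    "Voice of America, English news programs": 8.2,
}

MANDARIN_SOURCES = {
    "Xinhua News Agency": 11.3,
    "Zaobao News Agency": 5.2,
    "Voice of America, Mandarin Chinese news programs": 2.3,
}


def total(counts: dict[str, float]) -> float:
    """Sum the counts, rounded to one decimal place like the table."""
    return round(sum(counts.values()), 1)


if __name__ == "__main__":
    print("English: ", total(ENGLISH_SOURCES), "thousand")
    print("Mandarin:", total(MANDARIN_SOURCES), "thousand")
```

Rounding to one decimal place keeps floating-point noise out of the comparison with the printed totals.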
This release consists of the English and Mandarin text components of the
TDT2 corpus. The data was collected daily over a period of six months
(January-June 1998) from the following sources.
- American Broadcasting Company (ABC)
- Associated Press
- Cable News Network, Inc. (CNN)
- New York Times
- Public Radio International (PRI)
- Voice of America (VOA)
- Xinhua News Agency
- Zaobao News
The data is provided in the following formats.
.sgm:   Reference "true text," with markup providing story boundaries and
        descriptive information
.tkn:   Tokenized version of the SGML data, with all descriptive and
        boundary information removed
.as0:   Output of the Dragon ASR system in tokenized form, with information
        on timing, speaker clusters, and confidence
.as1:   Output of the BBN ASR system in tokenized form, with timing
        information (English only)
.mttkn: SYSTRAN output from .tkn (Mandarin only)
.mtas0: SYSTRAN output from .as0 (Mandarin only)
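
As a rough illustration of how the format list above might be used when sorting through the distribution, the following sketch maps each file extension to its description. The TDT2_FORMATS table and the describe() helper are hypothetical names introduced here, and the file name in the usage note is made up; the corpus's actual directory layout and naming scheme are not assumed.

```python
from pathlib import Path

# Extension descriptions paraphrased from the format list above.
# TDT2_FORMATS and describe() are hypothetical helpers, not corpus tools.
TDT2_FORMATS = {
    ".sgm": "reference text with story-boundary markup",
    ".tkn": "tokenized reference text, markup removed",
    ".as0": "Dragon ASR output, tokenized",
    ".as1": "BBN ASR output, tokenized (English only)",
    ".mttkn": "SYSTRAN output from .tkn (Mandarin only)",
    ".mtas0": "SYSTRAN output from .as0 (Mandarin only)",
}


def describe(filename: str) -> str:
    """Return the format description for a file name, based on its extension."""
    suffix = Path(filename).suffix.lower()
    return TDT2_FORMATS.get(suffix, "unknown format")
```

For example, describe("example.as1") returns the BBN ASR description, and any extension not in the list falls back to "unknown format".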
The corpus also includes topic relevance tables as well as tables for
locating story boundaries.
Updates
There are no updates at this time.
Content Copyright
Portions © 1998 American Broadcasting Company, The Associated Press, Cable News Network, LP, LLLP, New York Times, Public Radio International, SPH AsiaOne Ltd, Xinhua News Agency, © 1998-2001 Trustees of the University of Pennsylvania
The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.