Introduction
2002 Rich Transcription Broadcast News and Conversational Telephone Speech was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2004S11 and ISBN 1-58563-311-9.
This corpus contains the test material used in the 2002 Rich Transcription (RT-02)
Evaluation of Broadcast News and Conversational Telephone Speech, administered
by the NIST Speech Group in the Spring of 2002.
The RT-02 Meeting Recognition Evaluation material is available in a separate
distribution. For complete up-to-date information, see the RT-02 Evaluation Website.
The RT-02 Evaluation supported two main evaluation tasks:
- Speech-To-Text (STT) Tasks -- included three processing speeds (1x
real time, 10x real time, and unlimited time) for both the Broadcast News
(BN) and Conversational Telephone Speech (CTS) domains.
- Metadata Extraction (MDE) Task -- consisted of a speaker diarization
task for the BN and CTS domains.
Data
This distribution of the RT-02 Evaluation Data contains only Broadcast News
and Conversational Telephone Speech data. Meeting data used in the RT-02 Evaluation
is packaged separately and is not included here.
All recordings are in English.
The BN data is composed of six approximately 10-minute excerpts from six
different broadcasts. Each waveform is a SPHERE-headered, single-channel,
16-bit PCM file. The broadcasts were selected from programs from MNB, PRI, NBC, CNN, VOA and ABC, all collected
in 1998. The evaluation excerpts were transcribed to the nearest
story boundary.
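As a rough illustration of what "SPHERE-headered" means in practice, the sketch below reads the plain-text header that precedes the samples in a NIST SPHERE file. It assumes the usual layout: a `NIST_1A` line, a line giving the header size in bytes, then `name type value` lines up to `end_head`. The function name `parse_sphere_header` and the synthetic header in the demo are hypothetical and are not taken from this corpus.

```python
def parse_sphere_header(raw: bytes):
    """Return (fields, header_size) parsed from the start of a SPHERE file.

    Assumes the conventional layout: "NIST_1A", then the header size in
    bytes, then "name type value" lines terminated by "end_head".
    """
    magic, size_line = raw.split(b"\n", 2)[:2]
    if magic.strip() != b"NIST_1A":
        raise ValueError("not a SPHERE file: missing NIST_1A marker")
    header_size = int(size_line.strip())

    fields = {}
    text = raw[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, ftype, value = line.split(None, 2)
        if ftype == "-i":          # integer field
            fields[name] = int(value)
        elif ftype == "-r":        # real-valued field
            fields[name] = float(value)
        else:                      # string field, e.g. "-s3 pcm"
            fields[name] = value
    return fields, header_size


if __name__ == "__main__":
    # Synthetic header for demonstration only (not real corpus data).
    demo = (b"NIST_1A\n   1024\n"
            b"channel_count -i 1\n"
            b"sample_n_bytes -i 2\n"
            b"sample_coding -s3 pcm\n"
            b"end_head\n").ljust(1024, b" ")
    fields, size = parse_sphere_header(demo)
    print(size, fields)
```

The sample data begins at byte offset `header_size`; the field names actually present in the corpus files should be checked against the files themselves.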
The CTS data is composed of 60 approximately five-minute excerpts from 60 different
conversations: 20 from Switchboard-1 data, 20 from Switchboard-2 data,
and 20 from Switchboard Cellular-2 data. Evaluation excerpts were transcribed
to the nearest turn. Unlike the BN audio files, for which the full broadcasts were provided, the CTS
audio files contain only the evaluation excerpts. Each audio excerpt is a
SPHERE-headered, two-channel, interleaved, 8-bit mu-law file.
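To make the two-channel interleaved mu-law layout concrete, here is a minimal sketch that splits a block of sample bytes into its two channels and expands each byte to a linear 16-bit value using the standard G.711 mu-law expansion. It assumes the SPHERE header has already been skipped and that the sample data is stored uncompressed, which should be verified against the actual files. The function names `ulaw_to_linear` and `split_channels` are hypothetical.

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample (G.711)."""
    u = ~byte & 0xFF                 # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def split_channels(payload: bytes):
    """Deinterleave two-channel mu-law bytes into two lists of linear samples.

    Assumes samples alternate channel 1, channel 2, channel 1, ... with one
    byte per sample.
    """
    first = [ulaw_to_linear(b) for b in payload[0::2]]
    second = [ulaw_to_linear(b) for b in payload[1::2]]
    return first, second


if __name__ == "__main__":
    # Four synthetic bytes: two samples per channel (not real corpus data).
    side_a, side_b = split_channels(bytes([0xFF, 0x00, 0x80, 0x7F]))
    print(side_a, side_b)
```

Keeping the two conversation sides in separate lists lets each side be processed or scored independently.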
The reference transcripts are also provided in this corpus. The official format for STT reference data is STM
(files with the extension 'stm'), while the official format for MDE reference
data is RTTM (files with the extension 'rttm'). Files with the extensions 'txt'
or 'utf' are the original reference transcripts before any format conversions,
additions of annotations, etc., and are included for completeness.
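For readers unfamiliar with STM, the sketch below parses a single segment line under the assumption that the files follow the segment time mark layout used by NIST scoring tools: file name, channel, speaker, start time, end time, an optional label in angle brackets, then the transcript, with comment lines beginning with `;;`. The names `StmSegment` and `parse_stm_line` and the demo line are hypothetical; the STM files in this corpus may differ in detail.

```python
from typing import NamedTuple, Optional


class StmSegment(NamedTuple):
    filename: str
    channel: str
    speaker: str
    start: float
    end: float
    label: Optional[str]
    transcript: str


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    parts = line.split(None, 5)
    if len(parts) < 5:
        raise ValueError(f"too few fields in STM line: {line!r}")
    filename, channel, speaker, start, end = parts[:5]
    rest = parts[5] if len(parts) > 5 else ""

    # Optional label such as "<o,f0,male>" directly after the end time.
    label = None
    if rest.startswith("<"):
        close = rest.find(">")
        if close != -1:
            label = rest[:close + 1]
            rest = rest[close + 1:].strip()
    return StmSegment(filename, channel, speaker,
                      float(start), float(end), label, rest)


if __name__ == "__main__":
    # Made-up line for illustration only (not taken from the corpus).
    print(parse_stm_line("demo_file 1 spk_a 12.50 15.75 <o,f0,male> hello world"))
```

The sketch covers only STM; RTTM uses a different, object-oriented record layout and would need its own parser.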
Samples
Please examine this example
to review a sample of this corpus.
Updates
There are no updates available at this time.
Content Copyright
Portions © 2004 Trustees of the University of Pennsylvania,
© 1998 American Broadcasting Company,
© 1998 National Broadcasting Company, Inc.,
© 1998 Cable News Network LP, LLP. All Rights Reserved,
© 1998 Public Radio International.
The World is the co-production of Public
Radio International and the British Broadcasting Corporation and is
produced at WGBH Boston.