Introduction
2006 NIST Spoken Term Detection Evaluation Set, Linguistic Data Consortium
(LDC) catalog number LDC2011S03 and ISBN 1-58563-584-7, was compiled by researchers
at NIST (National Institute of Standards and Technology) and contains approximately eighteen hours of Arabic,
Chinese and English broadcast news, English conversational telephone speech
and English meeting room speech used in NIST's 2006
Spoken Term Detection (STD) evaluation. The STD initiative is designed to
facilitate research and development of technology for retrieving information
from archives of speech data with the goals of exploring promising new ideas
in spoken term detection, developing advanced technology incorporating these
ideas, measuring the performance of this technology and establishing a community
for the exchange of research results and technical insights.
The 2006 STD task was to find all of the occurrences of a specified term
(a sequence of one or more words) in a given corpus of speech data. The evaluation
was intended to develop technology for rapidly searching very large quantities
of audio data. Although the evaluation used modest amounts of data, it was structured
to simulate the very large data situation and to make it possible to extrapolate
the speed measurements to much larger data sets. Therefore, systems were implemented
in two phases: indexing and searching. In the indexing phase, the system processes
the speech data without knowledge of the terms. In the searching phase, the
system uses the terms, the index, and optionally the audio to detect term occurrences.
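The two-phase structure described above can be illustrated with a minimal sketch. It assumes a toy input, a time-aligned word transcript per recording, whereas actual STD systems typically index speech recognizer output. The function names `build_index` and `search`, the file ids, and the timestamps are all hypothetical and are not part of the evaluation software.

```python
from collections import defaultdict


def build_index(transcripts):
    """Indexing phase: processes the data without knowledge of the terms.

    `transcripts` maps a file id to a list of (word, start_sec, end_sec).
    Returns a mapping from each lowercased word to its occurrences.
    """
    index = defaultdict(list)
    for file_id, words in transcripts.items():
        for pos, (word, start, end) in enumerate(words):
            index[word.lower()].append((file_id, pos, start, end))
    return index


def search(index, term):
    """Searching phase: uses only the term and the index.

    A term is a sequence of one or more words. Returns a list of
    (file_id, start_sec, end_sec) for every exact occurrence.
    """
    words = term.lower().split()
    if not words:
        return []
    hits = []
    for file_id, pos, start, end in index.get(words[0], []):
        last_end = end
        matched = True
        # Each following word must appear at the next position in the same file.
        for offset, word in enumerate(words[1:], start=1):
            following = [
                entry for entry in index.get(word, [])
                if entry[0] == file_id and entry[1] == pos + offset
            ]
            if not following:
                matched = False
                break
            last_end = following[0][3]
        if matched:
            hits.append((file_id, start, last_end))
    return hits


# Hypothetical toy data: two recordings with made-up word timings.
transcripts = {
    "bn1": [("spoken", 0.0, 0.4), ("term", 0.4, 0.7), ("detection", 0.7, 1.3)],
    "cts1": [("term", 2.0, 2.3)],
}
index = build_index(transcripts)
single_word_hits = search(index, "term")
multi_word_hits = search(index, "Spoken Term")
```

Because the index is built once and each query only consults it, search cost can be measured separately from indexing cost, which is the property that allows speed measurements to be extrapolated to larger collections.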
The development data is available in 2006 NIST Spoken Term Detection Development Set LDC2011S02.
Data
The evaluation corpus consists of three data genres: broadcast news (BNews),
conversational telephone speech (CTS) and conference room meetings (CONFMTG).
The broadcast news material was collected in 2003 and 2004 by LDC's
broadcast collection system from the following sources: ABC (English), Aljazeera (Arabic), China Central TV (Chinese), CNN (English), CNBC
(English), Dubai TV (Arabic), New Tang Dynasty TV (Chinese), Public Radio International (English) and Radio Free Asia (Chinese). The CTS data was taken from the Switchboard
data sets (e.g., Switchboard-2
Phase 1 LDC98S75, Switchboard-2
Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher
English Training Speech Part 1 LDC2004S13), also collected by LDC. The conference
room meeting material consists of goal-oriented, small group roundtable meetings
and was collected in 2004 and 2005 by NIST, the International Computer
Science Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh,
PA), TNO (The Netherlands) and Virginia Polytechnic Institute and State University (Blacksburg, VA)
as part of the AMI corpus project.
This evaluation corpus includes scoring software. The software uses the inputs described in the STD
Evaluation Plan to complete the evaluation of a system.
Each BNews recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted file.
CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files.
The CONFMTG files contain a single recorded channel.
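To check properties such as channel count, sample rate and encoding for a given file, the SPHERE header can be inspected directly. The following is a minimal sketch, assuming the usual SPHERE layout: an ASCII header that begins with `NIST_1A`, followed by the header size in bytes, then `name type value` lines terminated by `end_head`. The function name `parse_sphere_header` is hypothetical, and the header bytes below are synthetic, written to mirror the BNews description rather than copied from a file in this corpus.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a SPHERE file into a dict."""
    lines = data.split(b"\n", 2)
    if len(lines) < 3 or lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())  # total header length in bytes
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for raw in text.split("\n")[2:]:
        line = raw.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, field_type, value = parts
        if field_type == "-i":
            fields[name] = int(value)
        elif field_type == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # string fields, typed as -sN
    return fields


# Synthetic header (assumed values), padded to the declared 1024 bytes.
sample_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

header_fields = parse_sphere_header(sample_header)
```

For a real file, the same function could be applied to the first bytes read from disk, for example `parse_sphere_header(open(path, "rb").read(1024))`, where `path` is a placeholder and headers larger than 1024 bytes would need a longer read.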
Samples
For an example of the audio data in this corpus, please examine this audio sample.
Content Copyright
Portions © 2003 American Broadcasting Corporation, © 2003 Aljazeera, © 2003 Cable News Network,
LP, LLP, © 2004 China Central TV, © 2003 Dubai TV, © 2003 National Broadcasting Company, © 2004 New Tang Dynasty TV, © 2003 Public Radio
International, © 1998, 1999, 2003, 2004, 2011 Trustees of the University of Pennsylvania