The third ARPA Continuous Speech Recognition (CSR) Benchmark Speech Test
Collection is a three CD-ROM set that contains complete development test
and evaluation test suites for speaker-independent, large-vocabulary
speech recognition systems.
The development and evaluation tests share a common structure,
consisting of two core test components ("hubs") and seven specialized
test components ("spokes"). The hub tests, which were mandatory for all
ARPA CSR participants in the November '94 evaluations, provide a base-line for ASR performance, while the spokes provide the means for
assessing the impact of particular speaking conditions or processing
strategies in relation to base-line performance. Participants were
free to take any combination of spoke tests according to their
research interests. Taken together, the collection encompasses 180
speakers, each producing 20-40 sentences. These are organized
into two complete development test sets and one evaluation set.
The collection also includes complete documentation on the test
specifications, data collection procedures, transcriptions and
scoring protocols, together with the latest available version of NIST
software for scoring ASR results and managing SPHERE waveform files.
All speech data is accompanied by both the prompting texts and the
detailed orthographic transcriptions of the utterances.
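
As a rough illustration of what scoring ASR results against the orthographic transcriptions involves, the sketch below computes a word error rate (WER) from a word-level edit distance. It is not the NIST scoring software and does not follow the scoring protocols documented in the collection: the function name `word_error_rate` and the sample sentences are hypothetical, and the comparison is a plain case-sensitive whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length.

    Simplified sketch: words are split on whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcription and recognizer output, for illustration only.
    ref_text = "the stock market rose sharply"
    hyp_text = "the stock markets rose"
    print(f"WER = {word_error_rate(ref_text, hyp_text):.2f}")
```

In this made-up example one word is substituted and one is deleted out of five reference words. Results from a sketch like this will generally differ from official scores, which depend on the transcription conventions and scoring rules supplied with the collection.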
This was the first ARPA CSR Benchmark Test in which prompting texts
were drawn from a variety of news sources. Whereas earlier
benchmarks were based on Wall Street Journal excerpts (from the
period 1987-89), CSR-III prompts come from a variety of North American
Business News Services: Reuters News Service, New York Times, Washington
Post and Los Angeles Times as well as WSJ; all texts are drawn from
financial news articles written during the period of April through June,
1994. (NAB stands for "North American Business," in contrast to earlier
benchmarks and training collections labeled "WSJ").
An important companion to the 1994 Benchmark Speech data collection is
the four-disk CSR-III Text Collection (LDC95T6), which
includes the ARPA CSR 1994 Standard Language Model. This corpus is also available from the LDC as a 1995
Because of restrictions imposed by the copyright holders of much of the NAB
text, both the speech and text collections are available to LDC members only.
For more information on how to join, send email to email@example.com.
The Reduced Licensing Fee for this corpus is US$200.