LDC94S13A - Complete CSR-II corpus
LDC94S13B - CSR-II Sennheiser speech
LDC94S13C - CSR-II Other speech
Data
The complete WSJ1 corpus contains approximately 78,000 training
utterances (73 hours of speech), 4,000 of which are the result of
spontaneous dictation by journalists with varying degrees of
experience in dictation. The corpus contains approximately 8,200
"conventional" development test utterances (eight hours of speech), 6,800
of which are from spontaneous dictation. As with the pilot corpus,
the entire corpus was collected using two microphones, so the amount of
speech in the entire corpus is about 162 hours (twice the roughly 81 hours
of training and development test speech).
In early 1993, a "Hub and Spoke" test paradigm was designed, calling
for eleven test sets, each a specific variation of the basic or
"hub" condition. The eleven Hub and Spoke Development and
Evaluation Test sets each contain approximately 7,500 waveforms (eleven
hours of speech).
WSJ1 waveforms have been compressed by about 2:1 using the
SPHERE-embedded "Shorten" compression algorithm developed at
Cambridge University.
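Whether a given waveform is Shorten-compressed can usually be seen in its SPHERE header. The following is a minimal Python sketch, assuming the standard NIST SPHERE layout (an ASCII header that begins with "NIST_1A", gives the header size on the second line, and ends with "end_head") and assuming that compressed files mention "shorten" in the sample_coding field. The function names are hypothetical, and the sketch only inspects the header; decoding the audio itself requires a SPHERE-aware tool.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header bytes of a NIST SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    # Line 0 is the magic string and line 1 is the header size in bytes.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip anything that is not "name -type value"
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN" marks a string of length N
    return fields


def read_sphere_header(path: str) -> dict:
    """Read and parse only the header portion of a SPHERE file on disk."""
    with open(path, "rb") as f:
        f.readline()  # magic line, validated by parse_sphere_header
        header_size = int(f.readline().decode("ascii").strip())
        f.seek(0)
        raw = f.read(header_size)
    return parse_sphere_header(raw)


def is_shorten_compressed(fields: dict) -> bool:
    """Assumption: compressed files carry 'shorten' in sample_coding."""
    return "shorten" in str(fields.get("sample_coding", "")).lower()
```

Calling is_shorten_compressed(read_sphere_header(path)) on a waveform file then reports whether it needs to be decompressed before its samples can be read as plain PCM.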
Updates
The CD-ROM labeled "Evaluation Test Data, Part 1" (NIST Speech Disk 13-32.1) contains the file
wsj1/doc/lng_modl/base_lm/tcb20onp.z ("WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z" on a Windows OS).
Although this file has the ".z" extension, it is not compressed.
To use the file, simply ignore the ".z" extension.
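One way to act on this note is to confirm that the file does not begin with a known compression signature and then copy it under a name without the extension. The Python sketch below assumes the usual two-byte magic numbers for gzip, compress, and pack. The helper names and the destination directory are hypothetical.

```python
import shutil
from pathlib import Path

# Two-byte signatures of common Unix compression formats.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"\x1f\x9d": "compress (.Z)",
    b"\x1f\x1e": "pack (.z)",
}


def detect_compression(first_bytes: bytes):
    """Return the format name if the bytes start with a known signature."""
    return MAGIC.get(first_bytes[:2])


def copy_without_z(src, dest_dir) -> Path:
    """Copy src into dest_dir, dropping a trailing .z/.Z extension.

    Raises ValueError if the file looks compressed after all.
    """
    src = Path(src)
    with src.open("rb") as f:
        kind = detect_compression(f.read(2))
    if kind is not None:
        raise ValueError(f"{src} looks like {kind} data; decompress it instead")
    dest = Path(dest_dir) / src.name
    if dest.suffix.lower() == ".z":
        dest = dest.with_suffix("")
    shutil.copyfile(src, dest)
    return dest
```

For example, copy_without_z("wsj1/doc/lng_modl/base_lm/tcb20onp.z", "lm") would write lm/tcb20onp, provided the lm directory already exists.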