The CSR (continuous speech recognition) corpus series was developed in the early 1990s under DARPA’s Spoken Language Program to support research on large-vocabulary CSR systems. 

CSR-I (WSJ0) Complete (LDC93S6A) and CSR-II (WSJ1) Complete (LDC94S13A) contain speech read from a machine-readable corpus of Wall Street Journal news text. They also include transcripts and spontaneous dictation of hypothetical news articles by journalists.

The text in CSR-I (WSJ0) was selected to fall within either a 5,000-word subset or a 20,000-word subset. Audio includes speaker-dependent and speaker-independent sections as well as sentences with verbalized and nonverbalized punctuation (Doddington, 1992). CSR-II features “Hub and Spoke” test sets that include a 5,000-word subset and a 64,000-word subset. Both data sets were collected using two microphones – a close-talking Sennheiser HMD414 and a second microphone of varying type.

WSJ0 Cambridge Read News (LDC95S24) was developed by Cambridge University and consists of recordings of native British English speakers reading CSR WSJ news text, specifically sentences from the 5,000-word and 64,000-word subsets. All speakers also recorded a common set of 18 adaptation sentences.

The CSR corpora continue to have value for the research community. CSR-I (WSJ0) target utterances were used in the CHiME2 and CHiME3 challenges, which focused on distant-microphone automatic speech recognition in real-world environments. CHiME2 WSJ0 (LDC2017S10) and CHiME2 Grid (LDC2017S07) each contain over 120 hours of English speech from a noisy living room environment. CHiME3 (LDC2017S24) consists of 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio.

CSR-I target utterances were also used in the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. DIRHA English WSJ Audio (LDC2018S01) consists of approximately 85 hours of real and simulated read speech from native American English speakers in an apartment setting with typical domestic background noises and inter/intra-room reverberation effects.

Multi-Channel WSJ Audio (LDC2014S03), designed to address the challenges of speech recognition in meetings, contains 100 hours of audio from British English speakers reading sentences from WSJ0 Cambridge Read News. There were three recording scenarios: a single stationary speaker, two stationary overlapping speakers, and a single moving speaker.

All CSR corpora and their related data sets are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.