WSJCAM0 Cambridge Read News

Item Name: WSJCAM0 Cambridge Read News
Authors: Tony Robinson, Jeroen Fransen, David Pye, Jonathan Foote, Steve Renals, Phil Woodland, and Steve Young
LDC Catalog No.: LDC95S24
ISBN: 1-58563-058-6
Data Type: speech
Sample Rate: 16000 Hz
Sampling Format: 1-channel pcm compressed
Data Source(s): microphone speech
Application(s): speech recognition
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member fee: $0 for 1995 members
Non-member Fee: US $1750.00
Reduced-License Fee: US $875.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Readme File: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Tony Robinson, et al. 1995. WSJCAM0 Cambridge Read News. Linguistic Data Consortium, Philadelphia.

A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition (The Cambridge University Version of the ARPA CSR Corpus "WSJ0").

This release of WSJCAM0 on CD-ROM represents version 1.1 of the corpus, which was initially released on tape by Cambridge University as of August 31, 1994. This collection is modelled directly on the initial ARPA CSR Corpus (WSJ0, a fifteen-disc corpus released by LDC in 1993): it uses the same dual-microphone recording paradigm and a subset of prompting texts drawn from the Wall Street Journal.

There are two key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 are native speakers of British English and (2) in addition to standard orthographic transcripts, WSJCAM0 also has information on the time alignment between the sampled waveform and both the words and the phonetic segments.

The CD-ROM publication consists of six discs, with contents organized as follows:

  • Discs 1 and 2 - training data from head-mounted microphone
  • Disc 3 - development test data from head-mounted microphone, plus first set of evaluation test data
  • Discs 4 and 5 - training data from desk-mounted microphone
  • Disc 6 - development test data from desk-mounted microphone, plus second set of evaluation test data

There are 90 utterances from each of 92 speakers that are designated as training material for speech recognition algorithms. An additional 48 speakers each read 40 sentences containing only words from a fixed 5,000-word vocabulary and another 40 sentences using a 64,000-word vocabulary, to be used as testing material. Each of the 140 speakers in total also recorded a common set of 18 adaptation sentences. Recordings were made from two microphones: a far-field desk microphone and a head-mounted close-talking microphone.
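
The figures in the paragraph above imply some nominal totals, which the short sketch below works out. It is only arithmetic on the numbers quoted in this description. The constant and function names are illustrative, and the actual number of files on the discs may differ from these nominal counts.

```python
# Nominal figures quoted in the corpus description; names are illustrative.
TRAIN_SPEAKERS = 92
TRAIN_UTTS_PER_SPEAKER = 90
TEST_SPEAKERS = 48
TEST_SENTS_5K = 40        # sentences limited to the 5,000-word vocabulary
TEST_SENTS_64K = 40       # sentences using the 64,000-word vocabulary
ADAPTATION_SENTS = 18     # common adaptation sentences read by every speaker
MICROPHONES = 2           # head-mounted and desk-mounted


def nominal_counts():
    """Return nominal utterance counts per microphone channel."""
    speakers = TRAIN_SPEAKERS + TEST_SPEAKERS
    training = TRAIN_SPEAKERS * TRAIN_UTTS_PER_SPEAKER
    testing = TEST_SPEAKERS * (TEST_SENTS_5K + TEST_SENTS_64K)
    adaptation = speakers * ADAPTATION_SENTS
    per_microphone = training + testing + adaptation
    return {
        "speakers": speakers,
        "training": training,
        "testing": testing,
        "adaptation": adaptation,
        "per_microphone": per_microphone,
        # Assumes every utterance was captured on both microphones.
        "both_microphones": per_microphone * MICROPHONES,
    }


if __name__ == "__main__":
    for name, value in nominal_counts().items():
        print(f"{name}: {value}")
```

The "both_microphones" figure assumes that every utterance exists once per microphone, which the description suggests but does not state file by file.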

Within the train and test sets, speech data are organized by speaker; prompting texts, detailed transcriptions and speaker information are included in each speaker directory.

All waveform files have NIST SPHERE headers; waveform data are compressed using the "Shorten" algorithm developed by Tony Robinson at Cambridge University, as adapted for use in the NIST SPHERE software package. (This package is available via anonymous ftp from NIST, on ftp server "jaguar.ncsl.nist.gov" in the "pub" directory.) Complete documentation is provided on each disc in the set.
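
As an illustration of what a NIST SPHERE header looks like to a program, the following minimal sketch reads the ASCII header fields from the start of a file. It assumes the usual SPHERE layout: a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" lines terminated by "end_head". The function names and the sample header values are made up for the example and are not copied from the corpus. The sketch does not decompress the Shorten-coded samples; the NIST SPHERE package mentioned above is still needed for that.

```python
def parse_sphere_header(data):
    """Parse the ASCII header at the start of a SPHERE file held in `data` (bytes).

    Returns (header_size, fields), where fields maps names to int, float or str.
    """
    first, second, _rest = data.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(second.strip())   # header length in bytes, e.g. 1024

    fields = {}
    text = data[:header_size].decode("ascii", errors="replace")
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):   # skip blanks and comments
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:                                   # "-sN": string of length N
            fields[name] = value
    return header_size, fields


def make_demo_header():
    """Build a hypothetical 1024-byte header for demonstration only."""
    body = (
        b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\n"
        b"channel_count -i 1\n"
        b"sample_coding -s11 pcm,shorten\n"
        b"end_head\n"
    )
    return body.ljust(1024, b" ")


if __name__ == "__main__":
    size, fields = parse_sphere_header(make_demo_header())
    print(size, fields)
```

For a real file, the same function could be applied to the first few kilobytes read in binary mode; the waveform data begin at the byte offset given by the returned header size.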




Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.