Introduction
CSLU: ISOLET Spoken Letter Database Version 1.3,
Linguistic Data Consortium (LDC) catalog number LDC2008S07 and ISBN 1-58563-488-3,
was created by the Center for Spoken Language Understanding (CSLU) at OGI School
of Science and Engineering, Oregon Health and Science University, Beaverton,
Oregon.
CSLU: ISOLET Spoken Letter Database Version 1.3
is a database of letters of the English alphabet spoken in isolation under quiet
laboratory conditions, together with associated transcripts. The data were collected in
1990 and consist of two productions of each letter by 150 speakers (7800 spoken
letters), for approximately 1.25 hours of speech. The subjects were recruited
through advertising and consisted of 75 male speakers and 75 female speakers.
Each subject received a free dessert at a local restaurant in exchange for his
or her participation in the data collection. All speakers reported English as
their native language. Their ages varied from 14 to 72 years; the speakers'
average age was 35 years.
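The corpus size quoted above follows directly from the collection design, and the short Python check below reproduces it. The mean utterance length is not stated in the documentation; it is only an estimate derived here from the approximate 1.25-hour total.

```python
# Figures taken from the corpus description.
SPEAKERS = 150
LETTERS = 26           # letters of the English alphabet
PRODUCTIONS = 2        # each letter was spoken twice by every speaker
TOTAL_HOURS = 1.25     # approximate total reported for the corpus

total_utterances = SPEAKERS * LETTERS * PRODUCTIONS

# Derived estimate (not a documented figure): mean length of one utterance.
mean_seconds = TOTAL_HOURS * 3600 / total_utterances

print(total_utterances)          # 7800
print(round(mean_seconds, 2))    # 0.58
```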
Data
Speech was recorded in the OGI speech recognition laboratory. The room measured
15' by 15' with a tile floor, standard office wall board and drop ceiling and
contained two Sun workstations and three disk drives.
The recording equipment was selected to mimic the equipment used to collect
the TIMIT database as closely as possible. The speech was recorded with a
Sennheiser HMD 224 noise-canceling microphone and low-pass filtered at 7.6 kHz.
Data capture was performed using the AT&T DSP32 board installed in a Sun 4/110.
The data were sampled at 16 kHz and converted to RIFF (.WAV) format.
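As a quick sanity check on a downloaded file, the header of a .WAV file can be read with Python's standard `wave` module. The sketch below is an illustration, not part of any corpus tooling: `describe_wav` is a hypothetical helper name, and it assumes the files are plain PCM RIFF files that `wave` can open. A 16 kHz sampling rate gives a Nyquist frequency of 8 kHz, which sits just above the 7.6 kHz low-pass cutoff.

```python
import wave

EXPECTED_RATE_HZ = 16000   # sampling rate stated in the corpus description
LOWPASS_HZ = 7600          # low-pass filter cutoff stated in the description


def describe_wav(source):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV.

    `source` may be a file path or an open binary file object.
    Hypothetical helper for illustration only.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


# The filter cutoff must lie below the Nyquist frequency (half the sample rate).
nyquist_hz = EXPECTED_RATE_HZ / 2
assert LOWPASS_HZ < nyquist_hz
```

Calling `describe_wav` on a corpus file and comparing the returned rate with `EXPECTED_RATE_HZ` would flag any file that was resampled along the way.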
The subjects were seated in front of a Sun workstation and prompted with letters
in random order. After each prompt, the subject would strike the return
key and say the letter. Two seconds of speech were recorded and immediately
played back for verification. If the subject spoke too soon or too late and
missed the two-second buffer, or if the experimenter or subject decided that
the letter was misspoken, the recording was repeated. There was no attempt to
elicit ideal speech. A letter was judged to be misspoken only if there was a
significant departure from normal pronunciation.
After the recording session, each utterance was checked by a human examiner
in two steps. First, the examiner viewed a waveform of the utterance
to confirm that the speech was padded with silence. The examiner then listened
to the speech and noted any ambiguous or misspoken utterances. All utterances
noted by the examiner were examined by two additional human examiners. If a
majority of the examiners perceived that an utterance was abnormal, that utterance,
and the rest of the utterances from that speaker, were removed from the corpus.
The transcriptions of the recorded speech are time-aligned phonetic transcriptions
conforming to the CSLU Labeling standards. Time-aligned
word transcriptions are represented in a standard orthography or romanization.
Speech and non-speech phenomena are distinguished. The transcriptions are aligned
to a waveform by placing boundaries to mark the beginning and ending of words.
In addition to the specification of boundaries, this level of transcription
includes additional commentary on salient speech and non-speech characteristics,
such as glottalization, inhalation, and exhalation.
Samples
As an example of the data in this corpus, an audio sample (.WAV) is available of a speaker saying the letter "a". The labeling for this sample is shown below:
MillisecondsPerFrame: 1.000000
END OF HEADER
0 95 .pau
95 285 ^
285 425 .pau
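A label file like the one above can be read with a few lines of Python. The sketch below is based only on this single sample: it assumes the header ends at the `END OF HEADER` line, that `MillisecondsPerFrame` scales the start and end values to milliseconds, that each remaining line holds a start, an end, and a label separated by whitespace, and that `.pau` marks a pause. `parse_labels` and `Segment` are hypothetical names, not part of any CSLU tool.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_ms: float
    end_ms: float
    label: str


def parse_labels(text: str) -> List[Segment]:
    """Parse label text into segments with times in milliseconds."""
    ms_per_frame = 1.0      # assumed default if the header omits the field
    in_header = True
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if in_header:
            if line == "END OF HEADER":
                in_header = False
            elif line.startswith("MillisecondsPerFrame:"):
                ms_per_frame = float(line.split(":", 1)[1])
            continue
        # Body line: start frame, end frame, label.
        start, end, label = line.split(None, 2)
        segments.append(Segment(float(start) * ms_per_frame,
                                float(end) * ms_per_frame,
                                label))
    return segments


SAMPLE = """MillisecondsPerFrame: 1.000000
END OF HEADER
0 95 .pau
95 285 ^
285 425 .pau
"""

segments = parse_labels(SAMPLE)
# Total duration of the segments not labeled as a pause (.pau).
speech_ms = sum(s.end_ms - s.start_ms for s in segments if s.label != ".pau")
print(speech_ms)   # 190.0
```

For this sample, the spoken portion occupies 190 ms of the 425 ms recording, with the remainder labeled as pause on either side.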
Content Copyright
Portions © 1990, 1996, 2000, 2002 Center for Spoken Language Understanding,
Oregon Health and Science University, © 2008 Trustees of the University
of Pennsylvania