Introduction:
CSLU: Numbers Version 1.3, Linguistic Data Consortium
(LDC) catalog number LDC2009S01 and ISBN 1-58563-501-4, was created by the Center
for Spoken Language Understanding (CSLU) at OGI School of Science and Engineering,
Oregon Health and Science University, Beaverton, Oregon. It is a collection
of naturally produced numbers taken from utterances in various CSLU telephone
speech data collections. The corpus consists of approximately fifteen hours
of speech and includes isolated digit strings, continuous digit strings, and
ordinal/cardinal numbers.
The numbers come from several sources, among them phone numbers, street
addresses and zip codes, and were uttered by 12618 speakers in a total of 23902 files.
In most of CSLU's telephone data collections, callers were asked for their phone
number, birthdate or zip code. Callers would also occasionally leave numbers
in the midst of another utterance. The numbers in those situations were extracted
from the host utterance and added to the corpus.
Additional information about
this publication is available from the corpus web page at CSLU.
Data:
The speech data was collected over analog and
digital telephone lines. The analog data was recorded
using a Gradient Technologies analog-to-digital conversion box; those files
were recorded as 16-bit, 8 kHz and stored in a linear format. The digital data
was recorded with the CSLU T1 digital data collection system; those files were
sampled at 8 kHz, 8-bit and stored as mu-law (ulaw) files. All of the data in this
release has been converted to 16-bit linear encoding and stored in the standard RIFF file format.
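The conversion from 8-bit mu-law to 16-bit linear samples can be illustrated with a short Python sketch. It is not the tool CSLU used, which the source does not describe; it assumes standard G.711 mu-law expansion and a single (mono) channel, and the function names `ulaw_to_linear` and `ulaw_bytes_to_wav` are hypothetical.

```python
import io
import wave

BIAS = 0x84  # bias constant (132) used by G.711 mu-law companding


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF              # mu-law codes are stored bit-inverted
    sign = u & 0x80               # top bit carries the sign
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def ulaw_bytes_to_wav(ulaw: bytes, rate: int = 8000) -> bytes:
    """Decode mu-law bytes and wrap them in a 16-bit RIFF/WAVE container.

    Assumption: one channel (mono) at the 8 kHz rate given in the text.
    """
    pcm = b"".join(
        ulaw_to_linear(b).to_bytes(2, "little", signed=True) for b in ulaw
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)         # 2 bytes = 16 bits per sample
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()


# Three sample codes: silence, largest positive, largest negative.
wav = ulaw_bytes_to_wav(bytes([0xFF, 0x80, 0x00]))
with wave.open(io.BytesIO(wav), "rb") as r:
    print(r.getframerate(), r.getsampwidth() * 8, r.getnframes())
```

The same `wave` calls used in the last three lines can be pointed at a file from the release to check its sampling rate and sample width.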
Each file includes an orthographic transcription
following the CSLU Labeling guidelines, which are included in the documentation
for this publication. Also, many of the utterances have been phonetically labeled.
Statistics:
CSLU: Numbers Version 1.3 consists
of approximately fifteen hours of speech. The following table gives a count of
the number of files for each utterance type.
| Type    | Number |
| phone   | 2970   |
| street  | 7079   |
| zipcode | 7076   |
| other   | 6771   |
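As a quick way to see how the utterance types are distributed, the per-type counts from the table can be tallied in a few lines of Python. This is only an illustrative sketch; the dictionary simply restates the table, and the variable names are hypothetical.

```python
# File counts per utterance type, copied from the table above.
counts = {"phone": 2970, "street": 7079, "zipcode": 7076, "other": 6771}

total = sum(counts.values())

# Print each type with its count and its share of the listed files.
for kind, n in counts.items():
    print(f"{kind:8s} {n:5d} {100 * n / total:5.1f}%")
print(f"{'total':8s} {total:5d}")
```

The counts as listed sum to 23896, slightly fewer than the 23902 files cited in the introduction; the source does not explain the difference.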
Samples:
For an example of the data contained in this corpus, please examine the audio files and labels for the following spoken sequences
Content Copyright:
Portions © 1998, 2000, 2002 Center
for Spoken Language Understanding, Oregon Health & Science University,
© 2009 Trustees of the University of Pennsylvania