A common problem in training and developing speech
recognition systems is scarcity of data, especially for
particular phonemic contexts. The Center for Spoken
Language Understanding is attempting to address this
problem with the Names Corpus. The Names Corpus is a
collection of name utterances, both first and last names,
from several thousand different speakers over the
telephone. Name utterances are "spontaneous" in that the
subject is not reading from a word list.
Another area of active research is the development of name
recognition systems. The Names Corpus is a useful resource
for addressing this problem.
The utterances in this corpus were taken from many other
telephone speech data collections that have been completed
at the CSLU. In most data collections, the callers were
asked to leave their name at some point. Callers also
occasionally gave their name in the midst of another
utterance. In those cases the name was extracted from
the host utterance and added to the Names Corpus.
Each file in the Names Corpus has an orthographic
transcription following the CSLU Labeling Conventions.
Also, to take advantage of the phonemic variability, many
of the utterances have been phonetically transcribed. Files
were selected for phonetic transcription when they were
suspected to contain phonetic contexts that had not yet
been transcribed.
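
The selection idea described above can be illustrated with a small sketch. This is not the procedure CSLU used, which is not detailed here; it assumes that each candidate file comes with a predicted phoneme sequence (for example, guessed from its orthographic transcription) and that a "phonetic context" means an ordered pair of adjacent phonemes. The function name, file names, and phoneme symbols are all hypothetical.

```python
def select_for_transcription(candidates, covered):
    """One-pass selection of files that would add unseen phoneme bigrams.

    candidates: dict mapping a (hypothetical) file name to its predicted
                phoneme sequence, given as a list of strings.
    covered:    set of (left, right) phoneme pairs already transcribed.
    Returns the selected file names and the updated coverage set.
    """
    covered = set(covered)  # copy so the caller's set is not modified
    selected = []
    for name, phones in candidates.items():
        # Adjacent phoneme pairs this file is suspected to contain.
        new_pairs = set(zip(phones, phones[1:])) - covered
        if new_pairs:
            selected.append(name)
            covered |= new_pairs
    return selected, covered


# Toy data, invented for illustration only.
candidates = {
    "a.wav": ["k", "ae", "t"],
    "b.wav": ["k", "ae", "t"],   # adds nothing new after a.wav
    "c.wav": ["ae", "t", "s"],   # adds the pair ("t", "s")
}
selected, covered = select_for_transcription(candidates, set())
```

With this toy input, the second file is skipped because the first one already covers every context it is predicted to contain.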
Release 1.3 of this corpus contains 24,245 files. All of
these have been phonetically labeled. Approximately 40% of
all possible phoneme bigram contexts, counted without regard
to language constraints, are represented.
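
A coverage figure of this kind can be computed along the following lines. This is a minimal sketch under stated assumptions: the phoneme inventory and transcriptions below are toy data, not taken from the corpus, and "possible" is taken to mean every ordered pair of inventory phonemes (inventory size squared), which is one reading of "without regard to language constraints".

```python
def bigram_coverage(transcriptions, inventory):
    """Fraction of all ordered phoneme pairs that occur at least once.

    transcriptions: iterable of phoneme sequences (lists of strings).
    inventory:      the phoneme set; every ordered pair counts as possible.
    """
    inventory = set(inventory)
    seen = set()
    for phones in transcriptions:
        # Walk adjacent pairs within a single utterance.
        for left, right in zip(phones, phones[1:]):
            if left in inventory and right in inventory:
                seen.add((left, right))
    possible = len(inventory) ** 2
    return len(seen) / possible


# Toy inventory and transcriptions, invented for illustration only.
toy_inventory = ["k", "ae", "t", "s"]
toy_transcriptions = [["k", "ae", "t"], ["s", "ae", "t", "s"]]
coverage = bigram_coverage(toy_transcriptions, toy_inventory)
```

Here four distinct pairs are observed out of 16 possible, so the toy coverage is 0.25.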
Samples
For an example of the data in this publication, please review this audio sample and its transcription.
Content Copyright
Portions © 2001, 2003 Speech Technology Center Ltd., © 2006 Trustees of the University of Pennsylvania