Introduction
Voicemail Corpus Part II was produced by the
Linguistic Data Consortium (LDC) under catalog number LDC2002S35 and ISBN
1-58563-242-2. Voicemail Corpus Part II is a continuation of Voicemail
Corpus Part I, LDC98S77.
Data
This publication comprises speech and script files, divided into
training and evaluation data. The training data consists
of 2,048 voicemail messages and the corresponding script files. The
speech and script files are organized in 41 directories, each of which
contains up to 50 messages. The evaluation data
consists of 50 voicemail messages and 50 scripts.
The speech data is provided in NIST SPHERE format, sampled at 8 kHz and encoded as 8-bit μ-law,
totaling approximately 14 hours (406 MB) for training and 23 minutes (11 MB) for evaluation.
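As a rough consistency check on the figures above, the following sketch recomputes the expected audio sizes and the minimum number of training directories. It assumes single-channel audio at one byte per sample (the source states only 8 kHz and 8-bit encoding) and ignores file headers, so it will not match the stated sizes exactly. All names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 8000   # 8 kHz sampling rate, as stated in the description
BYTES_PER_SAMPLE = 1    # 8-bit mu-law: one byte per sample
CHANNELS = 1            # assumption: single-channel recordings


def audio_megabytes(duration_seconds: float) -> float:
    """Approximate audio payload in MB (10**6 bytes), excluding file headers."""
    return duration_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS / 1e6


train_mb = audio_megabytes(14 * 3600)  # about 14 hours of training audio
eval_mb = audio_megabytes(23 * 60)     # about 23 minutes of evaluation audio

# 2,048 messages at no more than 50 per directory need at least this many directories.
min_dirs = math.ceil(2048 / 50)

print(f"training: {train_mb:.1f} MB, evaluation: {eval_mb:.1f} MB, directories: {min_dirs}")
```

Under these assumptions the estimates come to roughly 403 MB and 11 MB, close to the stated 406 MB and 11 MB, and the minimum directory count is 41, which matches the description.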
In addition to the individual script files, there are three files
containing concatenated scripts. train_scripts.all and eval_scripts.all
concatenate the training and evaluation script files, respectively, one script per
line, each line beginning with the file ID. eval_scripts_filtered.all is a filtered
version of eval_scripts.all, with the
tagged elements (<*>) and the proper-noun marker removed.
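A minimal sketch of how the concatenated script files might be read follows. It assumes that the file ID is the first whitespace-delimited token on each line and that tagged elements are spans enclosed in angle brackets. Both are assumptions about the layout, and the sample line is hypothetical. The proper-noun marker is not handled because its form is not described here.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a tagged element is any span enclosed in angle brackets.
TAG_PATTERN = re.compile(r"<[^<>]*>")


def read_scripts(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (file_id, transcript) pairs, one per non-empty line."""
    for line in lines:
        parts = line.strip().split(None, 1)  # split off the first token only
        if not parts:
            continue  # skip blank lines
        file_id = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        yield file_id, text


def strip_tags(text: str) -> str:
    """Remove angle-bracket tags and collapse the remaining whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", text).split())


# Hypothetical line in the assumed layout; real file IDs and tags may differ.
sample = ["msg0001 hi this is <noise> calling about the meeting\n"]
for file_id, text in read_scripts(sample):
    print(file_id, "|", strip_tags(text))
```

To process a real file, pass an open file handle, for example `read_scripts(open("train_scripts.all", encoding="utf-8"))`. The encoding is also an assumption.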
Updates
A more recent version of the paper "Automatic Speech Recognition
Performance on a Voicemail Transcription Task" (M. Padmanabhan, G. Saon, J. Huang, B.
Kingsbury and L. Mangu, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7,
pp. 433-442, October 2002) is available in both PDF and PS formats by email request.
Content Copyright
Portions © 2002 International Business Machines Corporation, © 2002 Trustees of the University of Pennsylvania
Pricing
The Reduced Licensing Fee for this corpus is US$150.