Voicemail Corpus Part II

Item Name: Voicemail Corpus Part II
Authors: Mukund Padmanabhan, Brian Kingsbury, Bhuvana Ramabhadran, Jing Huang, Stanley Chen, George Saon, and Lidia Mangu
LDC Catalog No.: LDC2002S35
ISBN: 1-58563-242-2
Release Date: Nov 08, 2002
Data Type: speech
Sample Rate: 8000 Hz
Sampling Format: ulaw
Data Source(s): telephone speech
Application(s): speech recognition
Language(s): English
Language ID(s): eng
Distribution: 1 CD
Member fee: $0 for 2002 members
Non-member Fee: N/A (Members Only)
Reduced-License Fee: N/A
Extra-Copy Fee: US $150.00
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Mukund Padmanabhan, et al. 2002. Voicemail Corpus Part II. Linguistic Data Consortium, Philadelphia.

Introduction

Voicemail Corpus Part II was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2002S35 and ISBN 1-58563-242-2. It is a continuation of Voicemail Corpus Part I (LDC98S77).

Data

This publication comprises speech and script files, divided into training and evaluation data. The training data consists of 2,048 voicemail messages and their corresponding script files. The speech and script files are organized in 41 directories, each of which contains up to 50 messages. The evaluation data consists of 50 voicemail messages and 50 scripts.

The speech data is provided in SPHERE format. It is sampled at 8 kHz and recorded in 8-bit ulaw, totalling approximately 14 hours (406 MB) for training and 23 minutes (11 MB) for evaluation.
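As a rough illustration of working with such files, the following Python sketch reads a SPHERE-style text header and derives a message duration from it. It assumes the common NIST SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" fields ending with "end_head"); the field names sample_rate, sample_count and sample_coding are the usual SPHERE names and are an assumption here, not something this page confirms for the corpus. The helper ulaw_payload_bytes is hypothetical and only restates the arithmetic of one byte per sample at 8000 samples per second.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse a SPHERE-style header from the first bytes of a file.

    Assumes the usual layout: 'NIST_1A', then the header size in bytes,
    then 'name -type value' lines, terminated by 'end_head'.
    """
    first = raw.split(b"\n", 2)
    if len(first) < 3 or first[0].strip() != b"NIST_1A":
        raise ValueError("not a SPHERE header")
    header_size = int(first[1].strip())
    fields = {"header_size": header_size}

    text = raw[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":        # integer field
            fields[name] = int(value)
        elif kind == "-r":      # real-valued field
            fields[name] = float(value)
        else:                   # string field, e.g. -s4
            fields[name] = value
    return fields


def duration_seconds(fields: dict) -> float:
    """Duration implied by the (assumed) sample_count and sample_rate fields."""
    return fields["sample_count"] / fields["sample_rate"]


def ulaw_payload_bytes(seconds: float, sample_rate: int = 8000) -> int:
    """Audio payload size for 8-bit ulaw: one byte per sample."""
    return int(seconds * sample_rate)
```

Under these assumptions, 14 hours of 8 kHz, 8-bit audio is 14 x 3600 x 8000 = 403,200,000 bytes of samples and 23 minutes is 11,040,000 bytes, which is in line with the approximate sizes quoted above once per-file headers are added.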

In addition to the individual script files, three files concatenate the individual scripts. train_scripts.all and eval_scripts.all concatenate the training and evaluation script files, respectively, with one script per line and each line beginning with the fileID. eval_scripts_filtered.all is a filtered version of eval_scripts.all, with the tagged elements (<*>) and the proper-noun marker removed.
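A minimal Python sketch of reading one line of such a concatenated file is shown below. It assumes that the fileID is separated from the transcript by whitespace and that tagged elements are any text enclosed in angle brackets; neither detail is specified on this page. The proper-noun marker is not described here, so the sketch leaves it untouched, which means the output is not guaranteed to match eval_scripts_filtered.all. The function names are hypothetical.

```python
import re

# Assumption: a "tagged element" is any <...> span with no nested brackets.
TAG = re.compile(r"<[^<>]*>")


def parse_script_line(line: str) -> tuple:
    """Split one line into (fileID, transcript).

    Assumes the fileID is the first whitespace-delimited token.
    """
    parts = line.split(None, 1)
    if not parts:
        raise ValueError("empty script line")
    file_id = parts[0]
    text = parts[1].strip() if len(parts) > 1 else ""
    return file_id, text


def strip_tags(text: str) -> str:
    """Remove <...> elements and collapse the remaining whitespace."""
    return " ".join(TAG.sub(" ", text).split())


def load_scripts(path: str) -> dict:
    """Map each fileID to its tag-free transcript, skipping blank lines."""
    scripts = {}
    with open(path, encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.strip():
                file_id, text = parse_script_line(line)
                scripts[file_id] = strip_tags(text)
    return scripts
```

For example, load_scripts("train_scripts.all") would return a dictionary keyed by fileID, provided the file follows the layout assumed above.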

Updates

A more recent version of the paper "Automatic Speech Recognition Performance on a Voicemail Transcription Task" (M. Padmanabhan, G. Saon, J. Huang, B. Kingsbury, and L. Mangu, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, pp. 433-442, October 2002) is available in both PDF and PostScript format by email request.

Content Copyright

Portions © 2002 International Business Machines Corporation, © 2002 Trustees of the University of Pennsylvania

Pricing

The Reduced Licensing Fee for this corpus is US$150.



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.