CSLU: Multilanguage Telephone Speech Version 1.2

Item Name: CSLU: Multilanguage Telephone Speech Version 1.2
Authors: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika
LDC Catalog No.: LDC2006S35
ISBN: 1-58563-390-9
Release Date: Jun 15, 2006
Data Type: speech
Sample Rate: 8000 Hz
Sampling Format: pcm
Data Source(s): telephone speech
Application(s): language identification, machine translation
Language(s): English, French, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil, Vietnamese, Western Farsi
Language ID(s): cmn, deu, eng, fra, hin, jpn, kor, pes, spa, tam, vie
Distribution: 1 DVD
Member fee: $0 for 2006 members
Non-member Fee: US$150.00
Reduced-License Fee: US$150.00
Extra-Copy Fee: US$150.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika. 2006. CSLU: Multilanguage Telephone Speech Version 1.2. Linguistic Data Consortium, Philadelphia.

Introduction

The Multilanguage Telephone Speech corpus consists of telephone speech in 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2,052 speakers, for a total of about 38.5 hours of speech. Time-aligned phonetic transcriptions are also included for 619 of the utterances.

Data

Each subject called the CSLU data collection system by dialing a toll-free number. An analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz, and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
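
As a rough illustration of what these recording parameters imply, the sketch below converts between duration and storage size for 8 kHz, 16-bit linear audio. It assumes single-channel (mono) audio and ignores any file headers; neither detail is stated in this description. The function names are hypothetical and are not part of any CSLU or LDC tooling.

```python
# Recording parameters taken from the corpus description.
SAMPLE_RATE_HZ = 8000      # 8 kHz sampling rate
BYTES_PER_SAMPLE = 2       # 16-bit linear samples
CHANNELS = 1               # assumption: mono telephone audio

BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def pcm_bytes(duration_s: float) -> int:
    """Approximate size in bytes of the raw audio for a given duration."""
    return int(duration_s * BYTES_PER_SECOND)


def pcm_duration_s(num_bytes: int) -> float:
    """Approximate duration in seconds of a given number of audio bytes."""
    return num_bytes / BYTES_PER_SECOND


# About 38.5 hours of speech, as stated in the introduction.
total_seconds = 38.5 * 3600
total_bytes = pcm_bytes(total_seconds)

if __name__ == "__main__":
    print(f"{total_bytes} bytes, about {total_bytes / 1e9:.2f} GB")
```

Under these assumptions, the stated 38.5 hours of speech comes to roughly 2.2 GB of audio samples, which is consistent with distribution on a single DVD.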

Samples

For an example of the data in this corpus, please listen to these audio samples in Tamil and English.

Content Copyright

Portions © 1992, 2000, 2002 Center for Spoken Language Understanding, Oregon Health & Science University, © 2006 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.