


HKUST Mandarin Telephone Speech, Part 1

Item Name: HKUST Mandarin Telephone Speech, Part 1
Authors: Pascale Fung, Shudong Huang, and David Graff
LDC Catalog No.: LDC2005S15
ISBN: 1-58563-351-8
Release Date: Jul 15, 2005
Data Type: speech
Sample Rate: 8000 Hz
Sampling Format: alaw
Data Source(s): telephone speech
Project(s): EARS, GALE
Application(s): automatic content extraction
Language(s): Mandarin Chinese
Language ID(s): cmn
Distribution: 2 DVD
Member fee: $0 for 2005 members
Non-member Fee: US $3000.00
Reduced-License Fee: US $1500.00
Extra-Copy Fee: US $400.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Pascale Fung, Shudong Huang, and David Graff. 2005. HKUST Mandarin Telephone Speech, Part 1. Linguistic Data Consortium, Philadelphia.

Introduction

In 2004, the Hong Kong University of Science and Technology (HKUST) was contracted to collect and transcribe 200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China under the DARPA EARS framework. The first 50 hours of speech and transcripts were released in June 2004 to the EARS community for the RT-04 NIST evaluation. NIST partitioned the remaining 150 hours of collection into training, development and evaluation sets. This release contains the training and development sets with 873 and 24 calls, respectively.

Data Collection

Subjects were recruited in several cities across mainland China. Most subjects did not previously know each other. To encourage more meaningful conversation, topics similar to those in Fisher English were designed. All calls were operator-assisted: an operator called the two participants at the scheduled time to initiate the call. Subjects were asked demographic questions before they were bridged for normal conversation, and their answers were recorded in separate files.

Subjects were allowed to talk for up to 10 minutes, and with a few exceptions calls run the maximum length. Although subjects were allowed to make up to three calls, every subject in this release made just one call, with one exception: PIN 10683 and PIN 10686 belong to a single individual.

Each side of a call was recorded in a separate .wav file as 8-bit a-law samples at 8 kHz. The two sides were later multiplexed into a single SPHERE-format file with the a-law encoding preserved. Where one side was shorter than the other, the shorter side was padded with silence. In the release, each recorded call is named "date_time_Apin_Bpin.sph", and the corresponding transcript has the same name with a .txt extension.
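As an illustration of the naming convention above, the following Python sketch splits a call file name into its parts and derives the transcript name. It assumes that the four fields are separated by underscores and that the literal letters "A" and "B" precede the two PINs; the function names and the example file name in the usage note are hypothetical and are not taken from the corpus documentation.

```python
from pathlib import PurePath
from typing import NamedTuple


class CallName(NamedTuple):
    date: str   # date field, kept as a string (exact layout is an assumption)
    time: str   # time field, kept as a string
    pin_a: str  # PIN of the speaker on side A
    pin_b: str  # PIN of the speaker on side B


def parse_call_name(filename: str) -> CallName:
    """Split a 'date_time_Apin_Bpin.sph' style name into its four fields."""
    stem = PurePath(filename).stem
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields: {filename!r}")
    date, time, side_a, side_b = parts
    # Assumption: the side letter is a one-character prefix on each PIN.
    if not side_a.startswith("A") or not side_b.startswith("B"):
        raise ValueError(f"expected 'A<pin>' and 'B<pin>' fields: {filename!r}")
    return CallName(date, time, side_a[1:], side_b[1:])


def transcript_name(filename: str) -> str:
    """Return the transcript name: same base name with a .txt extension."""
    return str(PurePath(filename).with_suffix(".txt"))
```

For a made-up name such as "20040503_222707_A000687_B000688.sph", `parse_call_name` would yield the two PIN strings "000687" and "000688", and `transcript_name` would yield the same base name ending in ".txt". PINs are kept as strings so that any leading zeros survive.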

Speaker demographics

Subjects were asked to provide several pieces of demographic information, including gender, age, native language/dialect, birthplace, education, occupation, and phone type. Standard Mandarin is not the native dialect in many regions of China, but it is the official language of education, and speakers may or may not have a regional accent when speaking it. For this reason, subjects' birthplaces were divided into Mandarin-dominant and non-Mandarin-dominant regions, and all calls were audited and classified as either standard or accented, without further distinctions.

Selected demographics (age, gender, birthplace, phone type and accent for each side of the call, plus the topic ID for the call) are provided as a tab-delimited plain-text file.
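A tab-delimited file of this kind can be read with Python's standard csv module. The sketch below is only a starting point: the column order, the presence of a header row, and the text encoding are not stated here, so the code makes no assumption about them and leaves the column index to the caller. The helper names are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable, List


def read_rows(lines: Iterable[str]) -> List[List[str]]:
    """Parse tab-delimited lines into lists of fields, skipping blank lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def tally_column(rows: List[List[str]], index: int) -> Counter:
    """Count the values found in one column (0-based index chosen by the caller)."""
    return Counter(row[index] for row in rows if len(row) > index)
```

An open text file can be passed directly as `lines`, since iterating over it yields one line at a time. Which index holds, say, the accent label has to be checked against the corpus documentation before tallying.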

Samples

To review an example of this corpus, please examine this audio sample.

Content Copyright

© 2005 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.