


West Point Russian Speech

Item Name: West Point Russian Speech
Authors: Col. Stephen A. LaRocca and Christine Tomei
LDC Catalog No.: LDC2003S05
ISBN: 1-58563-277-5
Release Date: Dec 18, 2003
Data Type: speech
Sample Rate: 22050 Hz
Sampling Format: 1-channel pcm
Data Source(s): microphone speech
Application(s): speech recognition
Language(s): Russian
Language ID(s): rus
Distribution: 1 CD
Member fee: $0 for 2003 members
Non-member Fee: US $500.00
Reduced-License Fee: US $250.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Col. Stephen A. LaRocca and Christine Tomei. 2003. West Point Russian Speech. Linguistic Data Consortium, Philadelphia.

Introduction

West Point Russian Speech was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2003S05 and ISBN 1-58563-277-5.

The West Point Russian Speech corpus was developed at the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) at the United States Military Academy at West Point. The purpose of the corpus is to provide a set of recordings for the training and development of speaker-independent speech recognition systems for use by West Point cadets enrolled in the Russian language program.

Data

The corpus consists of 4,181 speech files in SPHERE format, totalling approximately four hours of speech. Of these, 2,290 files are from native informants and 1,891 are from non-native informants.
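Because the files are in SPHERE format, each one begins with a plain-text header that can be inspected before decoding any audio. The following Python sketch parses such a header under the assumption that the files follow the usual NIST SPHERE layout (a `NIST_1A` line, a header-size line, then `name type value` fields ending in `end_head`). The function names and the demo header are illustrative and are not taken from the corpus documentation.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    # The first 16 bytes hold the magic string and the total header size.
    first_lines = data[:16].decode("ascii").split("\n")
    if first_lines[0] != "NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(first_lines[1].strip())

    fields = {}
    body = data[16:header_size].decode("ascii", errors="replace")
    for line in body.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:
            # "-sN" marks a string field of N characters
            fields[name] = value
    return fields


def make_demo_header() -> bytes:
    """Build a synthetic 1024-byte header (hypothetical, for illustration only)."""
    text = (
        "NIST_1A\n   1024\n"
        "sample_rate -i 22050\n"
        "channel_count -i 1\n"
        "sample_n_bytes -i 2\n"
        "sample_coding -s3 pcm\n"
        "end_head\n"
    )
    return text.encode("ascii").ljust(1024, b" ")


if __name__ == "__main__":
    print(parse_sphere_header(make_demo_header()))
```

For a real file, the same function could be applied to the first bytes read from disk, for example `parse_sphere_header(open(path, "rb").read(1024))`, assuming the header size is 1024 bytes.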

The following tables show the breakdown of corpus content in terms of male, female, native and non-native speakers.

Number of speakers:

              male   female   total
native          13       16      29
non-native      16       10      26
totals          29       26      55

Number of speech files:

              male   female   total
native        1027     1263    2290
non-native    1103      788    1891
totals        2130     2051    4181

The speech data was collected using laptop computers running Windows NT. Recordings were captured as 16-bit PCM at a sampling rate of 22,050 Hz using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. For each prompt, the informant was shown the sentence on screen and played a digital recording of it as read by a native speaker. The informant then pressed the Enter key to record the utterance. The recording was played back for review, and the utterance was re-recorded if necessary.
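The recording parameters allow a rough estimate of the corpus size and the average utterance length. The Python sketch below works through the arithmetic, assuming uncompressed single-channel audio and treating "approximately four hours" as exactly four hours; both are simplifying assumptions, so the results are approximate.

```python
SAMPLE_RATE_HZ = 22_050   # from the corpus description
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # 1-channel recordings
NUM_FILES = 4_181
TOTAL_SECONDS = 4 * 3600  # assumption: "approximately four hours" taken as 4 h

# Raw audio data rate and total payload, ignoring file headers.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
total_bytes = bytes_per_second * TOTAL_SECONDS

# Average duration of one recorded utterance.
mean_seconds_per_file = TOTAL_SECONDS / NUM_FILES

if __name__ == "__main__":
    print(f"data rate: {bytes_per_second} bytes/s")
    print(f"audio payload: {total_bytes / 1e6:.0f} MB")
    print(f"mean utterance length: {mean_seconds_per_file:.2f} s")
```

Under these assumptions the audio payload comes to roughly 635 MB, which is consistent with distribution on a single CD, and the average recording is about 3.4 seconds long.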

The collection script consists of 96 sentences with a total of 528 tokens and 351 types.

Each waveform file has monophone-level and word-level transcriptions in HTK master label file format. A concatenated version of the master label files at both the word level and the phone level is provided.
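HTK master label files are plain text, so the transcriptions can be read without HTK itself. The Python sketch below parses the general MLF layout: a `#!MLF!#` header, a quoted file pattern for each utterance, one label per line, and a line containing only `.` to close each entry. It handles only two label layouts, `start end label` and a bare `label`. Whether the corpus files include time stamps is not stated here, and the file names and labels in the demo are hypothetical.

```python
def parse_mlf(text: str) -> dict:
    """Map each label-file pattern in an HTK MLF to its (start, end, label) tuples."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "#!MLF!#":
        raise ValueError("missing #!MLF!# header")

    entries = {}
    current = None  # pattern of the entry being read, or None between entries
    for raw in lines[1:]:
        line = raw.strip()
        if not line:
            continue
        if current is None:
            # A quoted pattern such as "*/name.lab" opens a new entry.
            current = line.strip('"')
            entries[current] = []
        elif line == ".":
            # A single period closes the current entry.
            current = None
        else:
            parts = line.split()
            if len(parts) >= 3 and parts[0].isdigit() and parts[1].isdigit():
                # Times are in HTK units of 100 nanoseconds.
                entries[current].append((int(parts[0]), int(parts[1]), parts[2]))
            else:
                entries[current].append((None, None, parts[0]))
    return entries


# Hypothetical example content, not taken from the corpus.
DEMO_MLF = (
    '#!MLF!#\n'
    '"*/demo_001.lab"\n'
    '0 1500000 sil\n'
    '1500000 4000000 d\n'
    '.\n'
    '"*/demo_002.lab"\n'
    'da\n'
    'net\n'
    '.\n'
)

if __name__ == "__main__":
    for pattern, labels in parse_mlf(DEMO_MLF).items():
        print(pattern, labels)
```

Keying the result by file pattern makes it straightforward to look up the transcription for a given waveform file once the pattern is matched against its name.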

The lexicon contains 690 distinct orthographic word forms, including all words found in the collection script.

Updates

There are no updates available at this time.

Content Copyright

Portions © 2003 United States Military Academy, © 2003 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.