



West Point Brazilian Portuguese Speech

Item Name: West Point Brazilian Portuguese Speech
Authors: John Morgan, Sheila Ackerlind, Sterling Packer
LDC Catalog No.: LDC2008S04
ISBN: 1-58563-471-9
Release Date: May 19, 2008
Data Type: speech
Sample Rate: 22050 Hz
Sampling Format: pcm
Data Source(s): microphone speech
Application(s): speech recognition
Language(s): Portuguese
Language ID(s): POR
Distribution: 1 DVD
Member fee: $0 for 2008 members
Non-member Fee: US$500.00
Reduced-License Fee: N/A
Extra-Copy Fee: US$200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: John Morgan, Sheila Ackerlind, Sterling Packer. 2008. West Point Brazilian Portuguese Speech. Linguistic Data Consortium, Philadelphia.

Introduction

West Point Brazilian Portuguese Speech is a database of digital recordings of spoken Brazilian Portuguese. It was designed and collected by staff and faculty of the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition systems. The U.S. government uses such systems to provide speech-recognition-enhanced language learning courseware to government linguists and to students enrolled in various government language programs.

The data in this corpus was collected in March 1999 in Brasilia, Brazil, using informants from a Brazilian military academy. The corpus consists of read speech from 60 female and 68 male speakers, both native and non-native.

The speech was elicited from a prompt script containing 296 sentences and phrases typically used in language learning situations. The prompts are listed in the file prompts.txt. Each line of this file has two fields separated by a tab: the first field is the base name of the waveform file, and the second field is the prompt used to record the utterance.
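The two-field layout of prompts.txt described above can be read with a few lines of Python. This is a minimal sketch, not part of the corpus documentation: the function name parse_prompts and the sample base names and prompts are hypothetical, and the text encoding of the real file is not stated here, so it would need to be checked before opening the file from the DVD.

```python
import io


def parse_prompts(lines):
    """Map waveform base name -> prompt text from tab-separated lines."""
    prompts = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines
        # Split on the first tab only, so the prompt text is kept whole.
        base_name, tab, prompt = line.partition("\t")
        if not tab:
            raise ValueError(f"line {number}: expected two tab-separated fields")
        prompts[base_name.strip()] = prompt.strip()
    return prompts


# Hypothetical sample lines in the documented layout (not taken from the corpus).
sample = io.StringIO("s0001\tBom dia.\ns0002\tComo vai?\n")
prompts = parse_prompts(sample)
print(prompts["s0001"])
```

For the real file, the same function could be called on an open file object, for example `parse_prompts(open("prompts.txt", encoding=...))`, with the encoding filled in once it has been verified.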

A pronouncing dictionary developed by Dr. Sheila Ackerlind with help from cadet Sterling Packer is provided in the file SANTIAGO.txt.

The speech was collected using four laptop computers running MS Windows. Three of the computers recorded 16-bit samples at a sampling rate of 22050 Hz; the fourth recorded 8-bit samples at a sampling rate of 11025 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review, and the utterance could be re-recorded if necessary. A member of the data collection team was present during each recording session to verify recordings and to provide technical assistance in case of malfunctioning equipment.
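Because the recordings come in two formats (16-bit at 22050 Hz and 8-bit at 11025 Hz), it can be useful to check each file's format before acoustic model training. The sketch below uses Python's standard wave module and assumes the files are RIFF/WAVE files, which this page does not confirm for the distributed corpus. The helper names describe_wav and make_silent_wav are hypothetical, and the test audio is synthetic silence generated in memory, not corpus data.

```python
import io
import wave


def describe_wav(source):
    """Return basic format information for a WAV file path or file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def make_silent_wav(rate, sample_width_bytes, seconds=1):
    """Build an in-memory mono WAV of silence, used only to exercise describe_wav."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate)
        # 8-bit WAV samples are unsigned (silence is 0x80); wider samples are signed.
        frame = b"\x80" if sample_width_bytes == 1 else b"\x00" * sample_width_bytes
        wav.writeframes(frame * rate * seconds)
    buffer.seek(0)
    return buffer


# The two recording formats described above, reproduced with synthetic audio.
wide = describe_wav(make_silent_wav(22050, 2))
narrow = describe_wav(make_silent_wav(11025, 1))
print(wide)
print(narrow)
```

Grouping files by the reported sample rate and bit depth is one way to decide whether the 8-bit recordings should be resampled or handled separately.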

Samples

For an example of speech contained in this corpus, please listen to this audio sample (MS Wave format).

Copyright

Portions © 1999, 2004 United States Military Academy, © 2008 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.