Introduction
This file contains documentation on West Point Heroico Spanish Speech, Linguistic
Data Consortium (LDC) catalog number LDC2006S37 and ISBN 1-58563-391-7.
West Point Heroico Spanish Speech is a database of digital recordings of spoken Spanish. It was designed and collected by staff and faculty of
the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech
recognition systems. The U.S. government uses these systems to provide
speech-recognition enhanced language learning courseware to government
linguists and students enrolled in various government language programs.
Additionally, parts of this corpus were designed to model question/answer
dialogues for use in domain-specific speech-to-speech translation systems.
The corpus consists of two subcorpora, one collected in September 2001
at El Heroico Colegio Militar (HEROICO), the Mexican Military Academy
in Mexico City, and the other at the United States Military Academy (USMA) at different times since 1997. The
USMA subcorpus includes data from non-native speakers and data collected
through a throat microphone.
Data
Two kinds of prompt scripts were used, one to elicit read speech and one
for free-response answers to questions. The read speech prompts are also divided into two groups, one designed to elicit speech typical of language
learning scenarios and the other for speech from educated native speakers.
The scripts used to record read speech have a total of 724 distinct sentences. This number includes 205 short, simple sentences used in typical
language learning scenarios. The other 519 sentences were extracted from
lecture notes used at USMA in a military readings course. All of the read
speech prompts are listed in two files in the transcripts directory:
HEROICO-Recordings.txt and USMA-prompts.txt, containing the sentences read by
informants at the Mexican Military Academy and USMA, respectively. Each
line of these files has two fields separated by a tab: the first gives the base
name of the waveform file, and the second the prompt used in recording the
utterance.
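As an illustration of the two-field format described above, the following Python sketch splits each line on its first tab. The function names and the UTF-8 default are assumptions made for this example and are not part of the corpus documentation; the actual character encoding of the transcript files should be checked before use.

```python
def parse_prompt_line(line):
    """Split one prompt-file line into (waveform base name, prompt text)."""
    # Split on the first tab only, in case the prompt itself contains tabs.
    base_name, prompt = line.rstrip("\r\n").split("\t", 1)
    return base_name, prompt


def load_prompts(path, encoding="utf-8"):
    """Read a two-field, tab-separated prompt file into a list of pairs.

    The encoding default is an assumption; pass the real one if it differs.
    """
    entries = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():  # ignore blank lines
                entries.append(parse_prompt_line(line))
    return entries
```

A list of pairs is returned rather than a dictionary so that the sketch does not assume base names are unique within a file.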
The read speech data collected from informants at HEROICO are stored
in the HEROICO/Recordings Spanish directory.
The script used to elicit free-response answers contains 143 questions.
The text that was actually presented to the informants is in the file named
questions.txt in the transcripts directory. Data recorded from these prompts
are stored in the HEROICO/Answers Spanish directory.
The human-performed transcriptions of the informants' answers are listed
in the HEROICO-Answers.txt file in the transcripts directory. Again, each
line of this file has two fields separated by a tab. The first field contains
two numbers separated by a slash: the first number is an identification
index for the speaker, and the second is an index to the question. The
second field contains a word-level transcription of the informant's
answer to the question indexed by the second number in the first field. For
example, in the line
    100/10 no ella no tiene barba ni bigote
the text "no ella no tiene barba ni bigote" is a transcription of the response
that speaker 100 gave to question 10. The corresponding waveform is stored in the file
10.wav in the directory HEROICO/Answers Spanish/100.
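The mapping from a transcription line to its waveform file can be sketched in Python as follows. This is a minimal example, not a tool shipped with the corpus: the function names are hypothetical, and the default root directory is copied from the directory name as printed in this document, so it may need adjusting to match the spelling used on the actual distribution media.

```python
from pathlib import PurePosixPath


def parse_answer_line(line):
    """Split 'speaker/question<TAB>transcription' into its three parts."""
    key, transcription = line.rstrip("\r\n").split("\t", 1)
    speaker, question = key.split("/")
    # The indices are kept as strings so they are used exactly as written.
    return speaker, question, transcription


def answer_wav_path(speaker, question, root="HEROICO/Answers Spanish"):
    """Build <root>/<speaker>/<question>.wav (root spelling is an assumption)."""
    return PurePosixPath(root) / speaker / f"{question}.wav"
```

For the example line above, parse_answer_line yields speaker "100" and question "10", and answer_wav_path("100", "10") points at the file 10.wav inside the directory for speaker 100.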
Each speaker in the HEROICO subcorpus attempted to record 100 utterances
by reading 75 sentences and giving 25 free-response answers to questions.
Both native and non-native USMA informants read from the list of 205
simple sentences. The prompts used in the USMA subcorpus are listed in
the file USMA-prompts.txt in the transcripts directory. This file has the
same two-field format as the above transcription files. Some of the USMA
informants wore an additional throat microphone. That data was recorded
in a separate stream and stored in files whose names begin with the letter t.
Data collected at USMA are stored under the USMA directory. The names
of the directories under the USMA directory indicate whether the speaker
was native or non-native. The speaker's native country is also indicated in
the case of native speakers.
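Because throat-microphone recordings are identified only by a file name beginning with the letter t, separating the two streams comes down to a prefix test. The Python sketch below shows one way to do this. The function names and the sample file names are hypothetical, and the sketch assumes that no other recordings in the USMA subcorpus have names starting with t.

```python
from pathlib import Path


def is_throat_recording(path):
    """True if the file name (not the directory part) starts with 't'."""
    return Path(path).name.startswith("t")


def split_by_microphone(paths):
    """Return (throat_microphone_files, other_files), preserving order."""
    throat = [p for p in paths if is_throat_recording(p)]
    other = [p for p in paths if not is_throat_recording(p)]
    return throat, other
```

Only the final path component is tested, so a directory whose name happens to begin with t does not affect the result.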
Speech data was collected at HEROICO using Pentium 450 MHz laptop computers
running Windows 2000, with a 16-bit sample size and a sampling
rate of 22,050 Hz. The recording script presented a visual display of the
sentence to be recorded. The informant pressed a key and spoke the sentence.
The recording was then played back for review, allowing the utterance to be
re-recorded. A member of the data collection team was on hand during the
recording session to verify recordings and provide technical assistance in case
of malfunctioning equipment.
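The recording parameters stated above (16-bit samples at 22,050 Hz) can be checked against a file header with Python's standard wave module. This sketch assumes the waveform files are ordinary RIFF WAV files that the module can read, which this document does not confirm; the function and constant names are hypothetical.

```python
import wave

EXPECTED_RATE_HZ = 22050         # sampling rate stated for the HEROICO data
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def matches_heroico_format(path):
    """Check the sample rate and sample width recorded in a WAV header."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES)


def duration_seconds(path):
    """Length of the recording in seconds, computed from the header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Files that fail the check are not necessarily corrupt. The USMA data was collected in several formats, so these expected values apply only to the HEROICO recordings.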
The data from USMA was collected using several different microphones
and formats. Most of the data was recorded on Pentium computers running
Linux through an m-10 Shure head-mounted microphone. Entropic ESPS
programs were used in most cases, especially when both head-mounted and
throat microphones were used.
Content Copyright
Portions © 2001 United States Military Academy, © 2006 Trustees of the University of Pennsylvania