This file contains documentation on West Point Croatian Speech, Linguistic
Data Consortium (LDC) catalog number LDC2005S28 and ISBN 1-58563-359-3.
West Point Croatian Speech is a database of digital recordings of spoken
Croatian. It was collected by staff and faculty of the Department of Foreign
Languages (DFL) and the Center for Technology Enhanced Language Learning
(CTELL) to develop acoustic models for speech recognition systems. The US
government uses these systems to provide speech-recognition-enhanced language
learning courseware to government linguists and students enrolled in various
government language programs. In addition, parts of this corpus were designed
to model question-answer dialogues for use in domain-specific
speech-to-speech translation systems.
The corpus consists of two subcorpora collected in 2000 and 2001 in Zagreb,
Croatia. Informants were recruited from the English department at the
University of Zagreb and the Croatian Military Academy. The 2000 subcorpus
consists entirely of read speech, while the 2001 subcorpus includes
free-response answers to questions in addition to read speech.
The read speech in the two subcorpora was elicited from two different
prompt scripts. Each informant in 2000 attempted to read 100 sentences from
a total of 200 carefully designed sentences. These sentences were written
by Christine Tomei. Dr. Tomei's design analysis can be found in the file
design-2000.txt. Informants in 2001 read short text passages extracted from
Croatian language webpages. Together, the scripts used to record read speech
contain a total of 6,329 distinct sentences. The read speech prompts are listed
in the files read-200.txt in the transcripts directory. Each line of these
files has two fields separated by a tab: the first gives the base name of the
waveform file, and the second gives the prompt used in recording the utterance.
The read speech data are stored under the Recordings Croatian directory.
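The two-field prompt format described above can be read with a few lines of
Python. This is only a minimal sketch: the function name parse_prompts is
hypothetical, and the text encoding of the prompt files is not stated in this
documentation, so the caller has to choose one when opening the file.

```python
from typing import Dict, Iterable


def parse_prompts(lines: Iterable[str]) -> Dict[str, str]:
    """Map waveform base names to prompt text.

    Assumes each non-blank line holds two tab-separated fields:
    the waveform base name, then the prompt.
    """
    prompts: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        base_name, sep, prompt = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        prompts[base_name.strip()] = prompt.strip()
    return prompts
```

Because the function accepts any iterable of lines, an open file object can be
passed to it directly.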
The script used to elicit free response answers contains 143 questions.
The text that was actually presented to the informants is in the file named
questions.txt in the transcripts directory. Data recorded from these prompts
are stored in the Answers Croatian directory.
The human-produced transcriptions of the informants' answers are listed
in the answers.txt file in the transcripts directory. Again, each line of this
file has two fields separated by a tab. The first field contains two numbers
separated by a slash: the first number is an identification index for the
speaker, and the second is an index to the question. The second field
contains a word-level transcription of the informant's answer to
the question indexed by the second number in the first field. So, for example,
in the line:
1/15 eh roena je u splitu
"eh roena je u splitu" is the transcription of the response that speaker 1
gave to question 15. The corresponding waveform is stored in the file 15.wav in
the directory Answers Croatian1.
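A line of answers.txt in this format might be split into its parts as follows.
This is a sketch under stated assumptions: the names Answer, parse_answer_line,
and wav_file_name are hypothetical, and only the waveform file name is derived
from the question index; the directory layout is not reconstructed here.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    speaker: int        # identification index of the speaker
    question: int       # index of the question being answered
    transcription: str  # word-level transcription of the answer


def parse_answer_line(line: str) -> Answer:
    """Parse one 'speaker/question<TAB>transcription' line."""
    key, tab, text = line.rstrip("\r\n").partition("\t")
    if not tab:
        raise ValueError(f"expected a tab-separated line, got: {line!r}")
    speaker, slash, question = key.strip().partition("/")
    if not slash:
        raise ValueError(f"expected 'speaker/question', got: {key!r}")
    return Answer(int(speaker), int(question), text.strip())


def wav_file_name(answer: Answer) -> str:
    """Waveform file name for an answer, e.g. question 15 gives '15.wav'."""
    return f"{answer.question}.wav"
```

For the example line above, the parser yields speaker 1, question 15, and the
transcription text, and wav_file_name gives 15.wav.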
These recordings were transcribed by Milan Sokolich. Mr. Sokolich also
wrote a pronouncing dictionary that includes grammatical tags. His work
is stored in the file named raw-lexicon.txt. The file lexicon.txt contains a
processed version of the raw-lexicon.txt file.
Each speaker in the 2001 subcorpus attempted to record 105 utterances
by reading 75 sentences and giving 35 free response answers to 35 questions.
Speech data were collected using Pentium 450 MHz laptop computers running
Windows 2000. Recordings use 16-bit samples at a sampling rate of 22,050 Hz.
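The documented sample size and sampling rate can be checked against a
recording with Python's standard wave module. This sketch assumes the
recordings are ordinary PCM WAV files that the wave module can open; the
function name matches_documented_format is hypothetical, and the number of
channels is not checked because it is not stated here.

```python
import wave

EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
EXPECTED_FRAME_RATE_HZ = 22050   # sampling rate of 22,050 Hz


def matches_documented_format(path: str) -> bool:
    """Return True if the WAV file has 16-bit samples at 22,050 Hz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getframerate() == EXPECTED_FRAME_RATE_HZ
        )
```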
The recording script presented a visual display of the sentence to be recorded.
The informant pressed a key and spoke the sentence. The recording was then
played back for review, and the utterance could be re-recorded if necessary. A
member of the data collection team was on hand during the recording session to
verify recordings and provide technical assistance in case of malfunctions.
Portions © 2000-2001
United States Military Academy, © 2005 Trustees of the University of Pennsylvania