Introduction
West Point Arabic Speech was produced by the Linguistic Data Consortium (LDC)
under catalog number LDC2002S02 and ISBN 1-58563-199-x.
West Point Arabic Speech contains speech data that was
collected and processed by members of the Department of Foreign
Languages at the United States Military Academy at West Point and the
Center for Technology Enhanced Language Learning (CTELL) as part of
an effort called "Project Santiago." The original purpose of this
corpus was to train acoustic models for automatic speech recognition
that could be used as an aid in teaching Arabic to West Point cadets.
Data
The corpus consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42
hours of speech data. Each speech file represents one person reciting one
prompt from one of four prompt scripts. The utterances were recorded using a
Shure SM10A microphone and a RANE Model MS1 pre-amplifier. The files were
recorded as 16-bit PCM low-byte-first ("little-endian") raw audio files, with
a sampling rate of 22.05 kHz. They were then converted to NIST SPHERE format.
Approximately 7,200 of the recordings are from native informants and approximately
1,200 are from non-native informants. The following tables show the breakdown
of corpus content in terms of male, female, native and non-native speakers.
number of speakers
| | male | female | total |
| native: | 41 | 34 | 75 |
| non-native: | 25 | 10 | 35 |
| totals: | 66 | 44 | 110 |
hours of data
| | male | female | total |
| native: | 6.0 | 4.4 | 10.4 |
| non-native: | 0.74 | 0.28 | 1.02 |
| totals: | 6.74 | 4.68 | 11.42 |
megabytes of data
| | male | female | total |
| native: | 918 | 667 | 1585 |
| non-native: | 111.9 | 42.8 | 154.7 |
| totals: | 1029.9 | 709.8 | 1739.7 |
number of speech files
| | male | female | total |
| native: | 4107 | 3163 | 7270 |
| non-native: | 883 | 363 | 1246 |
| totals: | 4990 | 3526 | 8516 |
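The recording format described above (16-bit little-endian PCM at 22.05 kHz) fixes the relationship between the size of the sample data and its duration. The following is a minimal Python sketch of that relationship, not part of the corpus distribution. It assumes single-channel audio, which this description does not state, and it operates on headerless sample bytes only; it does not parse the NIST SPHERE header. The function and constant names are hypothetical.

```python
import struct
from typing import List

SAMPLE_RATE_HZ = 22050   # 22.05 kHz, as stated in the corpus description
BYTES_PER_SAMPLE = 2     # 16-bit PCM
NUM_CHANNELS = 1         # ASSUMPTION: mono; not stated in the description


def decode_pcm16le(raw: bytes) -> List[int]:
    """Decode headerless 16-bit little-endian PCM bytes into signed integers."""
    count = len(raw) // BYTES_PER_SAMPLE
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return list(struct.unpack(f"<{count}h", raw[: count * BYTES_PER_SAMPLE]))


def duration_seconds(num_sample_bytes: int) -> float:
    """Duration of a block of sample data under the assumptions above."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * NUM_CHANNELS
    return num_sample_bytes / bytes_per_second
```

Under the mono assumption, one second of audio occupies 44,100 bytes, so 11.42 hours corresponds to roughly 1.8 billion bytes of sample data, which is broadly consistent with the size totals in the tables.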
Some of the recording sessions include a handful of utterances that were
cut short due to pronunciation mistakes or unexpected interruptions
(e.g. phones ringing or doors slamming). These partial utterances have been
retained in the waveform directories and are distinguished from the
full-sentence recordings by having a trailing "-u" in the filename, before the
extension (e.g. "s1_080-u.sph" instead of "s1_080.sph"). The above tables
describe all data; both the complete and partial utterances are accounted for.
Of the 8,516 speech files, 168 are partial utterances and the remaining 8,348 are
complete.
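The "-u" naming convention makes it straightforward to separate partial from complete utterances by filename alone. A minimal Python sketch follows; the function names are hypothetical, and the only rule it relies on is the one described above (a trailing "-u" before the extension).

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def is_partial_utterance(filename: str) -> bool:
    """True if the name ends in "-u" before its extension, e.g. "s1_080-u.sph"."""
    return PurePosixPath(filename).stem.endswith("-u")


def split_utterances(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (complete, partial) lists, preserving the input order."""
    complete: List[str] = []
    partial: List[str] = []
    for name in filenames:
        (partial if is_partial_utterance(name) else complete).append(name)
    return complete, partial
```

For example, `split_utterances(["s1_080.sph", "s1_080-u.sph"])` places the first name in the complete list and the second in the partial list.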
Updates
There are no updates at this time.
Content Copyright
Portions © 2002 United States Military Academy, © 2002 Trustees of the University of Pennsylvania
The SANTIAGO Arabic corpus was developed at the United States Military
Academy. All information contained herein is the sole and exclusive property of
the United States Military Academy.