Multi-Language Telephone Speech Corpus Distribution
release January, 1994
Copyright 1994
Center for Spoken Language Understanding
Oregon Graduate Institute of Science & Technology
This disc contains the following files and directories:
- readme.txt - this text file
- overview.txt - file describing the OGI
Multi-language Telephone Speech Corpus in detail
- calls/ - directory containing one subdirectory per language, each
of which holds the speech files for that language. The directory
structure of calls/ is as follows:
english/ french/ hindi/ korean/ spanish/ vietnam/
farsi/ german/ japanese/ mandarin/ tamil/
The speech files are organized according to a call number
system (see doc/data.doc, listed below). For ease of file
handling, the calls are grouped ten per directory.
For example, the farsi/ directory contains the directories:
00/ 02/ 04/ 06/ 08/ 10/ 12/ 14/
01/ 03/ 05/ 07/ 09/ 11/ 13/ 15/
Directory 00 contains calls identified by call numbers 0-9;
directory 15 contains calls identified by call numbers
150-159; etc.
- doc/ - directory containing the following documentation files:
- data.doc -- file containing the conventions used for naming
the data files
- formats.doc -- file containing documentation on the speech
(.wav) and transcription (.ptlola) file formats
- header.doc -- file giving details of the NIST SPHERE header
structure
- ph1_logs.doc -- file describing the contents of the .log files
created during Phase I of the development process
- ph2_logs.doc -- file describing the contents of the .log2 files
created during Phase II of the development process
- mltlngdb.ps -- PostScript file containing the article:
"The OGI Multi-language Telephone Speech Corpus,"
Y. K. Muthusamy, R. A. Cole and B. T. Oshika, Proceedings of
the International Conference on Spoken Language Processing,
Banff, Alberta, Canada, October 1992.
- seglola/ - directory containing broad phonetic transcriptions (i.e.
.seg files --- see overview.txt and doc/formats.doc for more
information on these files).
- logs/ - directory containing logfiles of corpus development, Phase
I, which consisted of preliminary verification, chopping,
evaluation and broad phonetic transcriptions of each utterance
- logs2/ - directory containing logfiles of corpus development, Phase
II, consisting of verification and evaluation of calls by
native speakers of each individual language
- trn_test/ - directory containing files describing the training,
development and test sets used by Yeshwant Muthusamy for his
Ph.D. thesis research
- sphere/ - directory containing files needed to uncompress the .wav
files
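The call-number grouping described for calls/ above can be sketched in
a few lines of Python. This is only an illustration, not part of the
distribution: the function name call_directory is hypothetical, and the
two-digit zero padding is an assumption inferred from the farsi/
listing (00/ through 15/). See doc/data.doc for the actual naming
conventions.

```python
from pathlib import PurePosixPath


def call_directory(language: str, call_number: int) -> PurePosixPath:
    """Return the calls/ subdirectory expected to hold a given call.

    Assumption: calls are grouped ten per directory and directory names
    are zero-padded to two digits, as in the farsi/ example (00/ .. 15/).
    """
    if call_number < 0:
        raise ValueError("call number must be non-negative")
    group = call_number // 10  # calls 0-9 -> 00, 150-159 -> 15
    return PurePosixPath("calls") / language / f"{group:02d}"


print(call_directory("farsi", 7))
print(call_directory("farsi", 153))
```

Under these assumptions, call 7 maps to calls/farsi/00 and call 153
maps to calls/farsi/15, matching the examples given above.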
PLEASE NOTE:
This publication of the OGI Multi-Language Telephone Speech Corpus,
produced on CD-ROM by the Linguistic Data Consortium, contains a few
minor modifications relative to the version distributed on tape by
OGI. To begin with, directory and file names have been simplified
where necessary to conform to ISO 9660 conventions for file naming.
In addition, we have included the more current SPHERE package (version
2.0) from NIST, and have applied a more effective waveform compression
algorithm (the "shorten" compression method developed by Tony Robinson
of Cambridge University, as implemented in the current release of
SPHERE). In performing this conversion of the waveform data, we also
supplemented the information in each file's SPHERE header to include
common header fields that were missing from the original files (sample
min & max, sample coding). Relevant changes to the various log and
documentation files have been made as necessary.
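The supplemented header fields mentioned above can be inspected without
the SPHERE tools, since a SPHERE header is plain text. The following is
a minimal sketch, assuming the standard NIST SPHERE layout (a "NIST_1A"
line, a header-size line, "name -type value" lines, and a closing
"end_head"). The function name parse_sphere_header is hypothetical; it
reads the header only and does not decompress the waveform data. See
doc/header.doc for the authoritative description.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the text header at the start of a NIST SPHERE file.

    Assumes the standard layout: "NIST_1A", then the header size in
    bytes, then one "name -type value" field per line, then "end_head".
    """
    first = data.split(b"\n", 2)
    if len(first) < 3 or first[0] != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(first[1].strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip anything that is not "name -type value"
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field, typed as -sN
            fields[name] = value
    return fields
```

A typical use would be to read the first few kilobytes of a .wav file in
binary mode, pass them to this function, and look up fields such as
sample_min, sample_max and sample_coding in the returned dictionary.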