Multi-Language Telephone Speech Corpus Distribution

release January, 1994

Copyright 1994
Center for Spoken Language Understanding
Oregon Graduate Institute of Science & Technology

This disc contains the following files and directories:

  1. readme.txt - this text file
  2. overview.txt - file describing the OGI Multi-language Telephone Speech Corpus in detail
  3. calls/ - A directory containing the directories for each language, which in turn contain the speech files for the respective languages. Therefore, the directory structure for calls/ will look like the following:
    english/  french/  hindi/     korean/    spanish/   vietnam/
    farsi/    german/  japanese/  mandarin/  tamil/
    
    The speech files are organized according to a call number system (see 4 below, i.e. data.doc). For ease of file handling, the files are divided in groups of 10 per directory. For example, the farsi/ directory contains the directories:
    00/     02/     04/     06/     08/     10/     12/     14/
    01/     03/     05/     07/     09/     11/     13/     15/
    
    Directory 00 contains calls identified by call numbers 0-9; directory 15 contains calls identified by call numbers 150-159; etc.
  4. doc/ - directory containing the following documentation files:
  5. seglola/ - directory containing broad phonetic transcriptions (i.e. .seg files --- see overview.txt and doc/formats.doc for more information on these files).
  6. logs/ - directory containing logfiles of corpus development, Phase I, which consisted of preliminary verification, chopping, evaluation and broad phonetic transcriptions of each utterance
  7. logs2/ - directory containing logfiles of corpus development, Phase II, consisting of verification and evaluation of calls by native speakers of each individual language
  8. trn_test/ - directory containing files describing the training, development and test sets used by Yeshwant Muthusamy for his Ph.D. Thesis research.
  9. sphere/ - directory containing files needed to uncompress the .wav files

PLEASE NOTE:

This publication of the OGI Multi-Language Telephone Speech Corpus, produced on CD-ROM by the Linguistic Data Consortium, contains a few minor modifications relative to the version distributed on tape by OGI. To begin with, directory and file names have been simplified where necessary to conform to ISO 9660 conventions for file naming. In addition, we have included the more current SPHERE package (version 2.0) from NIST, and have applied a more effective waveform compression algorithm (the "shorten" compression method developed by Tony Robinson of Cambridge University, as implemented in the current release of SPHERE). In performing this conversion of the waveform data, we also supplemented the information in each file's SPHERE header to include common header fields that were missing from the original files (sample min & max, sample coding). Relevant changes to the various log and documentation files have been made as necessary.