This is the CD-ROM release of the Callhome Japanese Speech Corpus,
produced by the Linguistic Data Consortium.

This release contains speech data files ONLY, along with the minimal
amount of documentation needed to describe the contents and format of
the speech files, and the software packages needed to uncompress the
speech data.

Other components of the Callhome Japanese Corpus include
transcriptions of the speech, full documentation on the transcription
conventions and format, and complete auditing information on the
speakers represented in the transcripts (including gender, channel
quality, and so on). These text components of the corpus are
obtainable from the LDC via the internet (i.e. using an FTP or World
Wide Web connection to the LDC's ftp/www server).

Shipment of this CD-ROM corpus has been accompanied by E-mail
notification from the LDC to the original recipient of the CD-ROM set,
describing how to obtain the associated text components by electronic
transfer. As modifications are made to the transcriptions and other
text files for this corpus, the LDC will announce the availability of
an updated release of these materials to everyone who has received the
corpus.

If at any time you need to obtain a fresh copy of the text files for
Callhome Japanese, or you want to check the status of the currently
available release version for this material, please send an E-mail
request to:

    ldc@ldc.upenn.edu

Be sure to mention the corpus name, and the name of your organization.

Summary of CD-ROM contents:
---------------------------

0readme.1st         this file

callhome/doc        directory of documentation for Callhome Japanese
                    speech data

callhome/Japanese   path to the speech data files, divided into
                    train, devtest and evltest partitions

shorten             directory of files for Tony Robinson's "shorten"
                    software for speech compression

sphere              directory of files for the NIST SPHERE software
                    package (compression and editing utilities)

Note that the partitioning of speech data into "training",
"development test" and "evaluation test" sets reflects the original
usage of the speech files by participants in the U.S.
Government-sponsored project on Large Vocabulary Conversational Speech
Recognition (LVCSR). As of this release, there are 80 conversations in
the training set, 20 in the development test set, and 20 in the
evaluation test set. Additional (new) sets of 20 evaluation test calls
will be released as the benchmark tests are carried out for this
project.
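
As a quick check that a copy of the disc is complete, the number of
files in each partition can be counted and compared with the figures
given above. The following is a minimal Python sketch, not part of the
LDC distribution. It assumes that the partition directories are named
train, devtest and evltest directly under callhome/Japanese, and that
the path is given relative to wherever the CD-ROM is mounted. The
function name count_speech_files is hypothetical. Whether each
conversation is stored as exactly one file is also an assumption; see
the documentation in callhome/doc for the actual file layout.

```python
import os

# Partition directory names as listed in the contents summary.
PARTITIONS = ("train", "devtest", "evltest")


def count_speech_files(speech_root):
    """Return a dict mapping each partition name to its file count.

    speech_root is the path to the speech data directory
    (e.g. <mount point>/callhome/Japanese). A partition directory
    that is missing is reported with a count of 0.
    """
    counts = {part: 0 for part in PARTITIONS}
    for part in PARTITIONS:
        part_dir = os.path.join(speech_root, part)
        if not os.path.isdir(part_dir):
            continue
        # Walk the partition recursively and count every regular file.
        for _dirpath, _dirnames, filenames in os.walk(part_dir):
            counts[part] += len(filenames)
    return counts


if __name__ == "__main__":
    import sys

    # Default path is an assumption; pass the real location as argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "callhome/Japanese"
    for part, n in count_speech_files(root).items():
        print(f"{part}: {n}")
```

If the one-file-per-conversation assumption holds, the counts reported
for this release should match the 80, 20 and 20 conversations stated
above. Some systems present CD-ROM file names in a different letter
case, so the path may need adjusting.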