| The CALLHOME Spanish corpus
of telephone speech consists of 120 unscripted telephone conversations
between native speakers of Spanish.
All calls, which lasted up to 30 minutes, originated in North America
and were placed to international locations. Most participants called
family members or close friends.
This corpus contains speech data files ONLY, along with the minimal
amount of documentation needed to describe the contents and format of
the speech files and the software packages needed to uncompress the
speech data. The transcripts and documentation (LDC96T17) are
available separately, as is an associated lexicon (LDC96L16).
Updates
The "shorten" and "sphere" directories have been removed.
The sphere directory contained NIST "SPeech HEader REsources" (SPHERE):
C-language source code libraries and utilities for manipulating NIST
SPHERE-format waveform files.
The shorten directory contained files for Tony Robinson's "shorten"
software for speech compression.
A more recent version of the SPHERE utilities is now available on the
NIST web site; additional utilities for
converting from SPHERE to other waveform file formats are also available
at the LDC web site.
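For readers who only need to inspect a file before installing the SPHERE utilities, the following is a minimal sketch of reading a SPHERE header in Python. It assumes the usual SPHERE layout: a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" fields terminated by "end_head". The function name parse_sphere_header and the sample field values are hypothetical, chosen for illustration. The sketch reads header fields only; it does not decompress shorten-encoded sample data and is not a replacement for w_decode.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file (sketch)."""
    # Line 1 is the 8-byte magic string, line 2 is the 8-byte header size.
    if data[:8] != b"NIST_1A\n":
        raise ValueError("missing NIST_1A magic line")
    header_size = int(data[8:16].decode("ascii").strip())

    text = data[16:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_code, value = parts
        if type_code == "-i":            # integer field
            fields[name] = int(value)
        elif type_code == "-r":          # real-valued field
            fields[name] = float(value)
        elif type_code.startswith("-s"): # string field, e.g. -s4
            fields[name] = value
    return fields


# Illustrative header built in memory (values are assumptions, not corpus data).
sample = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 8000\n"
    b"channel_count -i 2\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024, b" ")

print(parse_sphere_header(sample))
```

For a file on disk, the first 1024 bytes read in binary mode would normally be passed in; files with a larger header need more bytes, as indicated by the size on the second line.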
10.10.2003: It has been brought to our attention that 16 SPHERE files (from both the train and devtest
directories) were corrupted; the problem becomes apparent when trying to decompress the files using the w_decode
utility. Corrected versions of these files are now available on a third CD-ROM, which contains the 16 speech files
and a readme.txt listing the contents of the disc. If you purchased the corpus, please request the CD by
writing to ldc@ldc.upenn.edu. New orders will receive the two
CDs along with the third disc containing the corrected files.
Content Copyright
Portions © 1996 Trustees of the University of Pennsylvania |