Introduction
The text component of the CALLHOME
English package includes transcripts and documentation files for
120 unscripted telephone conversations between native speakers of
English; a separate catalog entry (LDC97S42)
provides the speech data for these conversations, which are
partitioned into separate subdirectories for "training" (80
conversations), "development test set" (20 conversations) and
"evaluation test set" (20 conversations).
Data
The transcripts cover a contiguous ten-minute segment of each call in
the training and development test sets, and a five-minute segment of each
call in the evaluation test set, yielding a total of 18.3 hours of
transcribed spontaneous speech, comprising about 230,000 words. The
transcripts are timestamped by speaker turn for alignment with the
speech signal and are provided in standard orthography.
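The 18.3-hour total follows directly from the conversation counts and
segment lengths given above. The short Python sketch below reproduces that
arithmetic; the names SEGMENTS and total_transcribed_hours are illustrative
only and are not part of the corpus or its documentation.

```python
# Conversation counts and transcribed minutes per call, as stated in the text.
# The dictionary and function names are hypothetical, chosen for illustration.
SEGMENTS = {
    "training": (80, 10),
    "development test": (20, 10),
    "evaluation test": (20, 5),
}


def total_transcribed_hours(segments):
    """Sum count * minutes over all partitions and convert to hours."""
    total_minutes = sum(count * minutes for count, minutes in segments.values())
    return total_minutes / 60


hours = total_transcribed_hours(SEGMENTS)
print(f"{hours:.1f} hours")  # prints "18.3 hours"
```

The three partitions contribute 800, 200 and 100 minutes respectively, for
1,100 minutes in all.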
In addition to transcript files, this corpus contains full
documentation on the transcription conventions and format. Complete
auditing information on the speakers represented in the transcripts
(including gender, channel quality and so on) is also included.
This corpus is distributed through the LDC's FTP server.
The corpus of telephone speech (LDC97S42) is
available separately, as well as an associated lexicon (LDC97L20).
Updates
There are no updates at this time.