Introduction
This page contains information on The Fisher English Corpus Part 1 Transcripts, LDC catalog ID LDC2004T19, ISBN 1-58563-314-3.
This corpus represents the first half of a collection of
conversational telephone speech (CTS) that was created at the LDC
during 2003. It contains transcript data for 5,850 complete
conversations, each lasting up to 10 minutes. In addition to the
transcriptions, which are found under the "trans" directory, there is
a complete set of tables describing the speakers, the properties of
the telephone calls, and the set of topics that were used to initiate
the conversations.
Data
Overall, about 12% of the conversations were transcribed at the LDC,
and the rest were done by BBN and WordWave using a significantly
different approach to the task. A central goal in both sets was to
maximize the speed and economy of the transcription process. This
in turn involved forgoing certain aspects of mark-up detail and quality
control that may have been common in previous, smaller corpora.
The LDC transcripts were based on automatic segmentation of the audio
data, to identify the utterance end-points on both channels of each
conversation. Given these time stamps, manual transcription was
simply a matter of typing in the words for each segment and doing a
rudimentary spell-check. No attempt was made to modify the
segmentation boundaries manually, or to locate utterances that the
segmenter might have missed. Portions of speech where the transcriber
could not be sure exactly what was said were marked with double
parentheses -- " (( ... )) " -- and the transcriber could hazard a
guess as to what was said, or leave the region between parentheses
blank. The LDC transcription process yields one plain-text transcript
file per conversation, in which the first two lines show the call-ID
and the fact that the transcript was done at the LDC; the remainder of
the file contains one utterance per line (with blank lines separating
the utterances), with the start-time, end-time, speaker/channel-ID and
utterance text.
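The file layout described above can be read with a short script. The following is a minimal sketch in Python, not an official LDC tool. It assumes that each utterance line has the form "<start> <end> <speaker>: <text>" with times in seconds, and that the two header lines are simply the first two lines of the file; the exact field separators are an assumption, and the call-ID and sample text below are hypothetical. The names parse_transcript, uncertain_regions and Utterance are made up for this illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class Utterance(NamedTuple):
    start: float   # start-time of the segment
    end: float     # end-time of the segment
    speaker: str   # speaker/channel-ID
    text: str      # utterance text as typed by the transcriber


# Assumed line layout: "<start> <end> <speaker>: <text>"
UTTERANCE_RE = re.compile(
    r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\S+?):\s*(.*)$"
)
# Uncertain regions are wrapped in double parentheses and may be blank.
UNCERTAIN_RE = re.compile(r"\(\(\s*(.*?)\s*\)\)")


def parse_transcript(text: str) -> Tuple[List[str], List[Utterance]]:
    """Split a transcript into its two header lines and its utterances."""
    lines = text.splitlines()
    header = lines[:2]
    utterances = []
    for line in lines[2:]:
        line = line.strip()
        if not line:
            continue  # blank lines only separate utterances
        match = UTTERANCE_RE.match(line)
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        start, end, speaker, words = match.groups()
        utterances.append(Utterance(float(start), float(end), speaker, words))
    return header, utterances


def uncertain_regions(utterance_text: str) -> List[str]:
    """Return the transcriber's guesses inside (( ... )); '' if left blank."""
    return UNCERTAIN_RE.findall(utterance_text)


# Hypothetical sample, invented for illustration only.
sample = """# call_0001
# Transcribed at the LDC

0.52 1.95 A: hello

1.80 3.10 B: hi (( how are )) you
"""

header, utterances = parse_transcript(sample)
```

Lines that do not match the assumed layout are skipped rather than raising an error, so the sketch should be checked against real transcript files before it is relied on.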
Data collection and transcription were sponsored by DARPA and the
U.S. Department of Defense, as part of the EARS project for research
and development in automatic speech recognition.
Samples
Please examine this sample to see an example of the data in this corpus.