



Fisher English Training Speech Part 1 Transcripts

Item Name: Fisher English Training Speech Part 1 Transcripts
Authors: Christopher Cieri, David Graff, Owen Kimball, Dave Miller, and Kevin Walker
LDC Catalog No.: LDC2004T19
ISBN: 1-58563-314-3
Release Date: Dec 15, 2004
Data Type: text
Project(s): EARS, GALE
Application(s): speech recognition
Language(s): English
Language ID(s): ENG
Distribution: 1 CD
Member fee: $0 for 2004 members
Non-member Fee: US$1000.00
Reduced-License Fee: US$500.00
Extra-Copy Fee: US$150.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Christopher Cieri, et al. 2004. Fisher English Training Speech Part 1 Transcripts. Linguistic Data Consortium, Philadelphia.

Introduction

This page contains information on The Fisher English Corpus Part 1 Transcripts, LDC catalog ID LDC2004T19, ISBN 1-58563-314-3.

This corpus represents the first half of a collection of conversational telephone speech (CTS) that was created at the LDC during 2003. It contains transcript data for 5,850 complete conversations, each lasting up to 10 minutes. In addition to the transcriptions, which are found under the "trans" directory, there is a complete set of tables describing the speakers, the properties of the telephone calls, and the set of topics that were used to initiate the conversations.

Data

Overall, about 12% of the conversations were transcribed at the LDC, and the rest were done by BBN and WordWave using a significantly different approach to the task. A central goal in both sets was to maximize the speed and economy of the transcription process. This in turn involved forgoing certain aspects of mark-up detail and quality control that may have been common in previous, smaller corpora.

The LDC transcripts were based on automatic segmentation of the audio data, to identify the utterance end-points on both channels of each conversation. Given these time stamps, manual transcription was simply a matter of typing in the words for each segment and doing a rudimentary spell-check. No attempt was made to modify the segmentation boundaries manually, or to locate utterances that the segmenter might have missed. Portions of speech where the transcriber could not be sure exactly what was said were marked with double parentheses, " (( ... )) ", and the transcriber could hazard a guess as to what was said, or leave the region between parentheses blank.

The LDC transcription process yields one plain-text transcript file per conversation. The first two lines show the call-ID and the fact that the transcript was done at the LDC. The remainder of the file contains one utterance per line (with blank lines separating the utterances), with the start-time, end-time, speaker/channel-ID and utterance text.
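As an illustration of the file layout described above, here is a minimal Python sketch of a reader for the LDC-style transcripts. The exact field delimiters are not specified on this page, so the sketch assumes that the two header lines begin with "#", that the fields on each utterance line are separated by whitespace, and that the speaker/channel-ID may carry a trailing colon. The names Utterance, parse_transcript and uncertain_regions are hypothetical helpers introduced for this example, not part of any LDC tool.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Utterance(NamedTuple):
    start: float   # start-time in seconds
    end: float     # end-time in seconds
    speaker: str   # speaker/channel-ID, e.g. "A" or "B"
    text: str      # utterance text, possibly containing (( ... )) regions


# Assumed utterance layout: "<start> <end> <speaker>: <text>".
# The colon and the text are both optional in this pattern.
UTTERANCE_RE = re.compile(
    r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\S+?):?(?:\s+(.*))?$"
)

# Double parentheses mark speech the transcriber was unsure about.
UNCERTAIN_RE = re.compile(r"\(\(\s*(.*?)\s*\)\)")


def parse_transcript(lines: Iterable[str]) -> Tuple[List[str], List[Utterance]]:
    """Return (header_lines, utterances) for one transcript file."""
    header: List[str] = []
    utterances: List[Utterance] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines only separate utterances
        if line.startswith("#"):
            # Assumption: header lines (call-ID, transcriber note) start with "#".
            header.append(line.lstrip("#").strip())
            continue
        match = UTTERANCE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized transcript line: {line!r}")
        start, end, speaker, text = match.groups()
        utterances.append(Utterance(float(start), float(end), speaker, text or ""))
    return header, utterances


def uncertain_regions(text: str) -> List[str]:
    """Return the transcriber's guesses inside (( ... )); "" if left blank."""
    return UNCERTAIN_RE.findall(text)
```

A file could then be read with something like `header, utts = parse_transcript(open(path, encoding="utf-8"))`, where `path` points at one file under the "trans" directory. Because the segmentation boundaries were never corrected by hand, the time stamps are best treated as approximate.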

Data collection and transcription were sponsored by DARPA and the U.S. Department of Defense, as part of the EARS project for research and development in automatic speech recognition.

Samples

Please examine this sample to see an example of the data in this corpus.



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.