LDC98S71 - Speech data
LDC98T28 - Transcripts
Introduction
This set of 3 DVD-ROMs contains a total of 97 hours of recordings from radio and
television news broadcasts, gathered between June 1997 and February 1998. It has
been prepared to serve as a supplement to the 1996 Broadcast News Speech
collection (consisting of over 100 hours of similar recordings). The primary
motivation for this collection is to provide additional training data for the DARPA
"HUB4" Project on continuous speech recognition in the broadcast domain.
Data
Transcripts have been made of all recordings in this publication. They are manually
time-aligned at the phrasal level and annotated to identify boundaries between news
stories, speaker turn boundaries, and the gender of each speaker. The transcription
conventions are described in the file "transcrp.doc"; please note that this file
describes the transcription methods by reference to text formatting conventions
used internally by the LDC during the transcription process. The released version
of the transcripts is in SGML format, comparable to the format used in the 1996
Broadcast News Speech transcriptions. Accompanying documentation and an SGML DTD
file are included with the transcription release.
Updates
There are no updates at this time.
Copyright
Pricing
The Reduced Licensing Fee for this corpus is US$600.