LDC98S74 - Speech data
LDC98T29 - Transcripts
This corpus contains a portion of the acoustic data designated as
the training set for the 1997 DARPA HUB4 Spanish Benchmark. It
contains speech and transcripts of 30 hours of broadcast news from the
following sources: Televisa, Univision and VOA.
All acoustic files are in NIST SPHERE format, without compression.
The sample data are 16-bit linear PCM, 16-kHz sample frequency, single
channel. Most files contain 30 minutes of recorded material and some
contain 60 or 120 minutes (approximately); the sampling format
requires roughly two megabytes (MB) per minute of recording, so the file
sizes are typically around 60 MB, with some files ranging up to 120 or
240 MB.
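
The size figures above follow directly from the sampling format. As a rough check, the sketch below computes the expected size of an uncompressed file from its duration. The function names are illustrative, and the 1024-byte header size is an assumption (a common size for SPHERE headers), not something stated in this description.

```python
# Rough size estimate for uncompressed 16-bit, 16-kHz, single-channel PCM.
# Assumption: a 1024-byte file header; the description above does not
# state the header size.

def pcm_bytes_per_minute(sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Bytes of raw PCM audio produced by one minute of recording."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60


def estimated_size_mb(minutes, header_bytes=1024):
    """Approximate file size in megabytes (1 MB = 1,000,000 bytes)."""
    total_bytes = header_bytes + minutes * pcm_bytes_per_minute()
    return total_bytes / 1_000_000


if __name__ == "__main__":
    for minutes in (30, 60, 120):
        print(f"{minutes:>3} min: about {estimated_size_mb(minutes):.1f} MB")
```

One minute of audio comes to 1,920,000 bytes, which matches the "roughly two megabytes per minute" figure, and a 30-minute file comes to a little under 60 MB.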
The transcripts are in SGML format, using the same markup
conventions that have been applied to the other 1997 Broadcast News
speech corpora (in English and Mandarin). They are distributed by FTP
rather than on the CD-ROMs that carry the speech data.
There are no updates at this time.
Portions © 1997 Televisa S.A. de C.V., © 1997 Univision Network Limited Partnership, © 1997, 1998 Trustees of the University of Pennsylvania
The Reduced Licensing Fee for this corpus is US$400.