Introduction
This file contains documentation for Voice of America (VOA) Czech Broadcast News Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2000T53 and ISBN
1-58563-180-9. The transcripts described below correspond to Voice of America (VOA) Czech Broadcast News Audio, LDC2000S89 and ISBN 1-58563-179-5.
Support for this work was provided by the Ministry of Education of the Czech
Republic (Grant No. VS97159); by the Ministry of Education of the Czech
Republic (Project ME293); and by the NSF Language Engineering Workshop at the
Johns Hopkins University, Baltimore, MD USA (NSF Grant No. IIS-9820687).
Data
Between February 9 and May 28, 1999, the Linguistic Data Consortium
collected approximately 30 hours of Czech broadcast audio from the Voice of America
news service. The 62 data files presented in this corpus represent the
transcripts of the daily broadcasts of 30-minute news programs.
The transcriptions were created by native Czech speakers, Pavel Ircing,
Jindrich Matousek, Ludek Muller, and Vlasta Radova, working at the Department
of Cybernetics, University of West Bohemia (UWB) in Pilsen under the direction
of Josef Psutka. They used transcription software provided by the LDC (the
"transcriber" package), developed by Edouard Geoffrois and Claude Barras at DGA,
France, with assistance from Zhibiao Wu at the LDC. The package is currently
available from the LDC web site: www.ldc.upenn.edu.
The version of transcriber used for this project produced a text file format
which is no longer supported by the current version of the software; also, the
format does not resemble any previous transcription format published by the
LDC. It was therefore decided to transform the files created at UWB into an
SGML format that has been used previously for other broadcast news
transcription corpora.
The transcript files are presented here in a format that was defined by the
speech group at NIST, who refer to it as the "Universal Transcription Format"
(UTF -- not to be confused with the "Unicode Transformation Formats"). A
separate description of the UTF SGML format is provided in the files "utf.ps"
(Postscript) and "utf.pdf" (Adobe Acrobat), and the formal SGML definition is
provided in "utf.dtd," all in the "doc" directory. A useful summary of the
format, along with additional information about its application to the VOA
Czech transcripts, is provided below.
The transcription text is rendered using the ISO 8859-2 character set.
Information relating this character set to the Unicode standard is available
from the Unicode Consortium.
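For readers who want to work with the transcripts in a Unicode environment, the following minimal sketch shows one way to convert ISO 8859-2 bytes to UTF-8 using Python's built-in codecs. The function name and the sample byte string are illustrative assumptions and are not taken from the corpus or its tools.

```python
def latin2_to_utf8(raw: bytes) -> bytes:
    """Decode ISO 8859-2 (Latin-2) bytes and re-encode them as UTF-8.

    Hypothetical helper for illustration; not part of the corpus tools.
    """
    return raw.decode("iso-8859-2").encode("utf-8")


# Illustrative sample: the Czech city name "Plzeň" (Pilsen) as ISO 8859-2.
# In ISO 8859-2 the byte 0xF2 encodes "ň" (U+0148).
sample = b"Plze\xf2"

text = sample.decode("iso-8859-2")   # Unicode string
utf8_bytes = latin2_to_utf8(sample)  # same text as UTF-8 bytes
```

A whole transcript file could be read the same way by passing encoding="iso-8859-2" to Python's open(), assuming the file has not already been converted.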
Due to technical limitations in the hardware at LDC that was used to receive
the VOA broadcasts via a satellite downlink, a number of files contain brief
portions where the audio signal was interrupted. These interruptions typically
yielded regions of complete silence that lasted less than two seconds and were
scattered sparsely throughout an affected audio file. Additional markup was
provided in the transcription texts to isolate the regions where these
interruptions occurred.
An example transcript is available as LDC2000T53.sample.
Updates
There are no updates at this time.
Copyright
Portions © 2000 Trustees of the University of Pennsylvania