Introduction
Czech Broadcast News MDE Transcripts, Linguistic
Data Consortium (LDC) catalog number LDC2010T02 and isbn 1-58563-534-0, was
prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic.
It consists of metadata extraction (MDE) annotations for the approximately 26
hours of transcribed broadcast news speech in
Czech Broadcast News Transcripts (LDC2004T01). The audio files corresponding
to the transcripts in this corpus are contained in
Czech Broadcast News Speech (LDC2004S01). Czech Broadcast News MDE Transcripts
joins LDC's other holdings of Czech broadcast data:
Czech Broadcast Conversation Speech (LDC2009S02),
Czech Broadcast Conversation MDE Transcripts (LDC2009T20),
Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89) and
Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).
The audio recordings were collected from February 1,
2000 through April 22, 2000 from three Czech radio stations (Cesky rozhlas 1
Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2 and Cesky rozhlas 3 Vlatva
- CRo3) and two television stations (Ceska televize - CTV and Prima TV).
The broadcasts included both public and commercial subjects and were presented
in various styles, ranging from a formal style to a colloquial style more typical
for commercial broadcast companies that do not primarily focus on news.
The goal of MDE research is to take raw speech recognition output and refine
it into forms that are of more use to humans and to downstream automatic processes.
In simple terms, this means the creation of automatic transcripts that are maximally
readable. This readability might be achieved in a number of ways: removing non-content
words like filled pauses and discourse markers from the text; removing sections
of disfluent speech; and creating boundaries between natural breakpoints in
the flow of speech so that each sentence or other meaningful unit of speech
might be presented on a separate line within the resulting transcript. Natural
capitalization, punctuation, standardized spelling and sensible conventions
for representing speaker turns and identity are further elements in the readable
transcript.
The transcripts and annotations in this corpus are stored in two formats: QAn
(Quick Annotator), and RTTM. Character encoding in all files is ISO-8859-2.
More information can be found on the website Structural
Metadata Annotation for Czech.
Sponsorship
The completion of this corpus was facilitated by funding provided by
the Ministry of Education of the Czech Republic under projects No.
2C06020 and ME909.
Samples
- Quick Annotator Transcript
- RTTM Annotation
Content Copyright
Portions © 2000 Ceska televize, © 2000 Cesky rozhlas 1 Radiozurnal,
© 2000 Cesky rohlas 2 Praha, © 2000 Cesky rozhlas 3 Vlatva, ©
2000 FTV Primiera, © 2004, 2010 Trustees of the University of Pennsylvania |