LDC98S73 - Speech data
LDC98T24 - Transcripts
This collection consists of 30 hours of recorded broadcasts and
transcripts that have been drawn from the following sources:
Voice of America (VOA): United States Information Agency
People's Republic of China Television (CCTV)
KAZN-AM: Commercial radio based in Los Angeles, CA
Of these three sources, the first two make up the bulk of the
collection and are represented in roughly equal amounts. Only a
small sample of KAZN-AM recordings is included, owing to the high
proportion of unusable material (commercials, local traffic reports
loaded with California place names, etc.).
The transcripts were created by native speakers of Mandarin working
at the LDC. They are GB-encoded, with SGML tagging to identify story
boundaries, speaker turn boundaries, and phrasal pauses; these tags
include time stamps that align the text with the speech data. Word
segmentation (white space between words) is included. A working DTD
is provided, and the markup is consistent with that of the 1997
English and Spanish HUB4 collections.
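
As a rough illustration of how GB-encoded, SGML-tagged, word-segmented
transcript text might be read, here is a minimal Python sketch. The tag
and attribute names in the sample (turn, time, speaker, startTime,
endTime, sec) are assumptions made for illustration only; the DTD
shipped with the corpus is the authority on the actual markup. The
sample text is invented, and the function names are hypothetical.

```python
import re

# Opening tag: captures the tag name and the raw attribute string.
TAG_RE = re.compile(r"<(\w+)((?:\s+\w+=(?:\"[^\"]*\"|[^\s>]+))*)\s*>")
# One attribute: name, then either a quoted or an unquoted value.
ATTR_RE = re.compile(r"(\w+)=(?:\"([^\"]*)\"|([^\s>]+))")


def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes; gb18030 is a superset of GB2312."""
    return raw.decode("gb18030")


def iter_tags(text: str):
    """Yield (tag_name, attributes) for every opening tag in the text."""
    for match in TAG_RE.finditer(text):
        attrs = {name: quoted or bare
                 for name, quoted, bare in ATTR_RE.findall(match.group(2))}
        yield match.group(1), attrs


def words(text: str):
    """Drop all tags, then split on the white space between words."""
    return re.sub(r"<[^>]+>", " ", text).split()


# Invented sample; tag and attribute names are assumed, not from the DTD.
sample = (
    '<turn speaker="A" startTime=12.5 endTime=15.0>\n'
    "今天 天气 很 好 <time sec=13.75> 谢谢\n"
    "</turn>\n"
).encode("gb2312")

text = decode_gb(sample)
tags = list(iter_tags(text))
tokens = words(text)
```

Because the time stamps are carried as tag attributes, converting a
value such as float(attrs["startTime"]) would give an offset that can
be matched against the audio, assuming the stamps are in seconds.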
There are no updates at this time.
Portions © 1997 China Central TV, © 1997 MultiCultural Broadcasting Corporation, © 1997, 1998 Trustees of the University of Pennsylvania
The Reduced Licensing Fee for this corpus is US$400.