Introduction:
GALE Phase 1 Chinese Broadcast Conversation Parallel
Text - Part 1, Linguistic Data Consortium (LDC) catalog number LDC2009T02 and
ISBN 1-58563-499-9, contains transcripts and English translations of 20.4 hours
of Chinese broadcast conversation programming from China Central TV (CCTV) and
Phoenix TV. It does not contain the audio files from which the transcripts and
translations were generated. GALE Phase 1
Chinese Broadcast Conversation Parallel Text - Part 1, along with other corpora,
was used as training data in year 1 (Phase 1) of the DARPA-funded GALE program.
Source Data:
A total of 20.4 hours of Chinese broadcast conversation
programming were selected from two sources: CCTV (a broadcaster from Mainland
China) and Phoenix TV (a Hong Kong-based satellite TV station). The transcripts
and translations represent recordings of eight different programs.
A manual selection procedure was used to
choose data appropriate for the GALE program, namely conversation (talk) programs
focusing on current events. Stories on topics such as sports, entertainment
and business were excluded from the data set. The following table is a summary
of the files included in this release.
| Source | Program | Epoch (YYYY.MM) | #hours | #characters |
|---|---|---|---|---|
| CCTV | Across China | 2005.08 | 1.0 | 9,924 |
| CCTV | Today's Focus | 2005.11 | 2.2 | 33,805 |
| Phoenix TV | Asian Journal | 2005.09 | 2.2 | 26,656 |
| Phoenix TV | Behind the Headlines | 2005.03 - 2005.11 | 1.5 | 17,933 |
| Phoenix TV | A Date With Lu Yu | 2005.09 - 2005.10 | 7.1 | 89,987 |
| Phoenix TV | News Hacker | 2005.03 - 2005.10 | 2.3 | 39,388 |
| Phoenix TV | Newsline | 2005.10 - 2005.11 | 1.6 | 15,496 |
| Phoenix TV | Social Watch | 2005.09 - 2005.11 | 2.5 | 29,159 |
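As a quick consistency check, the per-program hours in the table above can be summed and compared with the stated total of 20.4 hours. The sketch below copies the figures from the table; the variable names are illustrative only and are not part of the corpus.

```python
from decimal import Decimal

# (source, program, hours, characters), copied from the table above.
PROGRAMS = [
    ("CCTV", "Across China", "1.0", 9924),
    ("CCTV", "Today's Focus", "2.2", 33805),
    ("Phoenix TV", "Asian Journal", "2.2", 26656),
    ("Phoenix TV", "Behind the Headlines", "1.5", 17933),
    ("Phoenix TV", "A Date With Lu Yu", "7.1", 89987),
    ("Phoenix TV", "News Hacker", "2.3", 39388),
    ("Phoenix TV", "Newsline", "1.6", 15496),
    ("Phoenix TV", "Social Watch", "2.5", 29159),
]

# Decimal avoids binary floating-point rounding when adding tenths of an hour.
total_hours = sum(Decimal(hours) for _, _, hours, _ in PROGRAMS)
total_characters = sum(chars for _, _, _, chars in PROGRAMS)

# Hours per source, to see how the total splits between the two broadcasters.
hours_by_source = {}
for source, _, hours, _ in PROGRAMS:
    hours_by_source[source] = hours_by_source.get(source, Decimal("0")) + Decimal(hours)

print(total_hours, total_characters, hours_by_source)
```

The eight rows add up to the 20.4 hours quoted in the text, with most of the material coming from Phoenix TV.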
Transcription:
The selected audio snippets were carefully transcribed
by LDC annotators and professional transcription agencies following LDC's Quick
Rich Transcription specification.
Manual sentence unit/segment (SU) annotation was also performed as part of
the transcription task. Three types of end-of-sentence SU are identified:
- statement SU
- question SU
- incomplete SU
Translation:
After transcription and SU annotation, files were
reformatted into a human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE Translation
guidelines, which describe the makeup of the translation team, the source data
format, the translation data format, best practices for translating certain
linguistic features (such as names and speech disfluencies) and quality control
procedures applied to completed translations.
TDF Format:
All final data are in Tab
Delimited Format (TDF). TDF is compatible with other transcription
formats, such as the Transcriber format and AG format, and it is
easy to process.
Each line of a TDF file
corresponds to a speech segment and contains 13 tab-delimited
fields:
| Field | Data Type |
|---|---|
| file | unicode |
| channel | int |
| start | float |
| end | float |
| speaker | unicode |
| speakerType | unicode |
| speakerDialect | unicode |
| transcript | unicode |
| section | int |
| turn | int |
| segment | int |
| sectionType | unicode |
| suType | unicode |
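The field list above maps directly onto a small parser. The following is a minimal sketch, not LDC tooling: the function names are hypothetical, and skipping lines that begin with `;;` or `file;` (possible comment or header lines) is an assumption that should be checked against the actual files.

```python
from typing import Iterable, Iterator

# Field names and types in the order given in the table above.
# "unicode" is mapped to Python's str.
TDF_FIELDS = [
    ("file", str), ("channel", int), ("start", float), ("end", float),
    ("speaker", str), ("speakerType", str), ("speakerDialect", str),
    ("transcript", str), ("section", int), ("turn", int), ("segment", int),
    ("sectionType", str), ("suType", str),
]


def parse_tdf_line(line: str) -> dict:
    """Split one tab-delimited segment line into a typed record."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(TDF_FIELDS):
        raise ValueError(f"expected {len(TDF_FIELDS)} fields, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(TDF_FIELDS, values)}


def read_tdf(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per segment line."""
    for line in lines:
        # Assumption: blank lines and header/comment lines (if any) are skipped.
        if not line.strip() or line.startswith(";;") or line.startswith("file;"):
            continue
        yield parse_tdf_line(line)


# Usage with a made-up segment line (the values are illustrative only).
sample = "sample.tdf\t0\t1.5\t3.25\tspk1\tmale\tnative\t你好\t1\t2\t3\treport\tstatement\n"
for record in read_tdf([sample]):
    print(record["start"], record["end"], record["transcript"])
```

For a file on disk, the same function can be fed an open handle, for example `read_tdf(open(path, encoding="utf-8"))`; the UTF-8 encoding is likewise an assumption.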
A source TDF file and its
translation are the same except that the transcript in the source
TDF is replaced by its English translation.
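Because a source file and its translation differ only in the transcript field, parallel sentence pairs can be obtained by reading the two files line by line. The sketch below assumes that both files list their segments in the same order and that the inputs contain segment lines only (any header or comment lines already removed); the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

NUM_FIELDS = 13
TRANSCRIPT_INDEX = 7  # "transcript" is the 8th of the 13 fields


def pair_segments(source_lines: Sequence[str],
                  translation_lines: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (Chinese transcript, English translation) pairs."""
    if len(source_lines) != len(translation_lines):
        raise ValueError("source and translation have different segment counts")
    pairs = []
    for number, (src, tgt) in enumerate(zip(source_lines, translation_lines), start=1):
        src_fields = src.rstrip("\r\n").split("\t")
        tgt_fields = tgt.rstrip("\r\n").split("\t")
        if len(src_fields) != NUM_FIELDS or len(tgt_fields) != NUM_FIELDS:
            raise ValueError(f"segment {number}: expected {NUM_FIELDS} fields")
        # Every field other than the transcript should be identical.
        src_rest = src_fields[:TRANSCRIPT_INDEX] + src_fields[TRANSCRIPT_INDEX + 1:]
        tgt_rest = tgt_fields[:TRANSCRIPT_INDEX] + tgt_fields[TRANSCRIPT_INDEX + 1:]
        if src_rest != tgt_rest:
            raise ValueError(f"segment {number}: non-transcript fields differ")
        pairs.append((src_fields[TRANSCRIPT_INDEX], tgt_fields[TRANSCRIPT_INDEX]))
    return pairs
```

The strict check on the remaining twelve fields is a design choice: it surfaces misaligned files early instead of silently pairing unrelated segments. If the files turn out to differ in some other field, the comparison can be relaxed to a subset such as start and end times.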
Sponsorship:
This work was supported in part by the Defense
Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003.
The content of this publication does not necessarily reflect the position or
the policy of the Government, and no official endorsement should be inferred.
Samples:
For an example of the data in this corpus, please examine these images of the source and translation.
Content Copyright:
Portions © 2005 China Central
TV, © 2005 Phoenix TV, © 2005 - 2007, 2009 Trustees of the University
of Pennsylvania.