Introduction
Multiple-Translation Chinese (MTC) Part 2 was produced by the Linguistic Data Consortium (LDC). Its catalog number is
LDC2003T17 and its ISBN is 1-58563-275-9.
To support the development of automatic means for evaluating translation
quality, the LDC was sponsored to solicit four sets of human translations for a
single set of Mandarin Chinese source materials. The LDC was also asked to
produce translations from various commercial off-the-shelf (COTS) systems,
including commercial Machine Translation (MT) systems
as well as MT systems available on the Internet. There are a total of six sets of
COTS outputs, and one set of outputs from a TIDES MT Evaluation participant,
which is representative of state-of-the-art research systems.
To see whether automatic evaluation systems, such as BLEU, track human assessment,
the LDC has also performed human assessment on two of the six COTS outputs and on the TIDES research
system output. The corpus includes the assessment results for these two COTS systems, the assessment results for the
TIDES research system, and the specifications used for conducting the assessments.
A similar corpus, Multiple-Translation Chinese Corpus,
was published in 2002. Both the 2002 corpus and the present corpus use Chinese news articles from the Xinhua and Zaobao news services,
and both provide human and COTS translations. However, Part 2 also offers translations from a TIDES research system
and provides human assessment of some of the automatic translations.
Data
Source Data Selection
Two sources of journalistic Mandarin Chinese text were selected to provide the Chinese material:
- Xinhua News Service: 70 news stories
- Zaobao News Service: 30 news stories
(total: 100 stories)
The Xinhua data were drawn from the March and April 2002 collection of Xinhua
news. The Zaobao data were drawn from the March 2002 collection of Zaobao's
online news service.
The story selection from the two newswire collections was controlled
by story length: all selected stories contain between about 212 and 707 Chinese characters. The overall count
of Chinese characters by source is shown in the following table:
Xinhua 25247
Zaobao 14009
--------------
total 39256
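The length control described above can be illustrated with a minimal sketch. It assumes that a "Chinese character" is any code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF); the corpus documentation does not say how characters were actually counted (for example, whether punctuation was included), and the function names are hypothetical.

```python
def count_chinese_chars(text: str) -> int:
    """Count code points in the CJK Unified Ideographs block.

    Assumption: punctuation, digits and Latin letters are not counted.
    The counting rule actually used for the corpus is not documented here.
    """
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def within_length_bounds(text: str, low: int = 212, high: int = 707) -> bool:
    """Return True if the story length falls in the stated selection range."""
    return low <= count_chinese_chars(text) <= high
```

Applied to the decoded text of each candidate story, `within_length_bounds` would keep only stories in the stated range of about 212 to 707 characters.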
Zaobao is a news portal from Singapore and many of its news stories
are translations from other news agencies' releases.
The Chinese data contain approximately 20K words,
while the English translations contain approximately 258K words in total, with 13K unique words.
Source Data Preparation for Human Translation
The original source files
used GB-2312 encoding for the Chinese characters, and SGML tags for
marking sentence and paragraph boundaries and other information about
each story. The character encoding has been left unaltered. To make
things easier for translators, nearly all SGML tags were removed or
replaced by "plain text" markers.
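A rough sketch of this kind of preparation is shown below. It decodes GB-2312 bytes with Python's built-in `gb2312` codec and removes anything that looks like an SGML tag. The tag names in the usage example (`DOC`, `P`) and the function names are hypothetical, and the sketch simply drops tags; it does not reproduce the "plain text" markers that the LDC substituted for some of them.

```python
import re

# Matches anything of the form <...>; a simplification of real SGML syntax.
TAG_RE = re.compile(r"<[^>]+>")


def decode_story(raw: bytes) -> str:
    """Decode GB-2312 bytes, replacing any undecodable bytes."""
    return raw.decode("gb2312", errors="replace")


def strip_sgml_tags(text: str) -> str:
    """Remove tag-like spans and drop the empty lines they leave behind."""
    no_tags = TAG_RE.sub("", text)
    lines = (line.strip() for line in no_tags.splitlines())
    return "\n".join(line for line in lines if line)


# Usage with a made-up story fragment (tag names are illustrative only).
raw = "<DOC>\n<P>\n新华社北京电\n</P>\n</DOC>\n".encode("gb2312")
plain = strip_sgml_tags(decode_story(raw))
```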
Human Translation Procedure and Quality Assessment
The four best translation teams were chosen to take part in the project
from the 11 teams that had participated in the translation of
Multiple Translation Chinese Corpus Part 1 (LDC2002T01).
In accordance with the guidelines, each translation team was asked to
return the first 10 Xinhua stories for quality checking. This was to
ensure that the translation team had indeed understood and was
following the guidelines and that the translation quality was acceptable.
The LDC returned the translations to the translation team whenever
deviations from the guidelines or quality issues were detected.
Subsequent translation submissions were continuously monitored for
conformance and quality. Once the full set of translations was
complete, a final pass of reformatting and validation was carried out,
to ensure that segments could be aligned, and to convert the translated
texts into SGML format.
Each translation team was also asked to fill out and return a
questionnaire to describe their procedures and professional
background.
Machine Translation Procedure
Complete sets of automatic MT translations were also produced by
submitting the 100 stories to each of six publicly available MT systems.
Four of these were commercial MT software packages (off-the-shelf
products), and two were free web-based services. Starting from the
original SGML text format, special alterations were made to the files
on an as-needed basis so that they would be accepted and handled
correctly by the various systems. The systems also differed in
the input and retrieval methods required to submit the source data
for translation and to save the translated text in alignable form.
Human Assessment Procedure
The goal of this effort is to evaluate the quality of translations from the
TIDES research system, human translation teams, and commercial
off-the-shelf (COTS) systems. Translations are evaluated on the basis of
adequacy and fluency. Adequacy refers to the degree to which the
translation communicates information present in the original source
language text. Fluency refers to the degree to which the translation is
well-formed according to the grammar of the target language.
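As an illustration of how adequacy and fluency judgments might be summarized per system, the sketch below averages the two scores separately. It assumes each judgment is a `(system, adequacy, fluency)` tuple with numeric scores; the actual scale and file format are defined in the assessment specifications included with the corpus, not here. The system names and values in the usage example are invented toy data, not corpus results.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(judgments):
    """Average adequacy and fluency per system.

    `judgments` is an iterable of (system, adequacy, fluency) tuples.
    This record layout is an assumption made for illustration.
    """
    by_system = defaultdict(lambda: {"adequacy": [], "fluency": []})
    for system, adequacy, fluency in judgments:
        by_system[system]["adequacy"].append(adequacy)
        by_system[system]["fluency"].append(fluency)
    return {
        system: {name: mean(values) for name, values in scores.items()}
        for system, scores in by_system.items()
    }


# Toy input only: hypothetical systems and made-up scores.
toy = [("sysA", 4, 3), ("sysA", 2, 5), ("sysB", 5, 5)]
summary = mean_scores(toy)
```

Keeping adequacy and fluency as separate averages preserves the distinction drawn above: a translation can be fluent without conveying the source content, and the reverse.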
Updates
There are no updates available at this time.
Content Copyright
Portions © 2003, Trustees of the University of Pennsylvania, © 2002 Xinhua News Agency, © 2002 SPH AsiaOne Ltd.