Introduction
Multiple-Translation Chinese (MTc) Part 4 was produced by the Linguistic Data
Consortium (LDC), with catalog number LDC2006T04 and ISBN 1-58563-375-5.
To support the development of automatic means for evaluating translation
quality, the LDC was sponsored to solicit four sets of human translations for a
single set of Chinese source materials. The LDC was also asked to produce
translations from various commercial off-the-shelf (COTS) systems, including
commercial Machine Translation (MT) systems as well as MT systems available on
the Internet.
There are a total of five sets of COTS outputs and six output sets from TIDES 2003 MT Evaluation participants.
To see whether automatic evaluation metrics, such as BLEU, track human assessment,
the LDC has also performed human assessment on one of the five COTS outputs and on
the six TIDES research systems. The corpus includes these assessment results and
the specifications used for conducting the assessments.
Data
Source Data Selection
Two sources of journalistic Chinese text were selected to provide the Chinese material:
- Xinhua News Agency: 50 news stories
- AFP News Service: 50 news stories
(total: 100 stories)
There are 100 source files, and 1,100 translation files. All source data were drawn from LDC's January and February 2003 collection of Xinhua news Chinese data and AFP Chinese data.
The story selection from the two newswire collections was controlled by story
length: all selected stories contain between 280 and 605 Chinese characters.
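The length criterion can be sketched as a simple filter. This is a minimal
illustration, not the LDC's actual selection script: the function names
count_chinese_chars and select_by_length are hypothetical, and the sketch
assumes that "Chinese characters" means code points in the CJK Unified
Ideographs block (U+4E00 to U+9FFF), which this documentation does not specify.

```python
from __future__ import annotations


def count_chinese_chars(text: str) -> int:
    """Count code points in the CJK Unified Ideographs block (an assumption)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def select_by_length(stories: dict[str, str], low: int = 280, high: int = 605) -> list[str]:
    """Return the IDs of stories whose Chinese character count lies in [low, high]."""
    return [
        doc_id
        for doc_id, text in stories.items()
        if low <= count_chinese_chars(text) <= high
    ]


# Toy inputs built by repetition; these are not corpus stories.
toy_stories = {
    "too_short": "新华社" * 10,    # 30 characters
    "in_range": "新华社" * 100,    # 300 characters
    "too_long": "新华社" * 300,    # 900 characters
}
selected = select_by_length(toy_stories)
```

Markup, digits, and punctuation are ignored by this count, so a different
definition of "character" could move a borderline story in or out of the range.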
The overall count of Chinese words (excluding markup), by source, is shown in the following table:
Source    Chinese words
AFP       22,450
Xinhua    19,650
------------------------
Total     42,100
The Chinese data contain approximately 21K words, while the English translations contain 396K words in total and 16K unique words.
Source Data Preparation for Human Translation
The original source files used GB-2312 encoding for the Chinese characters, and
SGML tags for marking sentence and paragraph boundaries and other information
about each story. The character encoding has been left unaltered. To make things easier for the translators,
nearly all SGML tags were removed or replaced by "plain text" markers.
Specifically, each story was presented to the human translators in the following format:
--Segment 1--
{Chinese text to be translated}
--Segment 2--
{Chinese text to be translated}
--Segment 3--
{Chinese text to be translated}
...
Each --Segment-- marker corresponds to one Chinese sentence.
The rationale for using the term "segment" instead of
"sentence" was to discourage the translators from inserting
additional "-Sentence-" markers if a Chinese sentence was translated
into two or more English sentences. The
markers were intended to ensure that the resulting translations would be
easily alignable to the source texts, so extra care was taken to make sure
that they would be kept intact and properly oriented.
Some cleaning had to be
done for all the files to conform to the above format, including splitting very
long segments into smaller chunks and adding segment markers.
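As an illustration of why the markers make alignment easy, the sketch below
reads a file in the plain-text format shown above and returns its segments
keyed by number. The name parse_segments is hypothetical, and the exact marker
spelling is assumed to match the "--Segment N--" lines in the example; this is
not the tool the LDC used.

```python
from __future__ import annotations

import re

# A marker line looks like "--Segment 1--" (assumed from the format shown above).
SEGMENT_MARKER = re.compile(r"^--Segment (\d+)--\s*$")


def parse_segments(text: str) -> dict[int, str]:
    """Map each segment number to the text that follows its marker."""
    segments: dict[int, list[str]] = {}
    current = None
    # splitlines() handles both UNIX and MS-DOS line terminators.
    for line in text.splitlines():
        match = SEGMENT_MARKER.match(line)
        if match:
            current = int(match.group(1))
            segments[current] = []
        elif current is not None:
            segments[current].append(line)
    return {num: "\n".join(lines).strip() for num, lines in segments.items()}
```

Because a translator may render one Chinese sentence as several English
sentences, the parser keeps everything between two markers as one unit rather
than splitting on sentence punctuation.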
As a last step, all files were converted from UNIX-style line termination
(new-line only) to MS-DOS-style (carriage-return plus line-feed), on the
assumption that most (possibly all) translators would use MS-Windows-based
editors.
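The line-termination conversion can be sketched at the byte level, which leaves
the GB-2312 character encoding untouched. The function name to_crlf is
hypothetical. The sketch assumes the usual EUC-CN byte layout of GB-2312, in
which both bytes of a two-byte character have the high bit set, so a line-feed
byte can never occur inside a Chinese character.

```python
def to_crlf(data: bytes) -> bytes:
    """Convert new-line-only terminators to carriage-return plus line-feed."""
    # Collapse any existing CRLF pairs first so applying the function twice
    # does not produce doubled carriage returns.
    normalized = data.replace(b"\r\n", b"\n")
    return normalized.replace(b"\n", b"\r\n")


# Toy input, not corpus text.
unix_bytes = "--Segment 1--\n新华社\n".encode("gb2312")
dos_bytes = to_crlf(unix_bytes)
```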
Human Translation Procedure and Quality Assessment
Each initially selected translation team received the translation guidelines
and a sample pair of source and translation (excluded from the final release)
for review. After the team indicated that they understood the task requirements and would be willing to participate
in the project, 100 news stories were sent to them.
In accordance with the guidelines, each translation team was asked to return
the first five AFP stories for quality checking.
This was to ensure that the translation team was indeed following the guidelines and that the translation quality was acceptable. The
LDC returned translations to the translation team whenever deviations from
the guidelines or quality issues were detected.
Subsequent translation submissions were continuously monitored for conformance
and quality. Once the full set of translations was complete, a final pass of reformatting and validation was
carried out to assure alignability of segments, and to convert the translated
texts into SGML format.
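One alignability check of the kind described here can be sketched as a
comparison of marker sequences between a source file and its translation. This
is an assumption about what such a check might look like, not a description of
the LDC's validation pass; the names segment_numbers and alignment_problems are
hypothetical, and the sketch works on the plain-text "--Segment N--" format
given to translators rather than on the final SGML.

```python
from __future__ import annotations

import re

# Marker lines such as "--Segment 3--", allowing a trailing carriage return.
MARKER = re.compile(r"^--Segment (\d+)--[ \t\r]*$", re.MULTILINE)


def segment_numbers(text: str) -> list[int]:
    """Return the segment numbers in the order their markers appear."""
    return [int(num) for num in MARKER.findall(text)]


def alignment_problems(source: str, translation: str) -> list[str]:
    """List reasons why a translation cannot be aligned segment by segment."""
    problems = []
    src = segment_numbers(source)
    tgt = segment_numbers(translation)
    if src != list(range(1, len(src) + 1)):
        problems.append("source markers are not numbered 1..N in order")
    if tgt != src:
        problems.append(f"translation markers {tgt} do not match source markers {src}")
    return problems
```

A translation in which a marker was deleted, renumbered, or replaced with some
other label produces a non-empty list and can be sent back for correction.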
Each translation team was also asked to complete and return a questionnaire to
describe their procedures and professional background.
Machine Translation Procedure
Complete sets of automatic MT translations were also produced by submitting
the 100 stories to each of the five publicly-available MT systems.
Starting from the original SGML text format, special alterations were made to
the files on an as-needed basis, so that they would be accepted and handled correctly by the various systems. Also, the systems differed in
terms of the input and retrieval methods required to submit the source data
for translation and to save the translated text in alignable form.
Human Assessment Procedure
The goal of this effort is to evaluate the quality of TIDES
research systems, human translation teams, and commercial
off-the-shelf (COTS) systems. Translations are evaluated on the basis of
adequacy and fluency. Adequacy
refers to the degree to which the translation communicates information present
in the original source language text. Fluency refers to the degree to which
the translation is well-formed
according to the grammar of the target language.
Final Data Format and Validation
For the present release, the corpus content is organized into source
and translation directories.
Within translation there is a
separate subdirectory for each translation service or system, identified as
follows:
Human translators: E01 E02 E03 E04
COTS systems: E05 E06 E07 E08 E09
Research systems: E11 E12 E14 E15 E17 E22
The source directory and each of the human and COTS translation subdirectories
contain 100 files, one news story per file. Corresponding file names are
identical across all directories, consisting of "docid.sgm."
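The naming convention makes it easy to check a local copy of the corpus. The
sketch below compares the file names in each human and COTS subdirectory with
those in the source directory. The function name mismatched_systems is
hypothetical, and the directory names "source" and "translation" and the
".sgm" extension are assumed from the description above; adjust them if the
release differs.

```python
from __future__ import annotations

from pathlib import Path

# System identifiers as listed above.
HUMAN = ["E01", "E02", "E03", "E04"]
COTS = ["E05", "E06", "E07", "E08", "E09"]


def mismatched_systems(root: Path) -> dict[str, set[str]]:
    """For each human/COTS system, report file names that differ from source."""
    source_names = {p.name for p in (root / "source").glob("*.sgm")}
    report: dict[str, set[str]] = {}
    for system in HUMAN + COTS:
        names = {p.name for p in (root / "translation" / system).glob("*.sgm")}
        # Symmetric difference: files missing from, or extra in, this system.
        diff = source_names ^ names
        if diff:
            report[system] = diff
    return report
```

An empty result means every listed subdirectory holds exactly the same set of
file names as the source directory. The research-system directories are left
out because the text above states the 100-file property only for the source,
human, and COTS directories.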
Within each source file, the content is formatted in SGML as follows:
[Chinese text in GB-2312 character encoding]
[Chinese text in GB-2312 character encoding]
...
Ranking of Manual Translations
Ranking of manual translations was performed by two LDC staff members, one a
Chinese-dominant bilingual and the other an English native monolingual. There was
overall agreement on the ranking between the two and minor discrepancies were
resolved through discussion and comparison of additional files. The ranking
for the manual translations is:
best-----------------------------worst
E01 > E02 > E03 > E04
The ranking method was unstructured and somewhat casual -- it is not intended
to be definitive, or even accountable.
Samples
For an example of the data provided in this corpus, please review the following samples:
Content Copyright
Portions © 2003 Xinhua News Agency, © 2003 Agence France-Presse, © 2005-2006 Trustees of the University of Pennsylvania