Introduction
This file contains documentation for NIST Open Machine Translation 2008 Evaluation
(MT08) Selected Reference and System Translations, Linguistic Data Consortium
(LDC) catalog number LDC2010T01 and ISBN 1-58563-533-2.
NIST Open MT is an evaluation series to support research in, and help advance the state of the
art of, technologies that translate text between human languages. Participants
submit machine translation output of source language data to NIST (National
Institute of Standards and Technology); the output is then evaluated with automatic
and manual measures of quality against high-quality human translations of the
same source data. This program supports the growing interest in system combination
approaches that generate improved translations from output of several different
machine translation (MT) systems. MT system combination approaches require data
sets composed of high-quality human reference translations and a variety of
machine translations of the same text. The NIST Open Machine Translation 2008
Evaluation (MT08) Selected Reference and System Translations set addresses this
need.
The data in this release consists of the human reference translations and corresponding
machine translations for the NIST Open MT08 test sets, which consist of newswire and web data in the four
MT08 language pairs -- Arabic-to-English, Chinese-to-English, English-to-Chinese
(newswire only) and Urdu-to-English. Two documents per language pair and genre
were removed at random from the test sets for release. For the machine translations,
only output from one submission (in most cases, the participant's primary
submission) per training condition (Constrained and Unconstrained training,
where available) per participant is included. See section 2 of the MT08
Evaluation Plan for a description of the training conditions. The resulting
data set has the following characteristics:
- Arabic-to-English: 120 documents with 1312 segments, output from 17 machine
translation systems.
- Chinese-to-English: 105 documents with 1312 segments, output from 23 machine
translation systems.
- English-to-Chinese: 127 documents with 1830 segments, output from 11 machine
translation systems.
- Urdu-to-English: 128 documents with 1794 segments, output from 12 machine translation
systems.
The data is organized and annotated in such a way that subsets for each language
pair and/or data genre and/or training condition can be extracted and used separately,
depending on the user's needs.
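The subset selection described above can be sketched as follows. This is a minimal illustration only: the `Translation` record, its field names, the system identifiers, and the `select` helper are hypothetical and do not reflect the actual file format or annotation scheme of this release, which must first be parsed into records of this kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """Hypothetical in-memory record; not the release's actual schema."""
    lang_pair: str  # e.g. "Arabic-to-English"
    genre: str      # "newswire" or "web"
    training: str   # "Constrained" or "Unconstrained"
    system: str     # made-up system identifier
    text: str


def select(records, lang_pair=None, genre=None, training=None):
    """Return the records matching every criterion that is not None."""
    return [
        r for r in records
        if (lang_pair is None or r.lang_pair == lang_pair)
        and (genre is None or r.genre == genre)
        and (training is None or r.training == training)
    ]


# Toy records for illustration; the texts and system names are invented.
records = [
    Translation("Arabic-to-English", "newswire", "Constrained", "sysA", "..."),
    Translation("Arabic-to-English", "web", "Unconstrained", "sysB", "..."),
    Translation("English-to-Chinese", "newswire", "Constrained", "sysC", "..."),
]

arabic = select(records, lang_pair="Arabic-to-English")
constrained_newswire = select(records, genre="newswire", training="Constrained")
```

Leaving a criterion as `None` means "do not filter on this field", so the same helper covers any combination of language pair, genre, and training condition.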
Content Copyright
Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, An Nahar, Al
Quds - Al Arabi, Asharq Al-Awsat, Assabah, BBC, The Associated Press, China
Military Online, Chinanews.com, Daily Jang, Guangming Daily, Los Angeles Times
- Washington Post News Service, Inc., New York Times, PakTribune.com, People's
Daily Online, Xinhua News Agency, © 2007, 2009, 2010 Trustees of the University
of Pennsylvania