MetricsMATR is a series of research challenge events for machine translation
(MT) metrology, promoting the development of innovative, even revolutionary,
MT metrics that correlate highly with human assessments of MT quality. In this
program, participants submit their metrics to the National
Institute of Standards and Technology (NIST). NIST runs those metrics on
certain held-back test data for which it has human assessments measuring quality
and then calculates correlations between the automatic metric scores and the
human assessments.
This release contains the development data received by participants in the NIST Metrics
for Machine Translation 2008 Evaluation (MetricsMATR08). Specifically, this corpus consists of a subset of the materials used in the NIST
Open MT06 evaluation and includes human reference translations, system
translations, and human assessments of adequacy and preference. The source data
consists of twenty-five Arabic language newswire documents with a total
of 249 segments. The data in each segment includes four human reference
translations in English and system translations from eight different MT06 machine
translation systems. In addition to the data and reference translations, this
release includes software tools for evaluation and reporting, and documentation
describing how the human assessments were obtained and how they are represented
in the data. The evaluation
plan contains further information and rules on the use of this data.
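As a rough illustration of how an automatic metric might be compared against the adequacy assessments in this release, the following Python sketch computes Pearson and Spearman correlations between two lists of segment-level scores. The function names, the score scale, and the toy numbers are assumptions made for illustration only; they are not taken from the NIST tools, and the evaluation plan remains the authority on the analyses actually performed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: one automatic metric score and one human adequacy
# score per segment. These numbers are invented, not drawn from the corpus.
metric_scores = [0.31, 0.52, 0.47, 0.80, 0.15]
adequacy_scores = [3, 5, 4, 7, 2]

print("Pearson :", pearson(metric_scores, adequacy_scores))
print("Spearman:", spearman(metric_scores, adequacy_scores))
```

Reporting a rank-based coefficient alongside Pearson is a common choice when human scores sit on an ordinal scale, since it does not assume a linear relationship between the metric and the judgments.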
The MetricsMATR program seeks to overcome several drawbacks of the methods
currently employed for the evaluation of MT technology. Automatic metrics have
not yet proved able to predict the usefulness and reliability of MT technologies
with confidence. Nor have automatic metrics demonstrated that they are meaningful
in target languages other than English. Human assessments, however, are expensive,
slow, subjective and difficult to standardize. These problems, and the need
to overcome them through the development of improved automatic (or even semi-automatic)
metrics, have been a constant point of discussion at past NIST MT evaluation
events. MetricsMATR aims to provide a platform to address these shortcomings.
Specifically, the goals of MetricsMATR are:
- To inform other MT technology evaluation campaigns and conferences with
regard to improved metrology.
- To establish an infrastructure that encourages the development of innovative
metrics.
- To build a diverse community that will bring new perspectives to MT metrology.
- To provide a forum for MT metrology discussion and for establishing future
directions of MT metrology.
The MetricsMATR08 development data set released here reflects the
test data set only to a degree; the test data set contains more varied
data -- from more genres, more source languages, more systems and different
evaluations -- than this development data set. There are also more types of
human assessments for the test data. The MetricsMATR08 test data remains unseen
to allow for repeated use as test data.
The software used for obtaining the human judgments included in this data set
is the same software used for the NIST
Open MT08 human assessments. It includes a description of the adequacy and
preference assessment tasks and the instructions given to the judges. All segments
assessed were judged by two independent judges. Adequacy judgments were performed
for all segments of each document. Preference judgments were performed
for the first four segments of each document such that full pair-wise comparisons
between all eight MT systems were obtained. All judgments were performed against
only one reference translation. Each reported score is an adjudicated score
over the two individual judgments.
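To make the scale of the assessment design concrete, the short Python sketch below derives item counts from the figures stated in this description (25 documents, 249 segments, eight systems, preference judgments on the first four segments of each document). It assumes that every document has at least four segments and that each unordered system pair is compared once per segment; the system identifiers are hypothetical placeholders. The counts are items to be judged, not individual judgments, since each item was assessed by two judges.

```python
from itertools import combinations

# Figures stated in the release description.
NUM_DOCUMENTS = 25
NUM_SEGMENTS = 249
NUM_SYSTEMS = 8
PREFERENCE_SEGMENTS_PER_DOCUMENT = 4

# Hypothetical system identifiers; the real names are defined in the data.
systems = [f"system{i:02d}" for i in range(1, NUM_SYSTEMS + 1)]

# Full pair-wise comparison: every unordered pair of distinct systems.
system_pairs = list(combinations(systems, 2))
pairs_per_segment = len(system_pairs)

# Assumption: each document contributes exactly its first four segments.
preference_segments = NUM_DOCUMENTS * PREFERENCE_SEGMENTS_PER_DOCUMENT
preference_items = preference_segments * pairs_per_segment

# Adequacy was judged for all segments, one item per system translation.
adequacy_items = NUM_SEGMENTS * NUM_SYSTEMS

print("system pairs per segment:", pairs_per_segment)
print("preference segments     :", preference_segments)
print("preference items        :", preference_items)
print("adequacy items          :", adequacy_items)
```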
The results of MetricsMATR08 on the test data for the submitted metrics
are publicly available. NIST performed the same analyses on the MetricsMATR08
development data after the evaluation. These results are not publicly available,
but will likely be available on request in the future by contacting email@example.com.
For an example of the data in this release, please examine these sample scores and judgments.
Portions © 2006 Agence France-Presse, © 2006, 2008, 2009 Trustees
of the University of Pennsylvania