


Multiple-Translation Arabic (MTA) Part 2

Item Name: Multiple-Translation Arabic (MTA) Part 2
Authors: Xiaoyi Ma
LDC Catalog No.: LDC2005T05
ISBN: 1-58563-328-3
Release Date: Feb 15, 2005
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): cross-lingual information retrieval, language teaching, machine translation
Language(s): English, Modern Standard Arabic
Language ID(s): arb, eng
Distribution: Web Download
Member fee: $0 for 2005 members
Non-member Fee: US$1000.00
Reduced-License Fee: US$500.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Xiaoyi Ma. 2005. Multiple-Translation Arabic (MTA) Part 2. Linguistic Data Consortium, Philadelphia.

Introduction

Multiple-Translation Arabic (MTA) Part 2 was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2005T05 and its ISBN is 1-58563-328-3.

To support the development of automatic means for evaluating translation quality, the LDC was sponsored to solicit four sets of human translations for a single set of Arabic source materials. The LDC was also asked to produce translations from various commercial off-the-shelf (COTS) systems, including commercial Machine Translation (MT) systems as well as MT systems available on the Internet. The corpus contains two sets of COTS output and one output set from a TIDES 2003 MT Evaluation participant, which is representative of state-of-the-art research systems.

To see whether automatic evaluation metrics such as BLEU track human assessment, the LDC also performed human assessment of the two COTS outputs and the TIDES research system output. The corpus includes the assessment results for one of the two COTS systems, the assessment results for the TIDES research system, and the specifications used for conducting the assessments.
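As a rough illustration of why several reference translations matter for a metric such as BLEU, the sketch below computes clipped n-gram precision for one candidate against multiple references. It covers only that single component of BLEU (no brevity penalty, no combination across n-gram orders). The function names and the English toy sentences are illustrative assumptions and are not taken from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Largest count of each n-gram observed in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Toy data (hypothetical): a degenerate candidate and two references.
candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
# Only 2 of the 7 candidate unigrams are credited after clipping.
score = modified_precision(candidate, references, 1)
```

With more references, a valid wording choice in the candidate is more likely to be matched by at least one of them, which is the motivation for collecting four human translations per story.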

Source Data Selection:

  • Xinhua News Service: 50 news stories
  • AFP News Service: 50 news stories

(total: 100 stories)

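There are 100 source files and 700 translation files: each of the 100 stories appears in seven translation sets (four human, two COTS, and one research system). All source data were drawn from the January and February 2003 collections of Xinhua Arabic newswire data and AFP Arabic newswire data.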

The story selection from the two newswire collections was controlled by story length: all selected stories contain between 700 and 1,500 Arabic characters.
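A length filter of this kind could be sketched as follows. The description does not say how characters were counted, so this sketch assumes that an "Arabic character" is any code point in the basic Arabic Unicode block (U+0600 to U+06FF), which leaves out whitespace, Latin text, and markup. The function names are hypothetical.

```python
def count_arabic_chars(text: str) -> int:
    """Count code points in the basic Arabic block (assumed definition)."""
    return sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")


def is_selectable(text: str, low: int = 700, high: int = 1500) -> bool:
    """True if the story length falls within the stated bounds, inclusive."""
    return low <= count_arabic_chars(text) <= high
```

A story of 1,500 Arabic letters followed by some Latin markup would still pass under this assumption, because non-Arabic code points are not counted.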

The overall count of Arabic words (excluding markup), by source, is as follows:

  • AFP: 7,528
  • Xinhua: 7,551
  • Total: 15,079

Samples

To see an example from this corpus, please examine this screen shot of an Arabic source file and its translation.

Please contact Xiaoyi Ma with any questions regarding this corpus.

Content Copyright

Portions © 2003 Xinhua News Agency, © 2003 Agence France-Presse, © 2004-2005 Trustees of the University of Pennsylvania


Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.