


Arabic News Translation Text Part 1

Item Name: Arabic News Translation Text Part 1
Authors: Xiaoyi Ma, Dalal Zakhary, and Moussa Bamba
LDC Catalog No.: LDC2004T17
ISBN: 1-58563-307-0
Release Date: Sep 23, 2004
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): cross-lingual information retrieval, language teaching, machine translation
Language(s): English, Modern Standard Arabic
Language ID(s): arb, eng
Distribution: Web Download
Member fee: $0 for 2004 members
Non-member Fee: US $3000.00
Reduced-License Fee: US $1500.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Xiaoyi Ma, Dalal Zakhary, and Moussa Bamba. 2004. Arabic News Translation Text Part 1. Linguistic Data Consortium, Philadelphia.

Introduction

Arabic News Translation Text Part 1 was produced by the Linguistic Data Consortium (LDC) and is released under catalog number LDC2004T17 and ISBN 1-58563-307-0.

To support the development of automatic machine translation systems, the LDC was sponsored to solicit English translations for a single set of Arabic source materials. The source Arabic text was selected and translated in several LDC projects between November 2002 and February 2004. About 441K Arabic words were selected from three sources: Xinhua, AFP, and An Nahar. Eight translation agencies provided the translations, and each Arabic news story was translated once.

The Xinhua and An Nahar stories and their translations were created for TIDES Machine Translation, while the AFP stories and their English translations were created for TIDES TDT. The development of all these translations followed roughly the same guidelines and procedures.

Data

Three sources of journalistic Arabic text were selected to provide the Arabic material:

 - AFP News Service:    250 news stories, October 1998 - December 1998
 - Xinhua News Service: 670 news stories, November 2001 - March 2002
 - An Nahar:            606 news stories, October 2001 - December 2002
(total: 1,526 stories)

The overall count of Arabic words by source is shown in the following table:

AFP       44,193
Xinhua    99,514
An Nahar 297,533
----------------
total    441,240

The Arabic data contains about 441K words. The English translations contain approximately 581K words in total, of which about 25K are unique.
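
The story and word counts listed above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, the per-source figures are copied from the lists above, and the English total is the rounded 581K figure, so the resulting expansion ratio is approximate.

```python
# Per-source figures copied from the lists above.
stories = {"AFP": 250, "Xinhua": 670, "An Nahar": 606}
arabic_words = {"AFP": 44_193, "Xinhua": 99_514, "An Nahar": 297_533}

# Rounded figure from the description ("approximately 581K words").
approx_english_words = 581_000

total_stories = sum(stories.values())
total_words = sum(arabic_words.values())

for source, words in arabic_words.items():
    share = 100 * words / total_words        # percent of all Arabic words
    per_story = words / stories[source]      # mean Arabic words per story
    print(f"{source:<9} {share:5.1f}% of words, {per_story:6.1f} words/story")

# Approximate number of English words per Arabic word.
expansion = approx_english_words / total_words
print(f"stories: {total_stories}, Arabic words: {total_words}")
print(f"English/Arabic word ratio (approximate): {expansion:.2f}")
```

The per-story averages show that An Nahar stories are considerably longer than the AFP and Xinhua stories, which is why An Nahar contributes most of the Arabic words despite supplying fewer stories than Xinhua.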

Each translation team was provided with translation guidelines. In accordance with the guidelines, each team was asked in each project to return its first five stories for quality checking. This was to ensure that the team had understood and was following the guidelines, and that the translation quality was acceptable. When the LDC detected deviations from the guidelines or quality issues, it sent the translations back to the translation team. Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure that the segments could be aligned and to convert the translated texts into SGML format. An Arabic-English bilingual LDC employee then went through all the source data and English translations and fixed any problems found.

For the present release, the corpus content is organized into source and translation directories, containing 1,526 files in source and 1,526 files in translation, one news story per file.
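
A minimal sketch of how the two directories might be paired for processing follows. It rests on assumptions that this description does not confirm: that the directories are literally named `source` and `translation`, and that a story and its translation share the same file name. The function name `pair_stories` is hypothetical.

```python
from pathlib import Path


def pair_stories(root):
    """Pair each file in root/source with a same-named file in root/translation.

    Assumption: matching stories share a file name; this is not stated
    in the corpus description.
    Returns (pairs, unmatched), where pairs is a list of
    (source_path, translation_path) tuples and unmatched is a sorted list
    of file names present in only one of the two directories.
    """
    root = Path(root)
    source = {p.name: p for p in (root / "source").iterdir() if p.is_file()}
    translation = {p.name: p for p in (root / "translation").iterdir() if p.is_file()}

    shared = sorted(source.keys() & translation.keys())
    pairs = [(source[name], translation[name]) for name in shared]
    unmatched = sorted(source.keys() ^ translation.keys())
    return pairs, unmatched
```

If the naming assumption holds for this release, `pair_stories` would return 1,526 pairs and an empty `unmatched` list; a non-empty list would indicate that the files follow a different naming scheme.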

Content Copyright

Portions © 2001-2002 An Nahar, © 2001-2002 Xinhua News Agency, © 1998 Agence France-Presse, © 2004 Trustees of the University of Pennsylvania.



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.