


Arabic Newswire English Translation Collection

Item Name: Arabic Newswire English Translation Collection
Authors: Xiaoyi Ma, Dalal Zakhary
LDC Catalog No.: LDC2009T22
ISBN: 1-58563-521-9
Release Date: Aug 18, 2009
Data Type: text
Data Source(s): newswire
Application(s): natural language processing, syntactic parsing
Language(s): Arabic, English
Language ID(s): arb, eng
Distribution: Web Download
Member fee: $0 for 2009 members
Non-member Fee: US $1500.00
Reduced-License Fee: US $750.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Xiaoyi Ma, Dalal Zakhary. 2009. Arabic Newswire English Translation Collection. Linguistic Data Consortium, Philadelphia.

Introduction

The Arabic Newswire English Translation Collection was produced by the Linguistic Data Consortium (LDC). It consists of approximately 550,000 words of Arabic newswire text, with English translations, from Agence France Presse (France), An Nahar (Lebanon) and Assabah (Tunisia). The source Arabic text was also used in LDC's Arabic Treebank, specifically in Part 1 (v. 2.0 and v. 3.0), Part 3 (v. 1.0 and v. 2.0) and Part 4 (v. 1.0). A subset of the Agence France Presse (AFP) source text from Arabic Treebank: Part 1 v. 2.0 was previously translated and released by LDC as Arabic Treebank: Part 1 - 10K-word English Translation, LDC2003T07. The English translations in this corpus were provided by translation agencies following LDC's Arabic Translation Guidelines.

Data

The number of stories and their epochs for each source are as follows:
AFP        734 stories     July 2000 - November 2000
An Nahar   600 stories     January 2002 - December 2002
Assabah    397 stories     September 2004 - November 2004
Total      1731 stories

Word count of Arabic tokens by source is shown in the following table:
AFP        102,564
An Nahar   299,681
Assabah    149,259
Total      551,504
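
As a quick consistency check on the two tables above, the per-source figures can be summed and compared with the stated totals. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the tokens-per-story figures are derived here from the published counts; they are not part of the LDC documentation.

```python
# Story and Arabic token counts copied from the two tables above.
stories = {"AFP": 734, "An Nahar": 600, "Assabah": 397}
tokens = {"AFP": 102_564, "An Nahar": 299_681, "Assabah": 149_259}

total_stories = sum(stories.values())
total_tokens = sum(tokens.values())


def tokens_per_story(source: str) -> float:
    """Average number of Arabic tokens per story for one source (derived)."""
    return tokens[source] / stories[source]


if __name__ == "__main__":
    for source in stories:
        share = 100 * tokens[source] / total_tokens
        print(f"{source:<9} {tokens_per_story(source):7.1f} tokens/story "
              f"{share:5.1f}% of all tokens")
    print(f"Total: {total_stories} stories, {total_tokens:,} tokens")
```

Both sums agree with the totals given in the tables (1731 stories and 551,504 tokens).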

The original source files used several encodings for the Arabic characters, including UTF-8 and ASMO. SGML tags marked sentence and paragraph boundaries and annotated other information about each story. All Arabic source data was converted to UTF-8, and most SGML tags were removed or replaced by "plain text" markers.
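
The kind of encoding normalization described above can be sketched in a few lines of Python. This is not LDC's actual conversion procedure, which the documentation does not describe. It assumes that "ASMO" refers to ASMO-708, the 8-bit Arabic code page equivalent to ISO-8859-6, and the function names are hypothetical.

```python
# Assumption: "ASMO" means ASMO-708, which Python exposes as the
# ISO-8859-6 codec ("iso8859_6"). Other ASMO variants are not handled.
CANDIDATE_ENCODINGS = ("utf-8", "iso8859_6")


def decode_arabic(raw: bytes) -> str:
    """Decode bytes as UTF-8 if valid, otherwise fall back to ISO-8859-6."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("input is neither valid UTF-8 nor ISO-8859-6")


def to_utf8(raw: bytes) -> bytes:
    """Return the same text re-encoded as UTF-8."""
    return decode_arabic(raw).encode("utf-8")


if __name__ == "__main__":
    legacy = b"\xd3\xe4\xc7\xe5"        # four Arabic letters in ISO-8859-6
    print(to_utf8(legacy).decode("utf-8"))
```

Trying UTF-8 first works because typical 8-bit Arabic text is rarely valid UTF-8, but the heuristic is not foolproof. When the encoding of each source file is known in advance, passing it explicitly is safer than guessing.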


Content Copyright

Portions © 2000 Agence-France Presse, © 2002 An Nahar, © 2004 Assabah, © 2002-2005, 2009 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.