


Arabic Newswire Part 1

Item Name: Arabic Newswire Part 1
Authors: David Graff and Kevin Walker
LDC Catalog No.: LDC2001T55
ISBN: 1-58563-190-6
Data Type: text
Data Source(s): newswire
Project(s): EARS, GALE, TIDES, TREC
Application(s): information retrieval, language modeling
Language(s): Modern Standard Arabic
Language ID(s): arb
Distribution: 1 CD
Member fee: $0 for 2001 members
Non-member Fee: US $1200.00
Reduced-License Fee: US $600.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: David Graff and Kevin Walker. 2001. Arabic Newswire Part 1. Linguistic Data Consortium, Philadelphia.

Introduction

This publication contains the Arabic Newswire A Corpus, Linguistic Data Consortium (LDC) catalog number LDC2001T55 and ISBN 1-58563-190-6. The Arabic Newswire Corpus is composed of articles from the Agence France Presse (AFP) Arabic Newswire. The source material was tagged using TIPSTER-style SGML and was transcoded to Unicode (UTF-8). The corpus includes articles from May 13, 1994 to December 20, 2000.

Data

The data is distributed as 2,337 compressed (zipped) Arabic text files. There are 209 MB of compressed data (869 MB uncompressed), comprising approximately 383,872 documents and 76 million tokens over approximately 666,094 unique words.
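As a rough sanity check on these figures, the arithmetic can be sketched in Python. The constant and function names are illustrative, the sizes are assumed to be megabytes, and the token count uses the rounded "76 million" figure, so the results are approximate.

```python
# Figures quoted in the corpus description; names are illustrative only.
FILES = 2_337
COMPRESSED_MB = 209          # assumed to mean megabytes
UNCOMPRESSED_MB = 869        # assumed to mean megabytes
DOCUMENTS = 383_872
TOKENS = 76_000_000          # "76 million" is a rounded figure
UNIQUE_WORDS = 666_094


def corpus_ratios() -> dict[str, float]:
    """Derive a few summary ratios from the published corpus statistics."""
    return {
        "compression_ratio": UNCOMPRESSED_MB / COMPRESSED_MB,
        "documents_per_file": DOCUMENTS / FILES,
        "tokens_per_document": TOKENS / DOCUMENTS,
        "tokens_per_unique_word": TOKENS / UNIQUE_WORDS,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the data compresses at roughly 4 to 1, each file holds about 164 documents on average, and a typical document has just under 200 tokens.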

A template of the tagging is presented below.

   		      
   yyyymmdd_AFP_ARB.dddd
   
Arabic Text
Arabic Text

One or More Paragraphs of Arabic Text

Arabic Text
Arabic Text

For a sample file of tagged articles, please see this sample.
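The document identifier in the template, yyyymmdd_AFP_ARB.dddd, can be split into its parts with a short Python sketch. This assumes that yyyymmdd is the article date and that dddd is a four-digit sequence number; the names `DocId` and `parse_doc_id` are hypothetical and are not part of any LDC tooling.

```python
import re
from datetime import date, datetime
from typing import NamedTuple

# Pattern taken from the template: yyyymmdd_AFP_ARB.dddd
# Assumption: eight ASCII digits for the date, four for the sequence number.
DOC_ID_RE = re.compile(r"([0-9]{8})_AFP_ARB\.([0-9]{4})")


class DocId(NamedTuple):
    day: date       # article date (assumed meaning of yyyymmdd)
    sequence: int   # per-day sequence number (assumed meaning of dddd)


def parse_doc_id(doc_id: str) -> DocId:
    """Split an identifier such as '19940513_AFP_ARB.0001' into date and sequence."""
    match = DOC_ID_RE.fullmatch(doc_id.strip())
    if match is None:
        raise ValueError(f"not a recognised document ID: {doc_id!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    day = datetime.strptime(match.group(1), "%Y%m%d").date()
    return DocId(day, int(match.group(2)))


if __name__ == "__main__":
    print(parse_doc_id("19940513_AFP_ARB.0001"))
```

Parsing the date explicitly makes it straightforward to check that an identifier falls within the corpus range of May 13, 1994 to December 20, 2000.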

Updates

There are no updates at this time.

Content Copyright

Portions © 1994-2000 Agence France Presse



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.