This publication contains the Arabic Newswire A Corpus, Linguistic Data
Consortium (LDC) catalog number LDC2001T55 and ISBN 1-58563-190-6. The Arabic
Newswire Corpus is composed of articles from the Agence France Presse (AFP)
Arabic Newswire. The source material was tagged using TIPSTER-style SGML and
was transcoded to Unicode (UTF-8). The corpus includes articles from May 13,
1994 to December 20, 2000.
The data are distributed as 2,337 compressed (zipped) Arabic text files. There are
209 MB of compressed data (869 MB uncompressed) with approximately
383,872 documents containing 76 million tokens over approximately 666,094
A template of the tagging is presented below.
One or More Paragraphs of Arabic Text
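As a rough illustration of how text tagged in this style might be read, the following Python sketch pulls paragraph text out of the contents of one uncompressed file. The tag names DOC and P are assumptions based on common TIPSTER conventions and are not confirmed by this description; the function name extract_paragraphs is hypothetical. Decompression is left out because the exact archive format is not specified here.

```python
import re

# Assumed tag names, typical of TIPSTER-style SGML; adjust to the actual files.
# The \b keeps <DOC> from also matching a tag such as <DOCNO>.
DOC_RE = re.compile(r"<DOC\b[^>]*>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
P_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL | re.IGNORECASE)


def extract_paragraphs(raw_bytes):
    """Decode UTF-8 bytes and return one list of paragraph strings per document."""
    text = raw_bytes.decode("utf-8")
    documents = []
    for doc_match in DOC_RE.finditer(text):
        # Collapse line breaks and repeated spaces inside each paragraph.
        paragraphs = [" ".join(p.split()) for p in P_RE.findall(doc_match.group(1))]
        documents.append(paragraphs)
    return documents
```

Regular expressions are used here only because SGML of this kind is usually not well-formed XML; a dedicated SGML parser may be a better fit if the markup turns out to be more complex.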
For a sample file of tagged articles, please see this sample.
There are no updates at this time.
Portions © 1994-2000 Agence France Presse