Arabic Gigaword was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2003T12 and its ISBN is 1-58563-271-6.
This is a comprehensive archive of newswire text data that has been
acquired from Arabic news sources by the Linguistic Data Consortium (LDC) at the University of Pennsylvania.
Four distinct sources of Arabic newswire are represented here:
  Agence France Presse    (afa)
  Al Hayat News Agency    (alh)
  An Nahar News Agency    (ann)
  Xinhua News Agency      (xin)
Much of the AFP content in this collection has been published previously by the LDC in Arabic Newswire Part 1 (LDC2001T55) and some of this content has also been included in an Arabic supplement to
TDT3 and as the Arabic component of TDT4. TDT4 also included a four
month sample from Al Hayat and An Nahar (October 2000 - January
2001). Apart from that, all of the Al Hayat, An Nahar and Xinhua Arabic
content, as well as AFP content for 2001-2002, is being released here
for the first time.
There are 319 files, totaling approximately 1.1 GB in compressed form (4,348 MB uncompressed, 391,619 K-words).
The table below presents the following categories of information:
  Source   the source of the data
  #Files   the number of files per source
  Gzip-MB  totals for compressed file sizes
  Totl-MB  totals for uncompressed file sizes (approximately 4.3 gigabytes in total)
  K-wrds   the number of space-separated tokens in the text, excluding SGML tags
All text files in this corpus have been converted to UTF-8 character encoding.
Owing to the use of UTF-8, the SGML tagging within each file shows
up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text
data, including article headlines and datelines, contain a mixture of
single-byte and multi-byte characters. In general, single-byte
characters in the text data will consist of digits and punctuation
marks (where the original source relied on ASCII punctuation codes,
rather than Arabic-specific punctuation), whereas multi-byte
characters consist of Arabic letters and a small number of special
punctuation or other symbols. This variable-width character encoding
is intrinsic to UTF-8, and all UTF-8-capable processes will handle the data accordingly.
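The single-byte versus multi-byte distinction described above can be checked directly in code. This is a minimal sketch; the `byte_width` helper is introduced here for illustration and is not part of the corpus tools.

```python
def byte_width(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# ASCII digits and punctuation occupy one byte each in UTF-8
assert byte_width("7") == 1
assert byte_width("-") == 1

# Arabic letters occupy two bytes each (e.g. U+0628, ARABIC LETTER BEH)
assert byte_width("\u0628") == 2
```

A line of Arabic text with an embedded date or score will therefore mix one-byte and two-byte characters, exactly as the paragraph above describes.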
Each data file name consists of a three-letter source prefix, followed by a
six-digit date (representing the year and month during which the file
contents were generated by the respective news source), followed by a
".gz" file extension, indicating that the file contents have been
compressed using the GNU "gzip" compression utility (RFC 1952). So,
each file contains all the usable data received by LDC for the given
month from the given news source.
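The naming convention above can be captured with a short pattern. This is a sketch only; the specific file names used below are hypothetical examples of the prefix-plus-YYYYMM form, not actual corpus files.

```python
import re

# Pattern for Gigaword data file names: <3-letter source><YYYYMM>.gz
NAME_RE = re.compile(r"^(?P<src>[a-z]{3})(?P<year>\d{4})(?P<month>\d{2})\.gz$")

m = NAME_RE.match("xin200112.gz")  # hypothetical example file name
assert m is not None
assert m.group("src") == "xin"     # Xinhua News Agency
assert m.group("year") == "2001"
assert m.group("month") == "12"
```

Grouping files by the `src` and date fields is then enough to walk the collection source by source and month by month.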
All text data are presented in SGML form, using a very simple, minimal
markup structure. The corpus has been fully validated by a standard SGML
parser utility (nsgmls), using the DTD file provided in the publication.
Unlike older corpora, the present corpus uses only the information structure that is common
to all sources and serves a clear function: headline, dateline, and
core news content (usually containing paragraphs).
All sources have received a uniform treatment in
terms of quality control, and have been categorized into three distinct "types":
  story   this type of DOC represents a coherent report on
          a particular topic or event, consisting of paragraphs and
          full sentences.
  multi   this type of DOC contains a series of unrelated
          "blurbs," each of which briefly describes a particular topic or
          event: "summaries of today's news," "news briefs
          in ... (some general area like finance or sports)," and so on.
  other   these DOCs clearly do not fall into any of the
          above types; these are things like lists of sports scores,
          stock prices, temperatures around the world, and so on.
The general strategy for categorizing DOCs into these three classes
was, for each source, to discover the most frequent clues
in the text stream that correlated with the "non-story" types. When none of
the known clues was in evidence, the DOC was classified as a "story."
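The minimal markup and type labels described above can be consumed with very lightweight parsing. The sketch below is an assumption-laden illustration: the exact attribute layout of the DOC tag and the sample document are hypothetical, based only on the headline/dateline/text structure described in this document.

```python
import re

# Hypothetical sample in the minimal SGML style described above
sample = """<DOC id="XIN_ARB_200112.0001" type="story">
<HEADLINE>example headline</HEADLINE>
<DATELINE>example dateline</DATELINE>
<TEXT>
<P>example paragraph</P>
</TEXT>
</DOC>"""

# Extract (id, type) pairs for every DOC in the stream
doc_re = re.compile(r'<DOC id="([^"]+)" type="(\w+)">(.*?)</DOC>', re.S)
docs = [(doc_id, doc_type) for doc_id, doc_type, _ in doc_re.findall(sample)]
assert docs == [("XIN_ARB_200112.0001", "story")]
```

Filtering on the `type` field is then sufficient to keep only "story" DOCs and discard "multi" and "other" material for most NLP uses.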
Previous "Gigaword" corpora (in English and Chinese) had a fourth
category, "advis" (for "advisory"), which applied to DOCs that contain
text intended solely for news service editors, not the news-reading
public. In preparing the Arabic data, the task of determining
patterns for assigning "non-story" type labels was carried out by a
native speaker of Arabic. For whatever reason, this person did
not find the "advis" category to be applicable to any of the data.
This edition of Arabic Gigaword has been superseded by a new edition, LDC2006T02.
Portions © 1994-2002 Agence France Presse,
© 1994-2001 Al Hayat News Agency,
© 1995-2002 An Nahar News Agency,
© 2001-2003 Xinhua News Agency,
© 2003 Trustees of the University of Pennsylvania