Arabic Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number
LDC2009T30 and ISBN 1-58563-532-4, is a comprehensive archive of Arabic newswire
text that has been acquired over several years at LDC. Arabic Gigaword Fourth
Edition includes all of the content of Arabic Gigaword Third Edition
(LDC2007T40) as well as newly collected data. In
addition, three new sources have been added in the fourth edition: Al-Ahram,
Asharq Al-Awsat and Al-Quds Al-Arabi.
Nine distinct international sources of Arabic newswire are represented here:
- Al-Ahram (ahr_arb)
- Asharq Al-Awsat (aaw_arb)
- Agence France Presse (afp_arb)
- Assabah (asb_arb)
- Al Hayat (hyt_arb)
- An Nahar (nhr_arb)
- Al-Quds Al-Arabi (qds_arb)
- Ummah Press (umh_arb)
- Xinhua News Agency (xin_arb)
The seven-character codes shown above serve both as the names of the
directories where the data files are found and as the prefix that appears at the
beginning of every file name. Each code consists of a three-character source
ID and the three-character language code ("arb"), separated by an underscore
("_") character.
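The naming convention described above can be illustrated with a short Python sketch. It is only an illustration: the function name `split_prefix` and the lookup table are hypothetical helpers, not part of the corpus distribution, and nothing is assumed about the portion of a file name that follows the seven-character prefix.

```python
import re

# Source IDs and names as listed in this description.
SOURCES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three-character source ID, underscore, three-character language code.
PREFIX_RE = re.compile(r"^([a-z]{3})_([a-z]{3})")


def split_prefix(name):
    """Return (source_id, language_code) from a directory or file name.

    Only the leading seven characters are examined; whatever follows
    the prefix in a file name is ignored.
    """
    match = PREFIX_RE.match(name)
    if match is None:
        raise ValueError(f"no source prefix found in {name!r}")
    return match.group(1), match.group(2)


source_id, language = split_prefix("afp_arb")
print(SOURCES[source_id], language)
```

Because the same code names both the directory and the file prefix, one helper of this kind can be applied to either.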
These news services all use Modern Standard Arabic (MSA), so there should be
a fairly limited scope for orthographic and lexical variation due to regional
Arabic dialects. However, to the extent that regional dialects might have an
influence on MSA usage, the following should be noted:
- Al-Ahram is based in Cairo, Egypt.
- Asharq Al-Awsat is based in London, England, UK.
- An Nahar is based in Beirut, Lebanon.
- Al Hayat was originally a Lebanese news service, but it has been based in
London during the entire period represented in this archive.
- Assabah is based in Tunisia.
- The Xinhua and Agence France Presse (AFP) services are international in
scope (Xinhua is based in Beijing, AFP in Paris), and the regional
distribution of Arabic reporters and editors for these services is not known.
- The content provided by Ummah Press comes from diverse sources throughout
the Arabic-speaking world.
- Al-Quds Al-Arabi is based in London, England, UK.
New in the Fourth Edition
- New Sources
This release marks the first edition of Arabic Gigaword to include content
from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi covering the period
from November 2006 through December 2008.
- New Data for Existing Sources
This release contains all data collected by LDC from January 2007 through
December 2008, except for Ummah Press, for which data from January 2005 through
December 2008 is included.
The table below shows data quantity by source under the following categories:
data source (Source); the number of files per source (#Files); compressed file
size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated
word tokens in the text (K-words); and the number of documents per source (#DOCs).
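As a rough illustration of the K-words measure, the Python sketch below counts whitespace-separated tokens in a string and expresses the result in thousands. The function names are hypothetical, and the exact tokenization LDC used is not specified here beyond "space-separated", so splitting on any run of whitespace is an assumption.

```python
def count_tokens(text):
    """Count tokens separated by whitespace (assumed tokenization)."""
    # str.split() with no arguments splits on any run of whitespace
    # and discards empty strings.
    return len(text.split())


def k_words(text):
    """Return the token count in thousands."""
    return count_tokens(text) / 1000.0


sample = "one two  three\nfour"
print(count_tokens(sample), k_words(sample))
```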
For an example of the data contained in this corpus, please examine this jpeg image of the text content.
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Portions © 1994-2008 Agence France Presse, © 2006-2008 Al-Ahram,
© 2006-2008 Al-Quds Al-Arabi, © 2006-2008 Asharq Al-Awsat, ©
2004-2008 Assabah, © 1994-2003, 2005-2008 Al Hayat, © 1995-2008 An
Nahar, © 2003-2008 Ummah Press, © 2001-2008 Xinhua News Agency, ©
2003, 2006, 2007, 2009 Trustees of the University of Pennsylvania