Introduction
Arabic Gigaword Third Edition is a comprehensive archive of newswire text data
acquired from Arabic news sources by the Linguistic Data Consortium (LDC) at the
University of Pennsylvania.
Arabic Gigaword Third Edition includes all of the content of
Arabic Gigaword Second Edition (LDC2006T02) as well as new data collected
after the publication of that edition. Also, an archive from a new newswire
source -- Assabah -- has been included in the third edition.
The six distinct sources of Arabic newswire represented in the third edition
are:
- Agence France Presse (afp_arb)
- Assabah (asb_arb)
- Al Hayat (hyt_arb)
- An Nahar (nhr_arb)
- Ummah Press (umh_arb)
- Xinhua News Agency (xin_arb)
The seven-character codes in the parentheses above consist of the three-character
source name IDs and the three-character language code ("arb") separated
by an underscore ("_") character.
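As a small illustration of the code layout described above, the following Python sketch splits a code into its source ID and language code. The names `SOURCES` and `split_source_code` are hypothetical and introduced here only for illustration; they are not part of the corpus distribution.

```python
# Source codes and names as listed above.
SOURCES = {
    "afp_arb": "Agence France Presse",
    "asb_arb": "Assabah",
    "hyt_arb": "Al Hayat",
    "nhr_arb": "An Nahar",
    "umh_arb": "Ummah Press",
    "xin_arb": "Xinhua News Agency",
}


def split_source_code(code):
    """Split a seven-character code into (source_id, language_code).

    Hypothetical helper: checks only the shape described in the text
    (three characters, an underscore, three characters).
    """
    source_id, sep, language = code.partition("_")
    if len(code) != 7 or sep != "_" or len(source_id) != 3 or len(language) != 3:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return source_id, language


for code, name in SOURCES.items():
    source_id, language = split_source_code(code)
    print(f"{code}: source={source_id} language={language} ({name})")
```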
The epochs and document counts for the data in the third edition are set forth
below:
| Newly Added Data |
| Source | Date Span | Document Count | Notes |
| Agence France Presse | 2005.01 - 2006.12 | 137815 | |
| Assabah News Agency | 2004.09 - 2006.12 | 15410 | (new source) |
| Al Hayat News Agency | 2005.01 - 2006.1 | 8799 | (no data for 2004) |
| An Nahar News Agency | 2005.01 - 2006.12 | 104950 | (no data for 2004) |
| Xinhua News Agency | 2005.01 - 2006.12 | 135472 | |
Data
This release contains 547 files, totalling approximately 1.8 GB in compressed
form (6,673 MB uncompressed), with 576,799 K-words in 1,994,735 documents.
The table below shows data quantity by source under the following categories:
data source (Source); the number of files per source (#Files); compressed file
size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated
word tokens in the text, in thousands (K-words); and the number of documents per source (#DOCs).
| Data Sources and Quantities |
| Source | #Files | Gzip-MB | Totl-MB | K-words | #DOCs |
| afp_arb | 152 | 441 | 1806 | 147612 | 798436 |
| asb_arb | 28 | 23 | 77 | 6587 | 15410 |
| hyt_arb | 142 | 559 | 1932 | 171502 | 378353 |
| nhr_arb | 134 | 612 | 2172 | 193732 | 449340 |
| umh_arb | 24 | 4 | 14 | 1201 | 4645 |
| xin_arb | 67 | 171 | 672 | 56165 | 348551 |
| TOTAL | 547 | 1810 | 6673 | 576799 | 1994735 |
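The TOTAL row can be cross-checked against the per-source rows. The Python sketch below simply re-adds the figures copied from the table; the names `ROWS`, `TOTAL` and `column_sums` are hypothetical and used only for this illustration.

```python
# Rows copied from the table above:
# (source, files, gzip_mb, total_mb, k_words, docs)
ROWS = [
    ("afp_arb", 152, 441, 1806, 147612, 798436),
    ("asb_arb", 28, 23, 77, 6587, 15410),
    ("hyt_arb", 142, 559, 1932, 171502, 378353),
    ("nhr_arb", 134, 612, 2172, 193732, 449340),
    ("umh_arb", 24, 4, 14, 1201, 4645),
    ("xin_arb", 67, 171, 672, 56165, 348551),
]

# TOTAL row from the table: files, gzip_mb, total_mb, k_words, docs
TOTAL = (547, 1810, 6673, 576799, 1994735)


def column_sums(rows):
    """Sum each numeric column, skipping the leading source name."""
    return tuple(sum(column) for column in zip(*(row[1:] for row in rows)))


sums = column_sums(ROWS)
for label, computed, stated in zip(
    ("#Files", "Gzip-MB", "Totl-MB", "K-words", "#DOCs"), sums, TOTAL
):
    status = "ok" if computed == stated else "MISMATCH"
    print(f"{label}: computed={computed} stated={stated} {status}")
```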
All text files in this corpus have been converted to UTF-8 character encoding.
Certain data and formatting issues observed in previous releases of Arabic
Gigaword have been normalized in the third edition:
- Approximately 15,000 stories from older AFP files (1994 - 2002) contained
very brief documents where the text content was not recognized as such; in those
cases, the TEXT element appeared empty while the HEADLINE element contained
anywhere from three to several lines of text. The content of these documents
has been rearranged. The first line remains as the headline and the rest of
the lines have been moved into the text segment. All stories of this sort had
been originally classified as "other", and that classification has not been
changed in this edition.
- Al Hayat data from 2002 and 2003 contained some Arabic-Indic digits, despite
the intention to convert all digit strings to the ASCII digit characters for
consistency. The digits have now been converted to the ASCII range. For more
details about the encoding challenges presented by this data, see the readme
file accompanying this corpus.
- Some Al Hayat data had stray angle-bracket characters ("<" and ">"),
which have been rendered as "&lt;" and "&gt;". There were
also some defective "Doc-ID" strings (the 'id' attribute in the "<DOC>"
tag that begins each news story) in the January 2001 data.
- Some An Nahar data had "bare" ampersand characters ("&") which have been
rendered as "&amp;".
- Some Xinhua documents included empty sub-elements (HEADLINE,
DATELINE and/or TEXT sections containing no data); when HEADLINE
or DATELINE were empty, these tags were removed. When the TEXT
segment was empty, the document as a whole was removed.
- In several Xinhua stories, the Doc-ID string, which is supposed to provide
the year, month, date and sequence number for the story, had become garbled,
yielding an incorrect or impossible date string. A separate data file in the
"docs" directory, called "docid_changes.txt", lists the changes in document
inventory and Doc-ID strings.
- Xinhua stories typically end with a formulaic Arabic string (meaning "end-of-story"),
which should not have been included as part of the final paragraph in each story.
- In general, uniform line-wrapping was applied so that the overall text
presentation is consistent across all sources and with Gigaword releases in other
languages. The markup pattern was also applied uniformly to all sources,
without exception.
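Several of the normalizations listed above are simple character-level operations. The Python sketch below illustrates three of them: mapping Arabic-Indic digits to ASCII, escaping bare ampersands and stray angle brackets, and checking that a date string is a possible calendar date. It is only an illustration under stated assumptions, not the procedure actually used to prepare the corpus: all function names are hypothetical, and the YYYYMMDD layout assumed for the date portion of a Doc-ID is not specified in this description.

```python
import re
from datetime import datetime

# Arabic-Indic digits U+0660..U+0669 mapped to ASCII "0".."9".
ARABIC_INDIC_TO_ASCII = {0x0660 + i: ord("0") + i for i in range(10)}

# An ampersand is treated as "bare" unless it already begins
# one of the three entities used in this sketch.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt);)")


def ascii_digits(text):
    """Replace Arabic-Indic digits with their ASCII equivalents."""
    return text.translate(ARABIC_INDIC_TO_ASCII)


def escape_markup_chars(text):
    """Escape bare "&" and stray "<" / ">" in story text.

    Intended for text content only; applying it to the markup tags
    themselves would escape them as well.
    """
    text = BARE_AMPERSAND.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")


def is_possible_date(yyyymmdd):
    """Return True if an eight-digit string is a real calendar date.

    Assumption: the date is written as YYYYMMDD.
    """
    if not re.fullmatch(r"[0-9]{8}", yyyymmdd):
        return False
    try:
        datetime.strptime(yyyymmdd, "%Y%m%d")
    except ValueError:
        return False
    return True


print(ascii_digits("\u0662\u0660\u0660\u0663"))
print(escape_markup_chars("A & B < C"))
print(is_possible_date("20050231"))
```

Escaping ampersands before angle brackets matters here: doing it in the other order would turn the "&" of a freshly written "&lt;" into "&amp;lt;" unless the pattern excluded it, as the negative lookahead above does.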
Samples
For an example of the data contained in this corpus, please view this image of sample text.
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Content Copyright
Portions © 1994-2006 Agence France Presse, © 2004-2006 Assabah, ©
1994-2003, 2005-2006 Al Hayat, © 1995-2006 An Nahar, © 2003-2004 Ummah
Press Service, © 2001-2006 Xinhua News Agency, © 2003, 2005-2007 Trustees
of the University of Pennsylvania