Arabic Gigaword Fifth Edition
Introduction
Arabic Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number
LDC2011T11 and ISBN 1-58563-595-2, was produced by LDC. It is a comprehensive
archive of newswire text data that has been acquired from Arabic news sources
by LDC at the University of Pennsylvania. Arabic Gigaword Fifth Edition includes
all of the content of the fourth edition of Arabic Gigaword (LDC2009T30)
plus new data covering the period from January 2009 through December 2010.
Nine distinct sources of Arabic newswire are represented here:
- Asharq Al-Awsat (aaw_arb)
- Agence France Presse (afp_arb)
- Al-Ahram (ahr_arb)
- Assabah (asb_arb)
- Al Hayat (hyt_arb)
- An Nahar (nhr_arb)
- Al-Quds Al-Arabi (qds_arb)
- Ummah Press (umh_arb)
- Xinhua News Agency (xin_arb)
The seven-character codes shown above serve as both the names of the
directories where the data files are found and the 7-letter prefix that
appears at the beginning of every file name. Each 7-letter code
consists of a three-character source ID and the three-character
language code (arb), separated by an underscore (_) character.
The three-character language code conforms to the
ISO 639-3 standard.
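As a minimal sketch of the naming convention above, the following Python splits a seven-character code into its source ID and language code. The `SOURCES` table and the function name `split_source_code` are illustrative only and are not part of the corpus distribution.

```python
# Source codes and names as listed above; the dict itself is illustrative.
SOURCES = {
    "aaw_arb": "Asharq Al-Awsat",
    "afp_arb": "Agence France Presse",
    "ahr_arb": "Al-Ahram",
    "asb_arb": "Assabah",
    "hyt_arb": "Al Hayat",
    "nhr_arb": "An Nahar",
    "qds_arb": "Al-Quds Al-Arabi",
    "umh_arb": "Ummah Press",
    "xin_arb": "Xinhua News Agency",
}


def split_source_code(code):
    """Split a code such as 'afp_arb' into (source_id, language_code)."""
    source_id, sep, language = code.partition("_")
    # Both halves must be exactly three characters, joined by one underscore.
    if sep != "_" or len(source_id) != 3 or len(language) != 3:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return source_id, language
```

For example, `split_source_code("afp_arb")` returns `("afp", "arb")`.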
In addition to adding new data, the following updates were made:
- Repeated documents in Asharq Al-Awsat data from 2008 were removed.
- Document formatting and docid duplication problems were corrected
in Agence France Presse (AFP) data.
- Significant duplication of content in 2007-2008 An Nahar data was detected,
and the duplicated documents were removed.
More details about these changes can be found in the included readme
file.
Data
All text data are presented in SGML form, using a very simple, minimal
markup structure. For every opening tag (DOC, HEADLINE, DATELINE, TEXT, P), there is
always a corresponding closing tag. The attribute values in the
DOC tag are always presented within double quotes. The id= attribute
of DOC consists of the 7-letter source abbreviation (in CAPS), an
underscore character, an 8-digit date string representing the date of
the story (YYYYMMDD), a period, and a 4-digit sequence number starting
at 0001 for each date (e.g. XIN_ARB_200101.0001). In this way,
every DOC in the corpus is uniquely identifiable by the id string.
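The id layout described above (source abbreviation, underscore, YYYYMMDD date, period, 4-digit sequence number) can be checked with a short regular expression. This is a sketch, not an official tool: the function name `parse_doc_id` is hypothetical, and the id used in the usage note is a made-up value chosen only to fit the stated pattern.

```python
import re
from datetime import datetime

# 7-letter source abbreviation in caps, underscore, 8-digit date,
# period, 4-digit sequence number.
DOC_ID_RE = re.compile(r"([A-Z]{3}_[A-Z]{3})_(\d{8})\.(\d{4})")


def parse_doc_id(doc_id):
    """Return (source, date, sequence_number) for a DOC id string."""
    match = DOC_ID_RE.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"unexpected DOC id format: {doc_id!r}")
    source, date_str, seq = match.groups()
    # strptime also rejects impossible dates such as month 13.
    story_date = datetime.strptime(date_str, "%Y%m%d").date()
    return source, story_date, int(seq)
```

With a hypothetical id, `parse_doc_id("XIN_ARB_20100315.0001")` yields the source `"XIN_ARB"`, the date 2010-03-15, and the sequence number 1. Strings that do not match the pattern raise `ValueError`.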
For this release, all sources have received a uniform treatment in
terms of quality control, and we have applied a rudimentary (and
_approximate_) categorization of DOC units into three distinct types.
The classification is indicated by the type=string attribute
that is included in each opening DOC tag. The three types are:
- story: This is by far the most frequent type, and it represents the
most typical newswire item: a coherent report on a particular topic
or event, consisting of paragraphs and full sentences.
- multi: This type of DOC contains a series of unrelated blurbs,
each of which briefly describes a particular topic or event. This is
typically applied to DOCs that contain summaries of today's news,
news briefs in ... (some general area like finance or sports), and
so on.
- other: This represents DOCs that clearly do not fall into any of
the above types -- in general, items of this type are intended for
broad circulation (they are not advisories), they may be topically
coherent (unlike multi type DOCs), and they typically do not
contain paragraphs or sentences (they aren't really stories);
these are things like lists of sports scores, stock prices,
temperatures around the world, and so on.
Other Gigaword corpora (e.g., in English and Chinese) have a fourth category,
advis (for advisory), which applies to DOCs that contain text intended solely
for news service editors, not the news-reading public. The task of determining
patterns for assigning non-story type labels was carried out by a native speaker
of Arabic, and the advis category was determined to be inapplicable to the
data.
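To see how the type labels are distributed in a given file, one can tally the type attribute of each opening DOC tag. The sketch below assumes the file has already been read into a string; `count_doc_types` and the sample ids are hypothetical, and the code simply counts whatever type values it finds.

```python
import re
from collections import Counter

# Capture the double-quoted value of type= inside an opening DOC tag.
DOC_TYPE_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"')


def count_doc_types(sgml_text):
    """Count DOC units per type label in a string of SGML."""
    return Counter(DOC_TYPE_RE.findall(sgml_text))


# Made-up opening and closing tags, for illustration only.
SAMPLE = """
<DOC id="AFP_ARB_20100315.0001" type="story">
</DOC>
<DOC id="AFP_ARB_20100315.0002" type="multi">
</DOC>
<DOC id="AFP_ARB_20100315.0003" type="story">
</DOC>
"""

print(count_doc_types(SAMPLE))
```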
Note that the markup was applied algorithmically, using logic that was
based on less-than-complete knowledge of the data. For the most part,
the HEADLINE, DATELINE and TEXT tags have their intended content, but
due to the inherent variability (and the inevitable source errors) in
the data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (hence show up within TEXT),
or where an initial sentence or paragraph has been mistakenly tagged
as the headline or dateline.
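Because the markup is minimal and every opening tag has a closing tag, a small regex-based reader is often enough to pull out the fields. The following is a rough sketch under stated assumptions: the inner tags carry no attributes, `iter_docs` is a hypothetical helper, and the sample document is invented. Since the headline and dateline may be missing or mis-tagged, both are returned as `None` when absent rather than treated as errors.

```python
import re

DOC_RE = re.compile(r"<DOC\b([^>]*)>(.*?)</DOC>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
PARA_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def _element(body, tag):
    """Return the stripped content of the first <tag>...</tag>, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
    return match.group(1).strip() if match else None


def iter_docs(sgml_text):
    """Yield one dict per DOC: attributes, headline, dateline, paragraphs."""
    for attrs, body in DOC_RE.findall(sgml_text):
        doc = dict(ATTR_RE.findall(attrs))  # e.g. id and type
        doc["headline"] = _element(body, "HEADLINE")
        doc["dateline"] = _element(body, "DATELINE")
        text = _element(body, "TEXT") or ""
        paragraphs = [p.strip() for p in PARA_RE.findall(text)]
        # DOCs without P tags keep their TEXT content as a single item.
        doc["paragraphs"] = paragraphs or ([text] if text else [])
        yield doc


# Invented document, for illustration only.
SAMPLE = """<DOC id="AFP_ARB_20100315.0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
<P>
First paragraph.
</P>
<P>
Second paragraph.
</P>
</TEXT>
</DOC>
"""

for doc in iter_docs(SAMPLE):
    print(doc["id"], doc["type"], doc["headline"], len(doc["paragraphs"]))
```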
Sample
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency,
GALE Program Grant No. HR0011-06-1-0003. The content of this publication does
not necessarily reflect the position or policy of the Government, and no official
endorsement should be inferred.
Updates
None at this time.
Content Copyright
Portions © 1994-2010 Agence France Presse, © 2006-2010 Al-Ahram,
© 2006-2010 Al-Quds Al-Arabi, © 2006-2010 Asharq Al-Awsat, ©
2004-2010 Assabah, © 1994-2003, 2005-2010 Al Hayat, © 1995-2010 An
Nahar, © 2003-2010 Ummah Press, © 2001-2010 Xinhua News Agency, ©
2003, 2006, 2007, 2009, 2011 Trustees of the University of Pennsylvania