


Arabic Gigaword Second Edition

Item Name: Arabic Gigaword Second Edition
Authors: David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda
LDC Catalog No.: LDC2006T02
ISBN: 1-58563-371-2
Release Date: Jan 19, 2006
Data Type: text
Data Source(s): newswire
Application(s): information retrieval, language modeling, natural language processing
Language(s): Modern Standard Arabic
Language ID(s): arb
Distribution: 1 DVD
Member fee: $0 for 2006 members
Non-member Fee: US $3000.00
Reduced-License Fee: US $1500.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: David Graff, et al. 2006. Arabic Gigaword Second Edition. Linguistic Data Consortium, Philadelphia.

Introduction

Arabic Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC) at the University of Pennsylvania; its catalog number is LDC2006T02 and its ISBN is 1-58563-371-2. It is a comprehensive archive of newswire text data that LDC has acquired from Arabic news sources.

Arabic Gigaword Second Edition includes all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as new data.

Five distinct sources of Arabic newswire are represented here:

Agence France Presse (afp_arb; formerly afa)
Al Hayat News Agency (hyt_arb; formerly alh)
An Nahar News Agency (nhr_arb; formerly ann)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb; formerly xia)

The seven-character codes in parentheses above consist of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_"). The language code is the ISO 639-3 code for Standard Arabic. The first edition of the Arabic Gigaword corpus used a simpler three-character code to identify both the source and the language. The new convention distinguishes data sets by source and language more naturally when a single newswire provider distributes data in multiple languages.

Ummah Press is a new source added to the Second Edition. The following table shows the new data that appear for the first time in the Second Edition.

Agence France Presse 2003.01-2004.12 143,766 documents
Al Hayat News Agency 2002.01-2003.12 64,308 documents
An Nahar News Agency 2003.01-2004.01 16,316 documents
Ummah Press 2003.01-2004.12 4,641 documents
Xinhua News Agency 2003.06-2004.12 106,236 documents

Data

There are 423 files, totaling approximately 1.4 GB in compressed form (5,359 MB uncompressed, 481,906 K-words, and 1,591,987 documents).

The table below presents the following information for each source: #Files gives the number of files; Gzip-MB gives the total compressed file size; Totl-MB gives the total uncompressed file size (approximately 5.3 gigabytes overall); K-wrds gives the number of space-separated tokens in the text, in thousands, excluding SGML tags; and #DOCs gives the number of documents.


Source    #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFP_ARB      128      355     1429   123594   660621
HYT_ARB      119      524     1861   169100   369555
NHR_ARB      109      457     1649   151078   344084
UMH_ARB       24        4       13     1201     4645
XIN_ARB       43      103      407    36933   213082
TOTAL        423     1443     5359   481906  1591987
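The K-words figures count space-separated tokens with SGML tags excluded. A minimal Python sketch of that kind of count follows. It assumes that anything between "<" and ">" is a tag and that tokens are separated by any whitespace; the function name count_tokens and the tag names in the sample are illustrative, and this is not necessarily the procedure LDC used.

```python
import re

# Assumption: anything between "<" and ">" is SGML markup, not text.
_TAG_RE = re.compile(r"<[^>]*>")


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    without_tags = _TAG_RE.sub(" ", text)
    return len(without_tags.split())


# Hypothetical sample; the tag names are placeholders, not taken from the DTD.
sample = "<DOC>\n<P>\n\u0628\u064a\u0631\u0648\u062a 2004 news\n</P>\n</DOC>\n"
print(count_tokens(sample))
```

Dividing the result by 1,000 gives a figure on the same scale as the K-wrds column.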


All text files in this corpus have been converted to UTF-8 character encoding.

Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data will consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.
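The variable-width behavior described above can be observed directly. The following Python sketch reports how many bytes each character occupies in UTF-8; the helper names and the sample strings are illustrative and do not come from the corpus.

```python
def utf8_widths(text: str) -> list:
    """Return the number of UTF-8 bytes used by each character in text."""
    return [len(ch.encode("utf-8")) for ch in text]


def is_single_byte_line(line: str) -> bool:
    """True if every character in the line is a one-byte (ASCII) character."""
    return all(width == 1 for width in utf8_widths(line))


tag_line = "<DOC>"  # hypothetical markup line, ASCII only
# Illustrative text line: five Arabic letters, a space, and four ASCII digits.
text_line = "\u0628\u064a\u0631\u0648\u062a 2004"

print(utf8_widths(tag_line))           # every entry is 1
print(utf8_widths(text_line))          # Arabic letters take 2 bytes, the rest 1
print(is_single_byte_line(text_line))  # False
```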

Each data file name consists of the seven-character prefix, an underscore ("_"), and a six-digit date (the year and month during which the news source generated the file contents), followed by a ".gz" extension indicating that the file has been compressed with the GNU "gzip" utility (RFC 1952). Each file therefore contains all the usable data that LDC received from the given news source for the given month.
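The naming scheme can be decoded mechanically. The sketch below, in Python, splits a file name into source, language, year, and month. The function name, the FileInfo type, and the example file name are illustrative; the example name is constructed from the scheme described above rather than copied from the distribution.

```python
import re
from typing import NamedTuple


class FileInfo(NamedTuple):
    source: str    # three-character source ID, e.g. "afp"
    language: str  # three-character language code, e.g. "arb"
    year: int
    month: int


# prefix (source_language), underscore, six-digit date (YYYYMM), ".gz"
_NAME_RE = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")


def parse_file_name(name: str) -> FileInfo:
    """Split a data file name into its source, language, year, and month."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a Second Edition file name: {name!r}")
    source, language, year, month = match.groups()
    return FileInfo(source, language, int(year), int(month))


print(parse_file_name("afp_arb_200301.gz"))
```

A file named this way could then be read as text with Python's standard gzip.open(path, "rt", encoding="utf-8").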

All text data are presented in SGML form, using a very simple, minimal markup structure. The file gigaword_a.dtd in the "dtd" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.

Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).

All sources have received uniform treatment in terms of quality control, and every DOC has been assigned to one of three distinct "types":
story: a DOC of this type is a coherent report on a particular topic or event, consisting of paragraphs and full sentences
multi: a DOC of this type contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)" and so on
other: these DOCs clearly do not fall into either of the above types; they are things like lists of sports scores, stock prices, temperatures around the world, and so on

The general strategy for categorizing DOCs into these three classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."

Other "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic, and (for whatever reason) this person did not find the "advis" category to be applicable to any of the data.

As described in the Introduction, the Second Edition uses a new naming scheme for file names and document IDs. Every document in the first edition of the Arabic Gigaword corpus can be mapped to the same document in this edition by changing the prefix of its DOC ID and file name as shown below. Upper-case letters are used for DOC IDs; lower-case letters are used for file and directory names. The table includes the underscore character that connects the seven-character prefix to the date.

Old   New
AFA   AFP_ARB_
ALH   HYT_ARB_
ANN   NHR_ARB_
XIA   XIN_ARB_
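The prefix mapping above can be applied mechanically. The Python sketch below assumes that the trailing underscore applies to every row, as the preceding paragraph states, and that first-edition names begin with the three-letter code followed directly by the date. The function name and the sample ID and file name are hypothetical and serve only to illustrate the substitution.

```python
# First-edition prefix -> Second Edition prefix (upper case, as used in DOC IDs).
OLD_TO_NEW = {
    "AFA": "AFP_ARB_",
    "ALH": "HYT_ARB_",
    "ANN": "NHR_ARB_",
    "XIA": "XIN_ARB_",
}


def convert_prefix(name: str) -> str:
    """Replace a first-edition prefix with its Second Edition equivalent.

    Upper-case prefixes (DOC IDs) stay upper case; lower-case prefixes
    (file and directory names) stay lower case. Names that do not start
    with a first-edition prefix are returned unchanged.
    """
    old = name[:3]
    new = OLD_TO_NEW.get(old.upper())
    if new is None:
        return name
    if old.islower():
        new = new.lower()
    return new + name[3:]


# Hypothetical first-edition DOC ID and file name, for illustration only.
print(convert_prefix("AFA20020101.0001"))
print(convert_prefix("ann200201.gz"))
```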

Samples

For an example of the data in this corpus, please examine this screenshot which is an image of the text from a single file.

Content Copyright

Portions © 1994-2004 Agence France Presse, © 1994-2003 Al Hayat News Agency, © 1995-2004 An Nahar News Agency, © 2001-2004 Xinhua News Agency, © 2003-2004 Ummah Press, © 2005-2006 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.