English Gigaword was produced by the Linguistic Data Consortium (LDC),
catalog number LDC2003T05 and ISBN 1-58563-260-0, and is distributed on DVD. It is a comprehensive
archive of English newswire text data that the LDC has acquired over
several years.
Four distinct international sources of English newswire are represented here:
|Agence France Presse English Service||(afe)|
|Associated Press Worldstream English Service||(apw)|
|The New York Times Newswire Service||(nyt)|
|The Xinhua News Agency English Service||(xie)|
Much of the content in this collection has been published previously
by the LDC in a variety of other, older corpora, particularly the
North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora and the
AQUAINT text corpus (LDC2002T31).
But there is a significant amount of material
that is being released here for the first time: all of the Agence
France Presse content, the 1995 and 2001 Xinhua content, and the
portions of NYT and APW dating from February 2001 forward.
Each data file name consists of the three-letter prefix, followed by a
six-digit date (representing the year and month during which the file
contents were delivered by the respective news source), followed by a
".gz" file extension, indicating that the file contents have been
compressed using the GNU "gzip" compression utility (RFC 1952). So,
each file contains all the usable data received by LDC for the given
month from the given news source.
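As an illustration of the naming convention just described, the following is a minimal Python sketch that splits a file name into its source prefix, year and month, and reads a compressed file as text. The names parse_file_name, read_text and GigawordFile are hypothetical helpers introduced here, not part of the publication, and the sketch assumes the prefixes appear in lower case as listed above.

```python
import gzip
import re
from typing import NamedTuple

# Three-letter source prefix, six-digit date (YYYYMM), then ".gz".
FILE_NAME_RE = re.compile(r"^(afe|apw|nyt|xie)(\d{4})(\d{2})\.gz$")


class GigawordFile(NamedTuple):
    """Hypothetical record for the parts of a data file name."""
    source: str
    year: int
    month: int


def parse_file_name(name: str) -> GigawordFile:
    """Split a name such as 'afe199405.gz' into source, year and month."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised data file name: {name!r}")
    source, year, month = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in file name: {name!r}")
    return GigawordFile(source, year, month)


def read_text(path: str) -> str:
    """Decompress one gzip data file and return its contents as a string.

    The text is described as printable ASCII plus whitespace, so ASCII
    decoding is assumed here.
    """
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return handle.read()
```

For example, parse_file_name("afe199405.gz") yields the source "afe", the year 1994 and the month 5.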
All text data are presented in SGML form, using a very simple, minimal
markup structure; all text consists of printable ASCII and whitespace.
The corpus has been fully validated by a standard
SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication.
The markup structure is common to all data files, with the following provisos:
The Headline element is optional; not all DOCs have one.
The Dateline element is optional; not all DOCs have one.
Paragraph tags are used only if the "type" attribute of the DOC
is "story".
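To make the structure concrete, here is a rough Python sketch that pulls DOC units out of an uncompressed data file. It is only an approximation under stated assumptions: the tag names HEADLINE, DATELINE, TEXT and P, and the quoting of attribute values, are assumed here rather than taken from the DTD, which remains the authoritative description. The names Doc and iter_docs are hypothetical.

```python
import re
from typing import Iterator, List, NamedTuple, Optional


class Doc(NamedTuple):
    """Hypothetical container for one DOC unit."""
    doc_type: str
    headline: Optional[str]
    dateline: Optional[str]
    paragraphs: List[str]


# Assumed tag names; consult the DTD shipped with the corpus for the real ones.
DOC_RE = re.compile(r"<DOC\b([^>]*)>(.*?)</DOC>", re.DOTALL)
TYPE_RE = re.compile(r'\btype="([^"]*)"')
HEADLINE_RE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.DOTALL)
DATELINE_RE = re.compile(r"<DATELINE>(.*?)</DATELINE>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
P_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def _optional(pattern: "re.Pattern[str]", body: str) -> Optional[str]:
    """Return the whitespace-normalised element content, or None if absent."""
    match = pattern.search(body)
    return " ".join(match.group(1).split()) if match else None


def iter_docs(sgml: str) -> Iterator[Doc]:
    """Yield one Doc per DOC element found in the given SGML text."""
    for attrs, body in DOC_RE.findall(sgml):
        type_match = TYPE_RE.search(attrs)
        doc_type = type_match.group(1) if type_match else ""
        text_match = TEXT_RE.search(body)
        text = text_match.group(1) if text_match else ""
        # Paragraph tags are expected only in "story" DOCs; for other
        # types the text body is kept as a single unsplit block.
        paragraphs = [" ".join(p.split()) for p in P_RE.findall(text)]
        if not paragraphs and text.strip():
            paragraphs = [text.strip()]
        yield Doc(doc_type,
                  _optional(HEADLINE_RE, body),
                  _optional(DATELINE_RE, body),
                  paragraphs)
```

A regular-expression approach is used only because the markup is described as very simple; a real SGML parser such as nsgmls is the safer choice for validation.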
Note that all data files use the UNIX-standard "\n" form of line
termination, and text lines are generally wrapped to a width of 80
characters or less.
For this release, all sources have received a uniform treatment in
terms of quality control, and we have applied a rudimentary (and
_approximate_) categorization of DOC units into four distinct "types."
The classification is indicated by the type="string" attribute
that is included in each opening DOC tag. The four types are: story,
multi, advis and other.
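A short Python sketch of how the type attribute might be tallied across the DOC units of one uncompressed file follows. It assumes the attribute value is double-quoted inside the opening DOC tag; count_doc_types is a hypothetical helper name.

```python
import re
from collections import Counter

# The four DOC types named in the text.
DOC_TYPES = ("story", "multi", "advis", "other")

# Matches an opening DOC tag and captures the value of its type attribute.
DOC_TAG_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"[^>]*>')


def count_doc_types(sgml: str) -> Counter:
    """Count DOC units per type, rejecting any value outside the four types."""
    counts = Counter({doc_type: 0 for doc_type in DOC_TYPES})
    for doc_type in DOC_TAG_RE.findall(sgml):
        if doc_type not in DOC_TYPES:
            raise ValueError(f"unexpected DOC type: {doc_type!r}")
        counts[doc_type] += 1
    return counts
```

Because the categorization is described as approximate, such counts are best read as rough indicators rather than exact figures.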
Statistics regarding the quantities of data for each source are summarized
below. Note that the "Totl-MB" numbers show the amount of
data you get when the files are not compressed (i.e. nearly 12 gigabytes,
total); the "Gzip-MB" column shows totals for compressed file
sizes as stored on the DVD-ROM; the "K-wrds" numbers are
simply the number of whitespace-separated tokens (of all types) after
all SGML tags are eliminated.
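The "K-wrds" definition can be approximated with a few lines of Python: strip anything that looks like an SGML tag, then count whitespace-separated tokens. This is a sketch, not the LDC's own counting script; count_tokens and k_words are hypothetical names, and any SGML entity references in the text are simply left in place and counted as part of their tokens.

```python
import re

# Anything between "<" and the next ">" is treated as an SGML tag.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml: str) -> int:
    """Number of whitespace-separated tokens once SGML tags are removed."""
    # Replace tags with a space so that adjacent text is not glued together.
    return len(TAG_RE.sub(" ", sgml).split())


def k_words(sgml: str) -> float:
    """Token count expressed in thousands, as in the "K-wrds" figures."""
    return count_tokens(sgml) / 1000.0
```

Applied to the decompressed contents of every file from one source and summed, count_tokens should give a figure comparable to, though not necessarily identical with, the published numbers.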
There are no updates available at this time.
Portions © 1994-1997 and 2001-2002 Agence France-Presse, © 1994-2002
Associated Press, © 1994-2002 New York Times, © 1995-2001 Xinhua
News Agency, © 2002 Trustees of the University of Pennsylvania