Introduction
The English Gigaword Corpus is a comprehensive archive of newswire
text data that has been acquired over several years by the Linguistic
Data Consortium (LDC) at the University of Pennsylvania. This is the
third edition of the English Gigaword Corpus.
This edition includes all of the contents in the previous edition
(LDC2005T12) as well as new data from the same five sources presented
there, covering the 24-month period of January 2005 through December 2006.
Also, a sixth data source (the Los Angeles Times/Washington Post
newswire service) has been added in this edition.
The six distinct international sources of English newswire included
in this edition are the following:
| Agence France-Presse, English Service | (afp_eng) |
| Associated Press Worldstream, English Service | (apw_eng) |
| Central News Agency of Taiwan, English Service | (cna_eng) |
| Los Angeles Times/Washington Post Newswire Service | (ltw_eng) |
| New York Times Newswire Service | (nyt_eng) |
| Xinhua News Agency, English Service | (xin_eng) |
The seven-character codes in the parentheses above consist of a
three-character source name abbreviation and the three-character
language code ("eng"), separated by an underscore ("_") character. The
three-letter language code conforms to LDC's internal convention
based on the new ISO 639-3 standard.
The seven-character codes are used both in the names of the directories
where the data files are found and in the prefix that appears at the
beginning of every data file name.
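As a rough illustration of the naming convention described above, the following Python sketch splits a source code into its two parts and recovers the code from the start of a file name. The function names and the sample file name "nyt_eng_200501.gz" are hypothetical and are used only for illustration; the only facts taken from this document are the six codes and the rule that each data file name begins with one of them.

```python
import re

# The six source codes listed in the table above.
SOURCE_CODES = {
    "afp_eng": "Agence France-Presse, English Service",
    "apw_eng": "Associated Press Worldstream, English Service",
    "cna_eng": "Central News Agency of Taiwan, English Service",
    "ltw_eng": "Los Angeles Times/Washington Post Newswire Service",
    "nyt_eng": "New York Times Newswire Service",
    "xin_eng": "Xinhua News Agency, English Service",
}

# Three characters of source name, an underscore, three characters of language.
CODE_PATTERN = re.compile(r"([a-z]{3})_([a-z]{3})")


def split_code(code):
    """Return (source, language) for a code such as 'afp_eng'."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a source code: {code!r}")
    return match.group(1), match.group(2)


def code_from_filename(filename):
    """Return the known source code that prefixes filename, or None."""
    prefix = filename[:7]
    return prefix if prefix in SOURCE_CODES else None


# Hypothetical file name, assumed only to start with the source code.
print(split_code("afp_eng"))
print(code_from_filename("nyt_eng_200501.gz"))
```

Looking the prefix up in a fixed table, rather than accepting any string of the right shape, means that unrelated files (documentation, for instance) are not mistaken for data files.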
As with other Gigaword releases, some of the content in this
corpus has been published previously by the LDC in a variety of other,
older corpora, particularly the North American News text corpora, the
various TDT corpora, and the AQUAINT text corpus, as well as earlier
editions of Gigaword English.
New in the Third Edition
- New newswire data from January 2005 to December 2006 have
been added for all five newswire sources that were
represented in the second edition.
- A new source, the Los Angeles Times/Washington Post newswire
service, has been added.
- A small handful of corrections to older APW data have been made to
remove a few non-English stories, clean up some character "noise",
and rectify the encoding for a few non-ASCII characters.
- The CNA content introduced in Gigaword English 2nd Edition has been
completely updated to repair data corruptions caused by occasional
character encoding problems; as a result of the update, there may be
differences in the inventory and/or ID strings of DOC elements in
this portion of the corpus, relative to the previous edition. (The
nature of the encoding problems is explained below under "SOURCE
SPECIFIC PROPERTIES".)
- Many of the files (141 out of 722) include a small number of UTF-8
"wide" characters, typically accented letters found in proper names
and borrowed words (some sources also use special punctuation marks,
non-breaking spaces, etc.).
Apart from the replacement/update of all CNA files, the data content
of the 2nd edition has been included in the present release without
modification.
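For readers who want to see which "wide" characters a given file contains, a minimal Python sketch follows. It assumes the file content is already available as UTF-8 bytes; the function name and the sample sentence are hypothetical and do not come from the corpus.

```python
from collections import Counter


def wide_characters(data):
    """Count the characters in UTF-8 bytes that fall outside the ASCII range."""
    text = data.decode("utf-8")
    # Any code point above 0x7F takes more than one byte in UTF-8.
    return Counter(ch for ch in text if ord(ch) > 0x7F)


# Made-up sample containing an accented letter and a non-breaking space.
sample = "A report from Besan\u00e7on,\u00a0France".encode("utf-8")
for char, count in wide_characters(sample).items():
    print(f"U+{ord(char):04X} occurs {count} time(s)")
```

A file that yields an empty counter is pure ASCII.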
Samples
For an example of the data in this corpus, please review this text file.
Update
The New York Times newswire text archive in this corpus contains some
articles in Spanish. A scan of the 149 monthly data files under
"nyt_eng" yielded 2517 DOC elements with the 'type="story"' attribute
where the story content was in Spanish.
The scan also disclosed 421 DOC elements with the 'type="story"'
attribute where the text content was in fact not a news story.
Two additional files have been added to the online documentation
for this corpus to identify those occurrences.
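A scan of this kind starts by selecting DOC elements according to their type attribute. The Python sketch below shows only that first step, under the assumption that each DOC open tag carries its attributes in the form type="..."; the sample markup, the ID strings, and the "other" type value are hypothetical. It does not attempt to detect the language of a story or to decide whether the text is a news story, which is what the scan described above went on to do.

```python
import re
from collections import Counter

# Matches a DOC open tag and captures the value of its type attribute.
# Closing tags (</DOC>) do not match because they begin with "</".
DOC_TAG = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"[^>]*>')


def count_doc_types(text):
    """Count DOC elements in text, grouped by their type attribute."""
    return Counter(DOC_TAG.findall(text))


# Hypothetical markup, for illustration only.
sample = """<DOC id="EXAMPLE_0001" type="story">
<TEXT>First example.</TEXT>
</DOC>
<DOC id="EXAMPLE_0002" type="other">
</DOC>
<DOC id="EXAMPLE_0003" type="story">
<TEXT>Second example.</TEXT>
</DOC>
"""

counts = count_doc_types(sample)
print(counts["story"])
```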
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Content Copyright
Portions © 1994-2006 Agence France Presse, © 1994-2006 The
Associated Press, © 1997-2006 Central News Agency (Taiwan), ©
1994-1998, 2003-2006 Los Angeles Times-Washington Post News Service,
Inc., © 1994-2006 New York Times, © 1995-2006 Xinhua News Agency,
© 2007 Trustees of the University of Pennsylvania