


English Gigaword

Item Name: English Gigaword
Authors: David Graff, Christopher Cieri
LDC Catalog No.: LDC2003T05
ISBN: 1-58563-260-0
Release Date: Jan 28, 2003
Data Type: text
Data Source(s): newswire
Project(s): EARS, GALE, TIDES
Application(s): information retrieval, language modeling, natural language processing
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member fee: $0 for 2003 members
Non-member Fee: US $3000.00
Reduced-License Fee: US $1500.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: David Graff, Christopher Cieri. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.

Introduction

English Gigaword was produced by the Linguistic Data Consortium (LDC). It carries catalog number LDC2003T05 and ISBN 1-58563-260-0, and is distributed on DVD. It is a comprehensive archive of English newswire text that the LDC has acquired over several years.

Four distinct international sources of English newswire are represented here:
Agence France Presse English Service (afe)
Associated Press Worldstream English Service (apw)
The New York Times Newswire Service (nyt)
The Xinhua News Agency English Service (xie)

Data

Much of the content in this collection has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT text corpus (LDC2002T31). But there is a significant amount of material that is being released here for the first time: all of the Agence France Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.

Each data file name consists of the three-letter prefix, followed by a six-digit date (representing the year and month during which the file contents were delivered by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
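The naming scheme described above can be checked mechanically. The Python sketch below splits a file name into its source prefix and delivery month. It assumes that the prefixes are the lowercase codes listed in the introduction (afe, apw, nyt, xie) and that the six digits are ordered year first, then month. The helper name parse_file_name is hypothetical, and the sample name nyt200102.gz is constructed from the description rather than copied from the DVD.

```python
import re
from datetime import date

# Source codes as listed in the introduction (assumed to be lowercase in file names).
SOURCES = {"afe", "apw", "nyt", "xie"}

# Three-letter prefix, six-digit date (assumed YYYYMM), ".gz" extension.
NAME_RE = re.compile(r"([a-z]{3})(\d{4})(\d{2})\.gz")


def parse_file_name(name):
    """Return (source, first day of the delivery month) for a data file name."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source = match.group(1)
    if source not in SOURCES:
        raise ValueError(f"unknown source prefix: {source!r}")
    # date() itself rejects impossible months such as 13.
    return source, date(int(match.group(2)), int(match.group(3)), 1)


source, month = parse_file_name("nyt200102.gz")
print(source, month.year, month.month)
```

Returning a date object rather than the raw digits makes it easy to sort files chronologically or to select a range of months.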

All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication.

Please follow this link for a sample file.

The markup structure, common to all data files, can be summarized as follows:

The Headline Element is Optional -- not all DOCs have one

The Dateline Element is Optional -- not all DOCs have one

Paragraph tags are only used if the "type" attribute of the DOC happens to be "story"

Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.
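As a rough illustration of the structure summarized above, the following Python sketch pulls the optional headline, the optional dateline and any paragraphs out of each DOC unit with regular expressions. The tag spellings HEADLINE, DATELINE, TEXT and P, the "id" attribute, and the sample text are all assumptions made for the example; only DOC and its "type" attribute are named on this page, so the DTD shipped with the corpus remains the authority. The function iter_docs is a hypothetical helper, not part of the release.

```python
import re

# Assumed tag names; check them against the DTD provided with the corpus.
DOC_RE = re.compile(r"<DOC\b([^>]*)>(.*?)</DOC>", re.DOTALL)
TYPE_RE = re.compile(r'\btype="([^"]*)"')
HEADLINE_RE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.DOTALL)
DATELINE_RE = re.compile(r"<DATELINE>(.*?)</DATELINE>", re.DOTALL)
P_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def _optional(pattern, body):
    """Return the stripped content of an optional element, or None if absent."""
    match = pattern.search(body)
    return match.group(1).strip() if match else None


def iter_docs(sgml):
    """Yield one dict per DOC unit found in a string of SGML text."""
    for match in DOC_RE.finditer(sgml):
        attrs, body = match.group(1), match.group(2)
        doc_type = TYPE_RE.search(attrs)
        yield {
            "type": doc_type.group(1) if doc_type else None,
            "headline": _optional(HEADLINE_RE, body),
            "dateline": _optional(DATELINE_RE, body),
            # Empty for DOCs whose type is not "story".
            "paragraphs": [p.strip() for p in P_RE.findall(body)],
        }


# Invented sample; the ids and text are placeholders, not corpus content.
SAMPLE = """<DOC id="EXAMPLE.0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
<P>
First paragraph.
</P>
</TEXT>
</DOC>
<DOC id="EXAMPLE.0002" type="advis">
<TEXT>
Free text without paragraph tags.
</TEXT>
</DOC>
"""

for doc in iter_docs(SAMPLE):
    print(doc["type"], doc["headline"], len(doc["paragraphs"]))
```

Regular expressions are used here only because the markup is described as very simple and minimal; a validating SGML parser such as nsgmls is the safer choice for anything beyond quick inspection.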

For this release, all sources have received uniform quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types". The classification is indicated by the type="string" attribute that is included in each opening DOC tag. The four types are: story, multi, advis and other.
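A quick way to see how the four types are distributed in a file is to tally the "type" attribute of every opening DOC tag. The Python sketch below does this with a regular expression. It assumes the attribute value is always double-quoted, as in the description above; count_doc_types is a hypothetical helper and the sample string is invented.

```python
import re
from collections import Counter

# The four type values named in the text.
KNOWN_TYPES = ("story", "multi", "advis", "other")

# Matches an opening DOC tag and captures its type attribute (assumed double-quoted).
DOC_OPEN_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"[^>]*>')


def count_doc_types(sgml):
    """Count DOC units per type; raise if a type outside the four known ones appears."""
    counts = Counter(DOC_OPEN_RE.findall(sgml))
    unknown = sorted(set(counts) - set(KNOWN_TYPES))
    if unknown:
        raise ValueError(f"unexpected DOC type(s): {unknown}")
    return counts


# Invented sample with placeholder ids.
sample = (
    '<DOC id="A" type="story">\n</DOC>\n'
    '<DOC id="B" type="story">\n</DOC>\n'
    '<DOC id="C" type="other">\n</DOC>\n'
)
print(count_doc_types(sample))
```

Because the categorization is described as approximate, such counts are best read as a rough profile of a file rather than an exact inventory.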

Statistics regarding the quantities of data for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are not compressed (i.e. nearly 12 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.
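The "K-wrds" definition above (whitespace-separated tokens after all SGML tags are eliminated) can be approximated in a few lines of Python. This is a sketch of that definition, not the script LDC used: it assumes that no tag is split across lines and that replacing each tag with a space is acceptable. The function names are hypothetical. The files are read as ASCII because the text states that all content is printable ASCII and whitespace.

```python
import gzip
import re

# Anything between "<" and the next ">" is treated as an SGML tag.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml):
    """Count whitespace-separated tokens after removing SGML tags."""
    return len(TAG_RE.sub(" ", sgml).split())


def count_tokens_in_file(path):
    """Stream a gzip-compressed data file and sum the token counts line by line."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        # Line-by-line counting assumes that no tag spans a line break.
        return sum(count_tokens(line) for line in handle)


print(count_tokens('<DOC id="X" type="story">\n<P>\nTwo words\n</P>\n</DOC>\n'))
```

Streaming with gzip.open avoids decompressing a whole month of newswire to disk first; dividing the result by 1000 gives a figure comparable to the "K-wrds" column.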

Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFE         44      417     1216   170969   656269
APW         91     1213     3647   539665  1477466
NYT         96     2104     5906   914159  1298498
XIE         83      320      940   131711   679007
TOTAL      314     4054    11709  1756504  4111240
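The figures above are internally consistent, which the short Python sketch below confirms by summing each column and comparing the result with the TOTAL row. It also derives the overall compression ratio from the Totl-MB and Gzip-MB totals. The numbers are copied from the statistics above; the variable and function names are illustrative only.

```python
# (files, gzip_mb, total_mb, k_words, docs) per source, copied from the statistics.
STATS = {
    "AFE": (44, 417, 1216, 170969, 656269),
    "APW": (91, 1213, 3647, 539665, 1477466),
    "NYT": (96, 2104, 5906, 914159, 1298498),
    "XIE": (83, 320, 940, 131711, 679007),
}
TOTAL = (314, 4054, 11709, 1756504, 4111240)


def column_sums(stats):
    """Sum each column across all sources."""
    return tuple(sum(column) for column in zip(*stats.values()))


def compression_ratio(row):
    """Uncompressed size divided by gzip-compressed size for one row."""
    _, gzip_mb, total_mb, _, _ = row
    return total_mb / gzip_mb


assert column_sums(STATS) == TOTAL
print(round(compression_ratio(TOTAL), 2))
```

Overall, the uncompressed text is a little under three times the size of the compressed files on the DVD.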

Updates

There are no updates available at this time.

Content Copyright

Portions © 1994-1997 and 2001-2002 Agence France-Presse, © 1994-2002 Associated Press, © 1994-2002 New York Times, © 1995-2001 Xinhua News Agency, © 2002 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.