English Gigaword Second Edition

Item Name: English Gigaword Second Edition
Authors: David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda
LDC Catalog No.: LDC2005T12
ISBN: 1-58563-350-X
Release Date: Jul 15, 2005
Data Type: text
Data Source(s): newswire
Project(s): EARS, GALE, TIDES
Application(s): information retrieval, language modeling, natural language processing
Language(s): English
Language ID(s): eng
Distribution: 2 DVD
Member fee: $0 for 2005 members
Non-member Fee: N/A (Members Only)
Reduced-License Fee: N/A
Extra-Copy Fee: US $400.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda
2005
English Gigaword Second Edition
Linguistic Data Consortium, Philadelphia

Introduction

English Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC), catalog number LDC2005T12, ISBN 1-58563-350-X. The English Gigaword corpus is a comprehensive archive of English newswire text that the LDC has acquired over several years. This is the second edition of the corpus.

This edition includes all of the contents of the first edition of the English Gigaword corpus (LDC2003T05) as well as new data from July 2002 through December 2004. A new newswire source, the Central News Agency of Taiwan, English Service, has also been added in this edition.

The five distinct international sources of English newswire included in this release are the following:

Agence France-Presse, English Service (afp_eng)
Associated Press Worldstream, English Service (apw_eng)
Central News Agency of Taiwan, English Service (cna_eng)
The New York Times Newswire Service (nyt_eng)
The Xinhua News Agency, English Service (xin_eng)

What's New In The Second Edition

  • New newswire data from July 2002 through December 2004 have been added for all four newswire sources that were represented in the first edition.
  • A new source, the Central News Agency of Taiwan English Service (CNA_ENG), has been added.
  • We have adopted a new naming scheme for filenames and DOC IDs. The new scheme represents the source name as a three-letter code and the language name as a three-letter code.
  • Minor formatting improvements (mostly line-wrapping) have been made to some of the data contents originally published in the first edition.
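
The source identifiers listed above (afp_eng, apw_eng, cna_eng, nyt_eng, xin_eng) follow this pattern: a three-letter source code, an underscore, and a three-letter language code. As a minimal sketch, the Python below splits such an identifier into its two parts and looks up the source name. The helper names split_source_id and describe are hypothetical and not part of the corpus tooling, and the sketch covers only the source/language prefix, since the full filename and DOC ID layout is not described here.

```python
# Source codes and names as listed for this release.
SOURCES = {
    "afp": "Agence France-Presse, English Service",
    "apw": "Associated Press Worldstream, English Service",
    "cna": "Central News Agency of Taiwan, English Service",
    "nyt": "The New York Times Newswire Service",
    "xin": "The Xinhua News Agency, English Service",
}


def split_source_id(source_id):
    """Split an identifier such as 'afp_eng' into (source, language) codes.

    Hypothetical helper: upper-case input (e.g. 'CNA_ENG') is accepted
    and normalised to lower case.
    """
    source, sep, language = source_id.lower().partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"not a three-letter source/language pair: {source_id!r}")
    return source, language


def describe(source_id):
    """Return the source name plus language code for a known identifier."""
    source, language = split_source_id(source_id)
    # Raises KeyError for a source code that is not in this release.
    return f"{SOURCES[source]} [{language}]"


if __name__ == "__main__":
    for sid in ("afp_eng", "apw_eng", "cna_eng", "nyt_eng", "xin_eng"):
        print(sid, "->", describe(sid))
```

Keeping the lookup table separate from the parsing step means an identifier with a valid shape but an unknown source code fails at the lookup rather than being silently accepted.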

Content Copyright

Portions © 1994-1997 and 2001-2004 Agence France-Presse, © 1994-2004 Associated Press, © 1997-2004 Central News Agency of Taiwan, © 1994-2004 New York Times, © 1995-2004 Xinhua News Agency, © 2005 Trustees of the University of Pennsylvania

Pricing

The Reduced Licensing Fee for this corpus is US$400.


    Contact: ldc@ldc.upenn.edu

    (c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.