Chinese Gigaword Fourth Edition

Item Name: Chinese Gigaword Fourth Edition
Authors: Robert Parker, David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda
LDC Catalog No.: LDC2009T27
ISBN: 1-58563-527-8
Release Date: Sep 15, 2009
Data Type: text
Data Source(s): newswire
Project(s): GALE
Application(s): information retrieval, language modeling, natural language processing
Language(s): Mandarin Chinese
Language ID(s): cmn
Distribution: 1 DVD
Member fee: $0 for 2009 members
Non-member Fee: US $5000.00
Reduced-License Fee: US $2500.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Robert Parker, et al. 2009. Chinese Gigaword Fourth Edition. Linguistic Data Consortium, Philadelphia.


Introduction

Chinese Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T27 and ISBN 1-58563-527-8, is a comprehensive archive of newswire text data acquired over several years by the LDC. This edition includes all of the contents of Chinese Gigaword Third Edition (LDC2007T38) as well as newly collected data. In addition, four entirely new sources have been added in the fourth edition: Central News Service, Guangming Daily, People's Liberation Army Daily, and People's Daily.

The eight distinct international sources of Chinese newswire included in this edition are the following:

  • Agence France Presse (afp_cmn)
  • Central News Agency, Taiwan (cna_cmn)
  • Central News Service (cns_cmn)
  • Guangming Daily (gmw_cmn)
  • People's Daily (pda_cmn)
  • People's Liberation Army Daily (pla_cmn)
  • Xinhua News Agency (xin_cmn)
  • Zaobao Newspaper (zbn_cmn)

The seven-character codes in parentheses above are used as the directory names and in the data file names for each source, and are also used (in ALL_CAPS) as part of the unique DOC "id" string assigned to each news article.
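The relationship between the source codes and the DOC "id" strings can be sketched in Python. This is only an illustration: the helper name `source_of` and the sample id used below are hypothetical, and the only property assumed is the one stated above, namely that the upper-cased source code appears somewhere in the id.

```python
# Source codes and names as listed above.
SOURCES = {
    "afp_cmn": "Agence France Presse",
    "cna_cmn": "Central News Agency, Taiwan",
    "cns_cmn": "Central News Service",
    "gmw_cmn": "Guangming Daily",
    "pda_cmn": "People's Daily",
    "pla_cmn": "People's Liberation Army Daily",
    "xin_cmn": "Xinhua News Agency",
    "zbn_cmn": "Zaobao Newspaper",
}


def source_of(doc_id):
    """Return the lower-case source code whose ALL_CAPS form occurs in doc_id.

    Hypothetical helper: it relies only on the upper-cased code being part
    of the id, not on any particular id layout. Returns None if no code matches.
    """
    for code in SOURCES:
        if code.upper() in doc_id:
            return code
    return None


if __name__ == "__main__":
    # The id below is an invented example, not taken from the corpus documentation.
    sample_id = "XIN_CMN_EXAMPLE"
    code = source_of(sample_id)
    print(code, "->", SOURCES.get(code))
```

The lower-case code returned by such a helper is also the directory name for that source, so it could be used to locate the corresponding data files.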

Data

The original data received by the LDC from AFP, People's Liberation Army Daily, Xinhua, and Zaobao were encoded in GB-2312, those from CNA were in Big-5, and those from GMW, CNS, and People's Daily were in a combination of GB-2312 and GB-18030. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding.
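As a rough illustration of this kind of conversion (not the LDC's actual procedure, which is not described here), the following Python sketch decodes bytes in a source's original encoding and re-encodes them as UTF-8. The mapping and the function name `to_utf8` are assumptions made for the example. For the sources that mixed GB-2312 and GB-18030, the sketch uses Python's `gb18030` codec for both, since GB-18030 is a superset of GB-2312.

```python
# Original encodings per source, following the paragraph above.
# Python codec names: "gb2312", "big5", "gb18030".
ORIGINAL_ENCODING = {
    "afp_cmn": "gb2312",
    "pla_cmn": "gb2312",
    "xin_cmn": "gb2312",
    "zbn_cmn": "gb2312",
    "cna_cmn": "big5",
    # Mixed GB-2312 / GB-18030 input: decode as GB-18030 (a superset).
    "gmw_cmn": "gb18030",
    "cns_cmn": "gb18030",
    "pda_cmn": "gb18030",
}


def to_utf8(raw, source):
    """Decode raw bytes using the source's original encoding, return UTF-8 bytes.

    Hypothetical helper for illustration; raises UnicodeDecodeError if the
    bytes are not valid in the assumed encoding.
    """
    text = raw.decode(ORIGINAL_ENCODING[source])
    return text.encode("utf-8")


if __name__ == "__main__":
    # Simulate a GB-2312 encoded fragment and convert it.
    legacy_bytes = "新华社".encode("gb2312")
    utf8_bytes = to_utf8(legacy_bytes, "xin_cmn")
    print(utf8_bytes.decode("utf-8"))
```

Because the released corpus is already in UTF-8, a step like this would only matter when comparing against original feeds or other legacy-encoded material.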

New in the Fourth Edition

  • Two years' worth of new articles (January 2007 through December 2008) have been added to the Xinhua, Agence France Presse, and CNA data sets.
  • Four new data sources have been added: Guangming Daily, Central News Service, People's Daily, and People's Liberation Army Daily, covering the period from November 2006 through December 2008.

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Content Copyright

Portions © 2000-2008 Agence France Presse, © 1991-2008 Central News Agency (Taiwan), © 2006-2008 China Military Online, © 2006-2008 Chinanews.com, © 2006-2008 Guangming Daily, © 2006-2008 People's Daily, © 1998, 2000-2003 SPH AsiaOne, Ltd., © 1990-2008 Xinhua News Agency, © 2003, 2005, 2007, 2009 Trustees of the University of Pennsylvania


Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.