



Chinese Gigaword

Item Name: Chinese Gigaword
Authors: David Graff and Ke Chen
LDC Catalog No.: LDC2003T09
ISBN: 1-58563-230-9
Release Date: May 22, 2003
Data Type: text
Data Source(s): newswire
Project(s): EARS, GALE, TIDES
Application(s): information retrieval, language modeling, natural language processing
Language(s): Mandarin Chinese
Language ID(s): cmn
Distribution: 1 DVD
Member fee: $0 for 2003 members
Non-member Fee: US $3000.00
Reduced-License Fee: US $1500.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: David Graff and Ke Chen. 2003. Chinese Gigaword. Linguistic Data Consortium, Philadelphia.

Introduction

Chinese Gigaword was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2003T09 and its ISBN is 1-58563-230-9. It is a comprehensive archive of newswire text data that the LDC has acquired from Chinese news sources over several years.

Two distinct international sources of Chinese newswire are represented here:
Central News Agency of Taiwan (cna)
Xinhua News Agency of Beijing (xin)

Some of the Xinhua content in this collection has been published previously by the LDC in other, older corpora, particularly Mandarin Chinese News Text (LDC95T13), TREC Mandarin (LDC2000T52), and the various TDT Multilanguage Text corpora. However, all of the CNA data and a significant amount of the Xinhua material are being released here for the first time.

Data

There are 286 files, totalling approximately 1.5GB in compressed form.

The table below lists, for each source: the number of files; Gzip-MB, the total compressed file size in megabytes; Totl-MB, the total uncompressed file size in megabytes (nearly four gigabytes in all); K-wrds, which is actually the number of Chinese characters in thousands (there is no notion of "space-separated word tokens" in Chinese); and #DOCs, the number of documents.

Source   #Files   Gzip-MB   Totl-MB    K-wrds     #DOCs
CNA         144      1018      2606    735499   1649492
XIE         142       548      1331    382881    817348
TOTAL       286      1566      3937   1118380   2466840

The original data archives received by the LDC from Xinhua were encoded in GB-2312, whereas those from CNA were encoded in Big-5. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the 0readme.txt file, all characters in the text are either single-byte ASCII or multi-byte Chinese.
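The conversion described above can be illustrated with a minimal Python sketch. It is not the procedure the LDC used; it only shows the general idea of decoding legacy bytes and re-encoding them as UTF-8. The helper name `to_utf8` and the sample strings are hypothetical and are not taken from the corpus.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample strings, encoded the way each source delivered its data.
gb_bytes = "新华社".encode("gb2312")   # simplified characters, GB-2312
big5_bytes = "中央社".encode("big5")   # traditional characters, Big-5

utf8_from_gb = to_utf8(gb_bytes, "gb2312")
utf8_from_big5 = to_utf8(big5_bytes, "big5")

# Each of these characters takes two bytes in the legacy encodings
# and three bytes in UTF-8.
print(len(gb_bytes), len(utf8_from_gb))
print(len(big5_bytes), len(utf8_from_big5))
```

Once both sources are in UTF-8, the same tools can process either one without knowing where the text originally came from.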

Each data file name consists of a three-letter prefix, followed by a six-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
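As a rough illustration of the naming scheme, the sketch below splits a file name into its source prefix, year, and month. It assumes that the six digits are a four-digit year followed by a two-digit month and that nothing separates the prefix from the date; neither detail is stated above. The names `CorpusFile` and `parse_file_name`, and the example file name, are hypothetical.

```python
import re
from typing import NamedTuple


class CorpusFile(NamedTuple):
    source: str
    year: int
    month: int


# Assumed layout: three letters, then YYYYMM, then ".gz", with no separator.
FILE_NAME = re.compile(r"([A-Za-z]{3})(\d{4})(\d{2})\.gz")


def parse_file_name(name: str) -> CorpusFile:
    """Split a data file name into source prefix, year, and month."""
    match = FILE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, year, month = match.groups()
    month_number = int(month)
    if not 1 <= month_number <= 12:
        raise ValueError(f"month out of range in file name: {name!r}")
    return CorpusFile(source.lower(), int(year), month_number)


# Hypothetical example name, built from the description above.
parsed = parse_file_name("cna199101.gz")
print(parsed.source, parsed.year, parsed.month)
```

Rejecting names that do not match, rather than guessing, makes it easier to notice files that do not follow the assumed pattern.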

All text data are presented in SGML form, using a very simple, minimal markup structure. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file provided in the corpus.

Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).

All sources have received a uniform treatment in terms of quality control, and their DOCs have been categorized into four distinct "types":
story: this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on
advis: these are DOCs that the news service addresses to news editors; they are not intended for publication to the "end users"
other: these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on
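A user who wants to know how many DOCs of each type a file contains could tally them with something like the following sketch. It assumes that each document opens with a tag of the form `<DOC id="..." type="...">`; that layout is an assumption, since the exact markup is defined by the DTD shipped with the corpus and is not spelled out here. The sample text and its ids are made up, and the function name `count_doc_types` is hypothetical.

```python
import gzip
import io
import re
from collections import Counter

# Assumed opening tag: <DOC id="..." type="...">; check the corpus DTD.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="(\w+)"', re.IGNORECASE)


def count_doc_types(stream) -> Counter:
    """Count DOC elements by their type attribute, reading line by line."""
    counts = Counter()
    for line in stream:
        match = DOC_OPEN.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Made-up stand-in for one compressed data file, built in memory.
sample_sgml = (
    '<DOC id="SAMPLE.0001" type="story">\n</DOC>\n'
    '<DOC id="SAMPLE.0002" type="multi">\n</DOC>\n'
    '<DOC id="SAMPLE.0003" type="story">\n</DOC>\n'
)
compressed = gzip.compress(sample_sgml.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as stream:
    type_counts = count_doc_types(stream)

print(type_counts)
```

With a real file, a path passed to `gzip.open` would replace the in-memory buffer. Reading line by line avoids holding a whole month of uncompressed text in memory.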

The general strategy for categorizing DOCs into these four classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the three "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."
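The strategy can be sketched as a rule-based classifier that falls back to "story" when no clue matches. The clue patterns below are hypothetical English stand-ins chosen only to show the control flow; the actual clues were source-specific, were found in Chinese text, and are not listed here. The names `CLUES` and `classify_doc` are likewise hypothetical.

```python
import re

# Hypothetical clues for the three "non-story" types, checked in order.
CLUES = [
    ("advis", re.compile(r"\b(note to editors|advisory)\b", re.IGNORECASE)),
    ("multi", re.compile(r"\b(news briefs|summary of today's news)\b", re.IGNORECASE)),
    ("other", re.compile(r"\b(scores|stock prices|temperatures)\b", re.IGNORECASE)),
]


def classify_doc(text: str) -> str:
    """Return the first non-story type whose clue appears, else "story"."""
    for doc_type, pattern in CLUES:
        if pattern.search(text):
            return doc_type
    return "story"


print(classify_doc("News briefs in finance"))
print(classify_doc("The ministry announced a new policy today."))
```

Because "story" is the default, any DOC whose non-story clue is missed ends up labeled as a story, so the classification is best treated as approximate.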

Updates

There are no updates at this time.

Content Copyright

Portions © 1991-2002 Central News Agency of Taiwan, © 1990-2002 Xinhua News Agency



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.