Introduction
Chinese Gigaword Third Edition is a comprehensive archive of Chinese newswire
text data that the LDC has acquired over several years. This edition includes
all of the content of Chinese Gigaword Second Edition (LDC2005T14) as well as
new data collected after the publication of that edition. The third edition
also adds an archive of articles from a new newswire source, Agence France
Presse.
The four distinct international sources of Chinese newswire included
in this edition are the following:
- Agence France Presse (afp_cmn)
- Central News Agency, Taiwan (cna_cmn)
- Xinhua News Agency (xin_cmn)
- Zaobao Newspaper (zbn_cmn)
The seven-character codes in the parentheses above are used for the
directory names and data files for each source, and are also used (in
ALL_CAPS) as part of the unique DOC "id" string assigned to each news
article.
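As a minimal sketch of how the source codes relate to DOC ids, the Python snippet below looks for the ALL_CAPS form of each code inside an id string. The function name and the sample id in the usage note are hypothetical; the full layout of the id string is not described here, so the code assumes only that the uppercase code appears somewhere in it.

```python
# Source codes as listed above, mapped to the source names.
SOURCES = {
    "afp_cmn": "Agence France Presse",
    "cna_cmn": "Central News Agency, Taiwan",
    "xin_cmn": "Xinhua News Agency",
    "zbn_cmn": "Zaobao Newspaper",
}


def source_code_for(doc_id):
    """Return the lowercase source code whose ALL_CAPS form occurs in doc_id.

    Hypothetical helper: it assumes only that the uppercase code is a
    substring of the id, and returns None when no known code is found.
    """
    for code in SOURCES:
        if code.upper() in doc_id:
            return code
    return None
```

For example, given an illustrative (not actual) id such as "XIN_CMN_EXAMPLE", `source_code_for` would return "xin_cmn", which is also the directory name for that source.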
Data
The original data archives received by the LDC from Agence France Presse, Xinhua
News Agency and Zaobao were encoded in GB-2312, whereas those from Central News
Agency (CNA) were encoded in Big-5. To avoid the problems and confusion that
could result from differences in character-set specifications, all text files
in this corpus have been converted to UTF-8 character encoding.
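The kind of conversion described above can be sketched in Python as follows. This is an illustration only, not the LDC's actual conversion procedure: the table and function names are hypothetical, and the sketch uses Python's standard "gb2312" and "big5" codecs, which may not match the exact character-set variants of the original archives.

```python
# Original encoding per source, as stated in the paragraph above.
# The codec names are Python's standard ones (an assumption about
# which character-set variant applies).
ORIGINAL_ENCODING = {
    "afp_cmn": "gb2312",
    "xin_cmn": "gb2312",
    "zbn_cmn": "gb2312",
    "cna_cmn": "big5",
}


def to_utf8(raw, source_code):
    """Decode bytes in the source's original encoding and re-encode as UTF-8.

    Hypothetical helper: raises UnicodeDecodeError if the bytes are not
    valid in the assumed original encoding.
    """
    text = raw.decode(ORIGINAL_ENCODING[source_code])
    return text.encode("utf-8")
```

Because the released files are already UTF-8, users of the corpus should not need this step; it only shows why a single target encoding removes the GB-2312 versus Big-5 distinction.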
New in the Third Edition
- Over six years' worth of articles (October 2000 through December 2006)
from Agence France Presse are being released for the first time.
- Two years' worth of new articles (January 2005 through December 2006)
have been added to the Xinhua data set.
- Nearly two years' worth of content was added to the CNA data set.
There was a gap in the LDC's collection from this source during 2006:
no CNA Chinese content was collected between July 27 and December 17,
2006, inclusive, so there are no data files for August through
November of that year, and the December data file is about half its
expected size.
- A small set of older stories (October through December 1998) has been added
from Zaobao; these were previously published by LDC as part of TDT3 Multilanguage
Text Version 2.0 (LDC2001T58) and are being included in Gigaword for the first
time.
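The effect of the 2006 CNA collection gap on the monthly data files can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes one data file per calendar month, as the description of the gap implies.

```python
from datetime import date, timedelta

# Gap boundaries as stated above (both dates inclusive).
GAP_START = date(2006, 7, 27)
GAP_END = date(2006, 12, 17)


def in_cna_gap(day):
    """True if no CNA Chinese content was collected on this day."""
    return GAP_START <= day <= GAP_END


def months_without_data(year=2006):
    """Return the months of `year` in which every day falls inside the gap."""
    missing = []
    for month in range(1, 13):
        first = date(year, month, 1)
        # First day of the following month, used to count the days.
        if month == 12:
            next_first = date(year + 1, 1, 1)
        else:
            next_first = date(year, month + 1, 1)
        n_days = (next_first - first).days
        days = (first + timedelta(days=n) for n in range(n_days))
        if all(in_cna_gap(d) for d in days):
            missing.append(month)
    return missing
```

Under these assumptions, `months_without_data()` yields months 8 through 11 (August through November), while July and December are only partly affected, which is consistent with the December file being about half its expected size.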
Samples
Please examine this sample (JPEG) for an example of the data in this corpus.
Content Copyright
Portions © 2000-2006 Agence France Presse, © 1991-2006 Central News
Agency (Taiwan), © 1998, 2000-2003 SPH AsiaOne, Ltd., © 1990-2006
Xinhua News Agency, © 2003, 2005, 2007 Trustees of the University of Pennsylvania