Introduction
Chinese Gigaword Fifth Edition was produced by the Linguistic Data Consortium (LDC). It is a comprehensive
archive of newswire text data that has been acquired from Chinese news sources
by LDC at the University of Pennsylvania. Chinese Gigaword Fifth Edition includes
all of the content of the fourth edition of Chinese Gigaword (LDC2009T27)
plus new data covering the period from January 2009 through December 2010.
Eight distinct sources of Chinese newswire are represented here:
- Agence France Presse (afp_cmn)
- Central News Agency, Taiwan (cna_cmn)
- Central News Service (cns_cmn)
- Guangming Daily (gmw_cmn)
- People's Daily (pda_cmn)
- People's Liberation Army Daily (pla_cmn)
- Xinhua News Agency (xin_cmn)
- Zaobao Newspaper (zbn_cmn)
The seven-character codes in the parentheses above are used for the directory
names and data files for each source and are also used (in ALL_CAPS) as part
of the unique DOC id string assigned to each news article.
Articles covering the period from January 2009 through December 2010 have been
added to the Agence France Presse, Central News Agency (CNA), Central News Service,
Guangming Daily, People's Liberation Army Daily and Xinhua News Agency data
sets. The data from People's Daily covers the period from late June 2009 through
December 2010. No new data from Zaobao has been added. Additionally, Zaobao
and CNA data included in previous releases were found to contain non-normalized
full-width characters. Those files have been normalized to correct that issue.
Data
Each data file name consists of the 7-character prefix (e.g., xin_cmn)
and an underscore character (_) followed by a 6-digit date
(representing the year and month during which the file contents were
originally published by the respective news source), followed by a
.gz file extension, indicating that the file contents have been
compressed using the GNU gzip compression utility (RFC 1952). So,
each file contains all the usable data received by LDC for the given
month from the given news source.
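The naming scheme described above can be sketched as a small parser. This is only an illustration, not part of the corpus tooling: the function and class names are hypothetical, and the sketch assumes the 6-digit date is ordered year first, then month (YYYYMM).

```python
import re
from typing import NamedTuple

# The eight source prefixes listed in the introduction.
SOURCES = {
    "afp_cmn", "cna_cmn", "cns_cmn", "gmw_cmn",
    "pda_cmn", "pla_cmn", "xin_cmn", "zbn_cmn",
}

# Prefix, underscore, 6-digit date (assumed YYYYMM), then ".gz".
FILE_NAME_RE = re.compile(r"^([a-z]{3}_cmn)_(\d{4})(\d{2})\.gz$")


class DataFileName(NamedTuple):
    """Hypothetical container for the parts of a data file name."""
    source: str
    year: int
    month: int


def parse_data_file_name(name: str) -> DataFileName:
    """Split a data file name into source prefix, year, and month."""
    match = FILE_NAME_RE.match(name)
    if match is None or match.group(1) not in SOURCES:
        raise ValueError(f"not a recognized data file name: {name!r}")
    year, month = int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in file name: {name!r}")
    return DataFileName(match.group(1), year, month)
```

For example, parse_data_file_name("xin_cmn_201012.gz") yields the source xin_cmn with year 2010 and month 12, while a name with an unknown prefix or an impossible month raises ValueError.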
All text data are presented in SGML form, using a very simple, minimal
markup structure. The file gigaword_c.dtd in the docs directory
provides the formal Document Type Declaration for parsing the SGML
content. The corpus has been fully validated by a standard SGML
parser utility (nsgmls), using this DTD file.
For this release, all sources have received a uniform treatment in
terms of quality control, and we have applied a rudimentary (and
_approximate_) categorization of DOC units into four distinct types.
The classification is indicated by the type=string attribute
that is included in each opening DOC tag. The four types are:
- story: This is by far the most frequent type, and it represents the
most typical newswire item: a coherent report on a particular topic
or event, consisting of paragraphs and full sentences.
- multi: This type of DOC contains a series of unrelated "blurbs",
each of which briefly describes a particular topic or event; this is
typically applied to DOCs that contain "summaries of today's news",
"news briefs in ... (some general area like finance or sports)", and
so on.
- advis: (short for "advisory") These are DOCs which the news service
addresses to news editors -- they are not intended for publication
to the end users (the populations who read the news). We also find
a lot of formulaic, repetitive content in DOCs of this type (contact
phone numbers, etc.).
- other: This represents DOCs that clearly do not fall into any of
the above types -- in general, items of this type are intended for
broad circulation (they are not advisories), they may be topically
coherent (unlike multi type DOCs), and they typically do not
contain paragraphs or sentences (they aren't really stories);
these are things like lists of sports scores, stock prices,
temperatures around the world, and so on.
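As a rough illustration of how the type attribute might be used, the sketch below tallies DOC units by type in one compressed data file. It is not LDC tooling: the function names are hypothetical, and it assumes the files are UTF-8 encoded, that each opening DOC tag sits on a single line, and that the attribute is written with double quotes (type="story"). A regular expression stands in for a real SGML parser here.

```python
import gzip
import re
from collections import Counter
from typing import Iterable

# Matches an opening DOC tag and captures the value of its type attribute.
# Assumption: the attribute value is enclosed in double quotes.
DOC_TYPE_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"')


def count_doc_types(lines: Iterable[str]) -> Counter:
    """Count opening DOC tags by type attribute across the given lines."""
    counts: Counter = Counter()
    for line in lines:
        match = DOC_TYPE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_doc_types_in_file(path: str, encoding: str = "utf-8") -> Counter:
    """Count DOC types in one gzip-compressed data file.

    The default encoding is an assumption, not something stated above.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return count_doc_types(handle)
```

Filtering a file down to story DOCs before further processing follows the same pattern: test the captured type value and skip units of the other three types.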
Updates
None at this time.
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect
the position or policy of the Government, and no official endorsement should be inferred.
Content Copyright
Portions © 2000-2010 Agence France Presse, © 1991-2010 Central News
Agency (Taiwan), © 2006-2010 China Military Online, © 2006-2010 Chinanews.com,
© 2006-2010 Guangming Daily, © 2006-2010 People's Daily, © 1998,
2000-2003 SPH AsiaOne, Ltd., © 1990-2010 Xinhua News Agency, © 2003,
2005, 2007, 2009, 2011 Trustees of the University of Pennsylvania