Introduction
Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan,
is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition
LDC2005T14. It contains all of the data in Chinese Gigaword Second Edition --
from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated
with full part of speech tags.
In order to avoid any problems or confusion that could result from differences
in character-set specifications in the source data, all text files in this corpus
have been converted to UTF-8 character encoding. With some exceptions described
in the readme file, all characters in the text are either single-byte ASCII
or multi-byte Chinese.
All sources have been categorized into four distinct "types":
- story: This type of DOC represents a coherent
report on a particular topic or event, consisting of paragraphs and full sentences.
- multi: This type of DOC contains a series
of unrelated "blurbs," each of which briefly describes a particular topic or
event; examples include "summaries of today's news," "news briefs in ..." (some
general area like finance or sports), and so on.
- advis: These are DOCs which the news service
addresses to news editors; they are not intended for publication to the "end
users."
- other: These DOCs clearly do not fall into
any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
Data
The table below lists the number files, their compressed and uncompressed size, number of words and number of documents divided by source. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-words = number of words in thousands. #DOCs = number of documents.
|
Source | #Files | Rzip-MB | Totl-MB | K-wrds | #DOCs |
|
CNA_CMN | 168 | 994 | 7363 | 792195 | 1769953 |
|
XIN_CMN | 168 | 615 | 4535 | 471110 | 992261 |
|
ZBN_CMN | 10 | 40 | 223 | 28066 | 41418 |
|
TOTAL | 346 | 1648 | 12121 | 1291371 | 2803632 |
The following tables present the quantity of "K-wrds" and "#DOCS", divided by source and DOC type:
|
#DOCs | K-wrds |
|
type="advis": |
|
CNA_CMN | 8160 | 751 |
|
XIN_CMN | 6553 | 711 |
|
ZBN_CMN | 0 | 0 |
|
TOTAL | 14713 | 1462 |
|
type="multi": |
|
CNA_CMN | 30552 | 23429 |
|
XIN_CMN | 11329 | 7516 |
|
ZBN_CMN | 55 | 41 |
|
TOTAL | 41936 | 30986 |
|
type="other": |
|
CNA_CMN | 100758 | 40258 |
|
XIN_CMN | 31255 | 9999 |
|
ZBN_CMN | 279 | 130 |
|
TOTAL | 132292 | 50387 |
|
type="story": |
|
CNA_CMN | 1630483 | 727748 |
|
XIN_CMN | 943132 | 452878 |
|
ZBN_CMN | 41084 | 27898 |
|
TOTAL | 2614691 | 1208524 |
The performance of CKIP Segmentation and POS tagging system has been tested in Bakeoff 2005 and Bakeoff 2006.
The test result is shown as follows:
|
Doc# | RefWord# | TestWord# | MatchWord# | Recall (%) | Precision (%) | F-Score (%) |
|
|
|
Bakeoff 2005 | 190 | 116509 | 116443 | 112091 | 96.2 | 96.3 | 96.2 |
|
Bakeoff 2006 | 148 | 90405 | 90327 | 87332 | 96.6 | 96.7 | 96.6 |
Note:
Recall=MatchWord# / RefWord#
Precision=MatchWord# / TestWord#
F-Score=2 * Recall * Precision / (Recall + Precision)
Samples
For an example of the data contained in this corpus, please view this screen capture(jpg) of the annotated text.
Content Copyright
Portions © 2005-2007 Academia Sinica, © 1991-2004 Central News Agency
(Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency,
© 2005, 2007 Trustees of the University of Pennsylvania |