Introduction
Tagged Chinese Gigaword Version 2.0, created by scholars at
Academia Sinica, Taipei, Taiwan, is a part-of-speech tagged version of LDC's
Chinese Gigaword Second Edition (LDC2005T14). Like the original release, Version
2.0 contains all of the data in Chinese Gigaword Second Edition -- from Central
News Agency, Xinhua News Agency and Lianhe Zaobao -- annotated with full
part-of-speech tags. In addition, this new release removes residual noise from the
original and improves tagging accuracy by incorporating lexica of unknown words.
The changes represented in Version 2.0 include the following:
- A single-width space is used consistently between two segmented words.
- The position of the newline character remains fixed, better reflecting the
source files from Chinese Gigaword Second Edition (LDC2005T14).
- The original coding of partial Latin letters or Arabic numerals is preserved.
- 1,192 documents from Central News Agency (Taiwan) and 13 documents from
Xinhua News Agency that were missing from the first publication are included.
- A set of heuristics for building out-of-vocabulary dictionaries to improve
annotation quality of very large corpora is incorporated.
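Because a single space now separates segmented words consistently, a tagged line can be split with ordinary string operations. The sketch below is illustrative only. It assumes each token has the form `word(TAG)`, which this description does not specify, and both the function name `parse_tagged_line` and the sample line are hypothetical.

```python
import re

# ASSUMPTION: each token looks like word(TAG). The description above only
# guarantees that a single space separates segmented words.
TOKEN_RE = re.compile(r"^(?P<word>.+)\((?P<tag>[A-Za-z_0-9]+)\)$")


def parse_tagged_line(line):
    """Split one tagged line into (word, tag) pairs."""
    pairs = []
    for token in line.strip().split(" "):
        if not token:
            continue  # skip empty pieces left by stray spaces
        match = TOKEN_RE.match(token)
        if match is None:
            pairs.append((token, None))  # keep material without a tag as-is
        else:
            pairs.append((match.group("word"), match.group("tag")))
    return pairs


# Hypothetical sample line, not taken from the corpus.
sample = "我們(Nh) 是(SHI) 學生(Na)"
print(parse_tagged_line(sample))
```

Tokens that do not match the assumed pattern are returned with a tag of `None` rather than dropped, so nothing is lost if the real layout differs.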
Documents in the corpus were assigned one of the following categories:
- story: This type of DOC represents a coherent
report on a particular topic or event, consisting of paragraphs and full sentences.
- multi: This type of DOC contains a series
of unrelated "blurbs," each of which briefly describes a particular topic or
event; examples include "summaries of today's news," "news briefs in ..." (some
general area like finance or sports), and so on.
- advis: These are DOCs which the news service
addresses to news editors; they are not intended for publication to the "end
users."
- other: These DOCs clearly do not fall into
any of the above types; they include items such as lists of sports scores, stock
prices, temperatures around the world, and so on.
Data
Basic statistics of data from each source are summarized below.
| Source | No. Files | Compressed Size (MB) | Total Size (MB) | No. Words (thousands) | No. Documents |
|---|---|---|---|---|---|
| CNA_CMN | 168 | 1520 | 6136 | 501456 | 1769953 |
| XIN_CMN | 168 | 898 | 3755 | 311660 | 992261 |
| ZBN_CMN | 10 | 55 | 214 | 18632 | 41418 |
| TOTAL | 346 | 2473 | 10105 | 831748 | 2803632 |
The POS tags and their corresponding explanations are listed below:
| Tag | Explanation (Chinese) | Explanation (English) |
|---|---|---|
| A | 非謂形容詞 | Non-predicative Adjective |
| Caa | 對等連接詞,如:和、跟 | Conjunctive Conjunction |
| Cab | 連接詞,如:等等 | Conjunction, e.g. deng3deng3 |
| Cba | 連接詞,如:的話 | Conjunction, e.g. de5hua4 |
| Cbb | 關聯連接詞 | Correlative Conjunction |
| D | 副詞 | Adverb |
| Da | 數量副詞 | Quantitative Adverb |
| DE | 的, 之, 得, 地 | Particle DE and its functional equivalents |
| Dfa | 動詞前程度副詞 | Pre-verbal Adverb of Degree |
| Dfb | 動詞後程度副詞 | Post-verbal Adverb of Degree |
| Di | 時態標記 | Aspectual Adverb |
| Dk | 句副詞 | Sentential Adverb |
| FW | 外文標記 | Foreign Word |
| I | 感嘆詞 | Interjection |
| Na | 普通名詞 | Common Noun |
| Nb | 專有名稱 | Proper Noun |
| Nc | 地方詞 | Place Noun |
| Ncd | 位置詞 | Localizer |
| Nd | 時間詞 | Time Noun |
| Nep | 指代定詞 | Demonstrative Determinatives |
| Neqa | 數量定詞 | Quantitative Determinatives |
| Neqb | 後置數量定詞 | Post-quantitative Determinatives |
| Nes | 特指定詞 | Specific Determinatives |
| Neu | 數詞定詞 | Numeral Determinatives |
| Nf | 量詞 | Measure |
| Ng | 後置詞 | Postposition |
| Nh | 代名詞 | Pronoun |
| P | 介詞 | Preposition |
| SHI | 是 | shi4 (to be) |
| T | 語助詞 | Particle |
| VA | 動作不及物動詞 | Active Intransitive Verb |
| VAC | 動作使動動詞 | Active Causative Verb |
| VB | 動作類及物動詞 | Active Pseudo-transitive Verb |
| VC | 動作及物動詞 | Active Transitive Verb |
| VCL | 動作接地方賓語動詞 | Active Verb with a Locative Object |
| VD | 雙賓動詞 | Ditransitive Verb |
| VE | 動作句賓動詞 | Active Verb with a Sentential Object |
| VF | 動作謂賓動詞 | Active Verb with a Verbal Object |
| VG | 分類動詞 | Classificatory Verb |
| VH | 狀態不及物動詞 | Stative Intransitive Verb |
| VHC | 狀態使動動詞 | Stative Causative Verb |
| VI | 狀態類及物動詞 | Stative Pseudo-transitive Verb |
| VJ | 狀態及物動詞 | Stative Transitive Verb |
| VK | 狀態句賓動詞 | Stative Verb with a Sentential Object |
| VL | 狀態謂賓動詞 | Stative Verb with a Verbal Object |
| V_2 | 有 | you3 (to have) |
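The tag names in the table follow a visible prefix pattern (verbal tags begin with V, nominal tags with N, adverbs with D, conjunctions with C), which makes it straightforward to collapse them into coarse classes when counting. The following sketch shows one way to do that. The class labels and the function name `coarse_class` are illustrative choices, not part of the release.

```python
# Tags handled by exact match because they do not follow the prefix pattern.
# The class labels are illustrative groupings, not defined by the corpus.
SPECIAL = {
    "A": "adjective",
    "DE": "particle",      # would otherwise fall under the D (adverb) prefix
    "FW": "foreign",
    "I": "interjection",
    "P": "preposition",
    "SHI": "verb",         # 是, grouped with verbs here by choice
    "T": "particle",
}

# First-letter fallbacks read off the tag table above.
PREFIX = {
    "C": "conjunction",
    "D": "adverb",
    "N": "nominal",        # nouns, determinatives, measures, pronouns, ...
    "V": "verb",           # includes V_2 (有)
}


def coarse_class(tag):
    """Map a fine-grained tag to a coarse class, or 'unknown'."""
    if tag in SPECIAL:
        return SPECIAL[tag]
    return PREFIX.get(tag[:1], "unknown")


for tag in ["Na", "VHC", "DE", "Dfa", "V_2"]:
    print(tag, coarse_class(tag))
```

Checking the exact-match table first keeps `DE` from being grouped with the adverbs that share its first letter.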
Since neither manual checking nor automatic checking against a gold standard is
feasible for gigaword-size corpora, the authors proposed a quality assurance method
for automatic annotation of very large corpora based on the heterogeneous CKIP and
ICTCLAS tagging systems (Huang et al., 2008). By comparing against word lists
generated from the ICTCLAS-tagged version of the Xinhua portion of Chinese Gigaword,
they derived a set of heuristics for building out-of-vocabulary dictionaries to
improve annotation quality. Randomly selected texts were then manually checked to
evaluate the effect of these out-of-vocabulary dictionaries. Experimental results
indicate that 30,562 of the tested words (about 97.3%) were correct. The quality
control test results follow:
| Corpora | Thousands of Words | No. Test Words | No. Correct Words |
|---|---|---|---|
| CNA | 501459 | 42,695 | 41,449 |
| XIN | 311718 | 28,744 | 27,967 |
| ZBN | 18632 | 22,825 | 22,270 |
| Total | 831809 | 31,421 | 30,562 |
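The accuracy implied by each row of the table is simply correct words divided by test words. The short sketch below computes it from the figures exactly as listed; the names `results` and `accuracy` are illustrative.

```python
# (test words, correct words) per row, copied from the table above.
results = {
    "CNA": (42695, 41449),
    "XIN": (28744, 27967),
    "ZBN": (22825, 22270),
    "Total": (31421, 30562),
}


def accuracy(tested, correct):
    """Return the share of tested words judged correct, as a percentage."""
    return 100.0 * correct / tested


for source, (tested, correct) in results.items():
    print(f"{source}: {accuracy(tested, correct):.1f}%")
```

The Total row yields about 97.3%, matching the figure quoted in the text.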
Samples
For an example of the data in this publication, please examine the following screen capture of a tagged file.
Content Copyright
Portions © 2005-2009 Academia Sinica, © 1991-1994 Central News Agency
(Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency,
© 2005, 2007, 2009 Trustees of the University of Pennsylvania