Introduction
Chinese Treebank 7.0, Linguistic Data Consortium (LDC) catalog number LDC2010T07
and ISBN 1-58563-542-1, consists of over one million words of annotated and
parsed text from Chinese newswire, magazine news, various broadcast news and
broadcast conversation programs, web newsgroups and weblogs.
The Chinese Treebank project began at the University of Pennsylvania in 1998,
continued at the University of Colorado and is now at Brandeis
University. The project's goal is to provide a large, part-of-speech tagged
and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank
1.0, contained 100,000 syntactically annotated words from Xinhua News Agency
(Xinhua) newswire. It was later corrected and released in 2001 as Chinese
Treebank 2.0 (LDC2001T11), which also consisted of approximately 100,000 words.
The LDC released Chinese
Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000
words, in 2004. A year later, the LDC published the 500,000-word Chinese
Treebank 5.0 (LDC2005T01). Chinese
Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words.
Chinese Treebank 7.0 adds new annotated newswire data, broadcast material and
web text to this effort.
Data
This release consists of 2,448 text files, 51,447 sentences, 1,196,329 words
and 1,931,381 hanzi (Chinese characters). The data is provided in UTF-8 encoding
and the annotation uses Penn Treebank-style labeled brackets. Details of the
annotation standard can be found in the enclosed segmentation, POS-tagging and
bracketing guidelines. The data is provided in four formats: raw text,
word-segmented, word-segmented and POS-tagged, and syntactically bracketed.
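As an illustration of how Penn Treebank-style labeled brackets relate to the other three formats, the following Python sketch parses one bracketed tree and derives the POS-tagged, word-segmented and raw views from it. It is a minimal sketch, not the tooling shipped with the corpus: the function names `parse_bracketed` and `tagged_words` are hypothetical, and the sample sentence is invented for illustration rather than taken from the release. Consult the enclosed guidelines for the actual file layout and tag set.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of
# non-space, non-bracket characters (a label or a word).
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_bracketed(text):
    """Parse one labeled-bracket tree into nested (label, children) tuples."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Some treebank files wrap each tree in an unlabeled outer bracket.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def tagged_words(tree):
    """Return (word, POS tag) pairs from the leaves, in sentence order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


# Invented example sentence (an assumption, not corpus data).
sample = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
pairs = tagged_words(parse_bracketed(sample))
segmented = " ".join(word for word, _ in pairs)  # word-segmented view
raw = "".join(word for word, _ in pairs)         # raw-text view
```

Here `pairs` holds the word and tag for each leaf, `segmented` joins the words with spaces, and `raw` drops the spaces, which mirrors how each simpler format can be recovered from the bracketed one.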
Chinese Treebank 7.0 includes text from the following genres and sources:
| Genre | # words |
| Newswire (Agence France Presse, China News Service, Guangming Daily, People's Daily, Xinhua News Agency) | 260,164 |
| News Magazine (Sinorama) | 256,305 |
| Broadcast News (China Broadcasting System, China Central TV, China National Radio, China Television System, New Tang Dynasty TV, Phoenix TV, Voice of America) | 287,442 |
| Broadcast Conversation (Anhui TV, China Central TV, CNN, MSNBC, New Tang Dynasty TV, Phoenix TV) | 184,161 |
| Newsgroups, Weblogs | 208,257 |
| Total | 1,196,329 |
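The figures above can be cross-checked with a few lines of Python. This is only an arithmetic sketch using the counts stated on this page; the variable and function names are hypothetical.

```python
# Word counts per genre, copied from the genre table on this page.
GENRE_WORDS = {
    "Newswire": 260_164,
    "News Magazine": 256_305,
    "Broadcast News": 287_442,
    "Broadcast Conversation": 184_161,
    "Newsgroups, Weblogs": 208_257,
}

# Release-level totals stated in the Data section.
TOTAL_WORDS = 1_196_329
SENTENCES = 51_447
HANZI = 1_931_381


def genre_shares(counts):
    """Return each genre's fraction of the summed word count."""
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


# The per-genre counts should add up to the stated total.
assert sum(GENRE_WORDS.values()) == TOTAL_WORDS

words_per_sentence = TOTAL_WORDS / SENTENCES
hanzi_per_word = HANZI / TOTAL_WORDS

for genre, share in genre_shares(GENRE_WORDS).items():
    print(f"{genre}: {share:.1%}")
print(f"words per sentence: {words_per_sentence:.2f}")
print(f"hanzi per word: {hanzi_per_word:.2f}")
```

The genre counts sum exactly to the stated total, and the release averages roughly 23 words per sentence and about 1.6 hanzi per word.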
Sponsorship
This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-0022. The content of this publication
does not necessarily reflect the position or the policy of the Government, and
no official endorsement should be inferred.
Samples
For an example of the data in this corpus, please review the
sample file.
Updates
No updates have been issued as of this time.
Content Copyright
Portions © 2006 Agence France Presse, © 2006 Anhui TV, © 2005
Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, ©
2000-2001, 2005-2006 China Central TV, © 2000-2001 China National Radio,
© 2006 Chinanews.com, © 2000-2001 China Television System, ©
2006 Guangming Daily, © 2006 National Broadcasting Company, Inc., ©
2006 New Tang Dynasty TV, © 2006 People's Daily Online, © 2005-2006
Phoenix TV, © 1999-2001 Sinorama Magazine, © 1996-1998, 2006 Xinhua
News Agency, © 2001, 2004, 2005, 2007, 2009, 2010 Trustees of the University
of Pennsylvania
Contact: ldc@ldc.upenn.edu
© 2010 Linguistic Data Consortium, Trustees of the University of Pennsylvania.
All Rights Reserved.