The Chinese Treebank 2.0 was produced by:
Principal Investigators:
Martha Palmer, Mitch Marcus, Tony Kroch
Consultants:
Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevych
Project Managers and Guideline Designers:
Fei Xia, Nianwen Xue
Annotators:
Fu-Dong Chiou, Nianwen Xue
Programming support:
Zhibiao Wu
Introduction
Chinese Treebank 2.0 is published by the Linguistic Data Consortium (LDC)
under catalog number LDC2001T11 and ISBN 1-58563-204-X.
The Chinese Penn Treebank Project started in the summer of 1998. Its goal is
to create a 100,000-word corpus of Chinese with syntactic bracketing. More
information is available from the Chinese Treebank Project. Chinese
Treebank 2.0 supersedes and replaces the Chinese Penn Treebank Final Release
(LDC2000T48, ISBN 1-58563-187-6).
Data
| Size: | About 100K words, 325 data files |
| Source: | 325 articles from Xinhua newswire between 1994 and 1998 |
| Coding: | GB code |
| Format: | Same as the UPenn English Treebank, except that some original file information, such as "SRCID" and "DATE", is retained in the data files. |
| Annotation: | All files are annotated at least twice: the first pass is done by one annotator, and the resulting files are checked by a second annotator (second pass). |
| SGML: | All data files validate against chtb.dtd using nsgmls. |
The files are located in the data subdirectory and are named sequentially as
chtb_nnn.fid, where nnn is the sequential file number. A cross-reference in
file.tbl provides some annotator and historical information.
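The naming scheme above can be sketched in Python as follows. This is only an illustration under stated assumptions: that nnn is a zero-padded three-digit number (for example 001), and that "GB code" corresponds to Python's gb2312 codec. The helper names fid_name, fid_number, and read_fid are hypothetical and are not part of the release.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: "nnn" is a zero-padded, three-digit sequence number (e.g. 001).
_FID_NAME = re.compile(r"chtb_(\d{3})\.fid")


def fid_name(number: int) -> str:
    """Build a data file name such as chtb_001.fid from a sequence number."""
    if not 0 <= number <= 999:
        raise ValueError("sequence number must fit in three digits")
    return f"chtb_{number:03d}.fid"


def fid_number(name: str) -> Optional[int]:
    """Return the sequence number in a file name, or None if the name does not match."""
    match = _FID_NAME.fullmatch(name)
    return int(match.group(1)) if match else None


def read_fid(data_dir: str, number: int) -> str:
    """Read one data file as text.

    Assumption: the "GB code" of the release can be decoded with Python's
    gb2312 codec; a different GB-family codec may be needed in practice.
    """
    path = Path(data_dir) / fid_name(number)
    return path.read_text(encoding="gb2312")
```

For instance, fid_name(12) gives the name of the twelfth file, and fid_number can be used to skip entries such as file.tbl when walking the directory.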
More extensive documentation, including samples of the annotated
data, can be found at
http://www.cis.upenn.edu/~chinese.
Copyright
Portions © 1994-1998 Xinhua News Agency, © 2001 Trustees of the University of Pennsylvania