This publication contains the Chinese Penn Treebank Project Corpus Final Release, produced by:
Principal Investigators:
Martha Palmer, Mitch Marcus, Tony Kroch
Consultants:
Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevych
Project Managers and Guideline Designers:
Fei Xia, Nianwen Xue
Annotators:
Fu-Dong Chiou, Nianwen Xue
Programming support:
Zhibiao Wu
Published by the Linguistic Data Consortium (LDC), catalog number LDC2000T48, ISBN 1-58563-187-6. The Chinese Penn Treebank Project started in summer 1998 with the goal of creating a 100,000-word corpus of Chinese with syntactic bracketing. More information is available from the Chinese Treebank Project.
| Size: | About 100K words, 4185 sentences, 325 data files |
| Source: | 325 articles from Xinhua newswire between 1994 and 1998 |
| Coding: | GB code |
| Format: | Same as the English Penn Treebank except that we keep the original file information such as "DOCNO" and "DATE" in the data file. |
| Annotation: | Every file is annotated at least twice: one annotator produces the first pass, and a second annotator checks the resulting file (second pass). |
| SGML: | All data files validate against chtb.dtd using nsgmls. |
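As a rough illustration of the Coding and Format rows, the following Python sketch decodes GB-coded bytes and pulls out the "DOCNO" and "DATE" fields. It rests on several assumptions that this document does not confirm: that "GB code" can be decoded with Python's `gb18030` codec (a superset of GB2312), that the fields appear as `<DOCNO>...</DOCNO>` and `<DATE>...</DATE>` elements, and that the sample text resembles a real file. The function name `read_header_fields` and the sample document are invented for the example.

```python
import re


def read_header_fields(raw: bytes, encoding: str = "gb18030") -> dict:
    """Decode GB-coded bytes and extract DOCNO and DATE, if present.

    Assumption: fields are SGML-style elements such as <DOCNO> ... </DOCNO>.
    Check chtb.dtd and a real data file before relying on this layout.
    """
    text = raw.decode(encoding, errors="replace")
    fields = {}
    for tag in ("DOCNO", "DATE"):
        match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, flags=re.DOTALL)
        if match:
            fields[tag] = match.group(1)
    return fields


# Invented sample, NOT taken from the corpus; it only mimics the assumed layout.
sample = (
    "<DOC>\n"
    "<DOCNO> EXAMPLE-0001 </DOCNO>\n"
    "<DATE> 1994-01-01 </DATE>\n"
    "<S>\n"
    "( (IP (NP (NN 中国)) (VP (VV 发展))) )\n"
    "</S>\n"
    "</DOC>\n"
).encode("gb2312")

print(read_header_fields(sample))
```

To read an actual data file, pass `Path(...).read_bytes()` to the function rather than opening the file in text mode, so that the decoding step stays explicit.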
The files are located in the data subdirectory and are named sequentially as chtb_nnn.fid, where nnn is the file number. A cross-reference file, filelist.tbl, provides some annotator and historical information.
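The naming scheme above can be turned into a small file lister. This is a minimal Python sketch that assumes nnn is exactly three digits (consistent with 325 files, but not stated explicitly here); the helper names `file_number` and `list_treebank_files` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: "nnn" means exactly three decimal digits, e.g. chtb_001.fid.
FILE_PATTERN = re.compile(r"^chtb_(\d{3})\.fid$")


def file_number(name: str):
    """Return the sequential number encoded in a file name, or None."""
    match = FILE_PATTERN.match(name)
    return int(match.group(1)) if match else None


def list_treebank_files(data_dir):
    """Return the chtb_nnn.fid paths under data_dir, ordered by file number."""
    numbered = []
    for path in Path(data_dir).iterdir():
        number = file_number(path.name)
        if number is not None:
            numbered.append((number, path))
    return [path for _, path in sorted(numbered, key=lambda item: item[0])]
```

Calling `list_treebank_files("data")` would then return the data files in sequence order while skipping anything that does not match the pattern.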
Portions Copyright © 1994-1998, Xinhua News Agency