This publication contains the Chinese Penn Treebank Project Corpus Final Release, produced by:
Martha Palmer, Mitch Marcus, Tony Kroch
Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevyc
Project Managers and Guideline Designers:
Fei Xia, Nianwen Xue
Fu-Dong Chiou, Nianwen Xue
Published by the Linguistic Data Consortium (LDC), catalog number LDC2000T48, ISBN 1-58563-187-6. The Chinese Penn Treebank Project started in summer 1998 with the goal of creating a 100,000-word corpus of Chinese with syntactic bracketing. More information is available from the Chinese Treebank Project.
|Size:||About 100K words, 4185 sentences, 325 data files|
|Source:||325 articles from Xinhua newswire between 1994 and 1998|
|Format:||Same as the English Penn Treebank except that we keep the original file information such as "DOCNO" and "DATE" in the data file.|
|Annotation:||All files are annotated at least twice: the first pass is done by one annotator, and the resulting files are checked by a second annotator (second pass).|
|SGML:||All data files validate against chtb.dtd using nsgmls.|
The files are located in the data subdirectory and are named sequentially as chtb_nnn.fid, where nnn is the sequential file number. A cross-reference in filelist.tbl provides some annotator and historical information.
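The naming scheme above lends itself to a small helper for enumerating the data files in numeric order. The following is a minimal sketch, not part of the release: the function names `file_number` and `list_data_files` are hypothetical, the directory path is supplied by the caller, and the pattern assumes only that nnn is a run of decimal digits.

```python
import re
from pathlib import Path

# Pattern for the naming scheme chtb_nnn.fid.
# Assumption: nnn is one or more decimal digits; the width is not enforced.
FID_PATTERN = re.compile(r"^chtb_(\d+)\.fid$")


def file_number(name):
    """Return the sequential file number encoded in a file name, or None."""
    match = FID_PATTERN.match(name)
    return int(match.group(1)) if match else None


def list_data_files(data_dir):
    """Return (number, path) pairs for matching files, sorted by number."""
    pairs = []
    for path in Path(data_dir).iterdir():
        number = file_number(path.name)
        if number is not None:
            # Files that do not follow the scheme (e.g. filelist.tbl) are skipped.
            pairs.append((number, path))
    return sorted(pairs, key=lambda pair: pair[0])
```

Calling `list_data_files("data")` from the root of an unpacked copy would be the typical use, assuming the data subdirectory is laid out as described. Sorting on the parsed integer rather than the file name keeps the order correct even if the numeric part is not zero-padded.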
Portions Copyright © 1994-1998, Xinhua News Agency