Chinese Penn Treebank Project Corpus


Introduction

This publication contains the Chinese Penn Treebank Project Corpus Final Release, produced by:

Principal Investigators:
   Martha Palmer, Mitch Marcus, Tony Kroch

Consultants:
   Martha Palmer, Mitch Marcus,
   Tony Kroch, Shizhe Huang,
   Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevych

Project Managers and Guideline Designers:
   Fei Xia, Nianwen Xue

Annotators:
   Fu-Dong Chiou, Nianwen Xue

Programming support:
   Zhibiao Wu

Published by the Linguistic Data Consortium (LDC), catalog number LDC2000T48, ISBN 1-58563-187-6. The Chinese Penn Treebank Project started in summer 1998 with the goal of creating a 100,000-word corpus of Chinese with syntactic bracketing. More information is available from the Chinese Treebank Project.

Data

Size: About 100K words, 4185 sentences, 325 data files
Source: 325 articles from Xinhua newswire between 1994 and 1998
Coding: GB code
Format: Same as the English Penn Treebank, except that original file information such as "DOCNO" and "DATE" is kept in each data file.
Annotation: Every file is annotated at least twice: one annotator does the first pass, and a second annotator checks the resulting files (second pass).
SGML: All data files validate against chtb.dtd using nsgmls.
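
As a rough illustration of the coding and format notes above, the following Python sketch decodes GB-encoded bytes and reads Penn-Treebank-style bracketing into nested lists. It rests on several assumptions that this document does not confirm: that "GB code" means GB2312 (decoded here with the gb18030 codec, which is a superset), that the SGML markup lines have already been stripped from the text, and that the sample sentence and its labels are purely illustrative. The function names decode_gb, parse_bracketed and leaves are hypothetical helpers, not part of the release.

```python
import re

def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes (assumption: GB2312; gb18030 is a superset)."""
    return raw.decode("gb18030")

# Tokens are "(", ")", or any run of characters without whitespace or brackets.
_TOKEN = re.compile(r"\(|\)|[^\s()]+")

def parse_bracketed(text: str) -> list:
    """Parse bracketed trees into nested lists, one list per bracket pair."""
    stack = [[]]
    for tok in _TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]

def leaves(node: list) -> list:
    """Collect the words under a node; a [tag, word] pair is a preterminal."""
    if len(node) == 2 and all(isinstance(x, str) for x in node):
        return [node[1]]
    words = []
    for child in node:
        if isinstance(child, list):
            words.extend(leaves(child))
    return words

# Illustrative sample only; not taken from the corpus.
sample_bytes = "(IP (NP (NN 中国)) (VP (VV 发展)))".encode("gb2312")
trees = parse_bracketed(decode_gb(sample_bytes))
words = leaves(trees[0])
```

Nested lists keep the sketch dependency-free; a real reader would also need to handle the SGML wrapper and any empty outer brackets that the data files may contain.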

The files are located in the data subdirectory and are named sequentially as chtb_nnn.fid, where nnn is the file number. The cross-reference file filelist.tbl provides some annotator and historical information.
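
The naming scheme can be sketched in Python as below. This assumes that nnn is a zero-padded, three-digit number and that numbering runs from 1 to 325; the document states neither explicitly. The helper names chtb_filename and parse_file_number are hypothetical.

```python
import re

# Assumption: "nnn" is exactly three digits, zero-padded.
_NAME = re.compile(r"^chtb_(\d{3})\.fid$")

def chtb_filename(number: int) -> str:
    """Build a data file name such as chtb_007.fid from a file number."""
    if not 0 <= number <= 999:
        raise ValueError("file number must fit in three digits")
    return f"chtb_{number:03d}.fid"

def parse_file_number(name: str) -> int:
    """Extract the file number from a name of the form chtb_nnn.fid."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a chtb data file name: {name!r}")
    return int(match.group(1))

# Assumption: the 325 files are numbered 1 through 325.
expected_names = [chtb_filename(n) for n in range(1, 326)]
```

A list like expected_names could be compared against the contents of the data subdirectory to spot missing files, if the numbering assumption holds.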

Copyright

Portions Copyright © 1994-1998, Xinhua News Agency