Introduction
Chinese Dependency Treebank 1.0 was developed by the Harbin
Institute of Technology's Research
Center for Social Computing and Information Retrieval (HIT-SCIR). It contains
49,996 Chinese sentences (902,191 words) randomly selected from People's Daily
newswire stories published between 1992 and 1996 and annotated with syntactic
dependency structures.
Data
Ill-formed or short sentences were eliminated from the randomly selected sentences
prior to annotation. The data was segmented and annotated for part of speech
(POS), syntactic structures, verb subclasses and noun compounds. Word segmentation
and POS tagging were accomplished automatically using statistical models trained
on a larger, annotated corpus of People's Daily newswire stories. Humans manually
annotated the syntactic structures and corrected word segmentation errors. POS
tags were not corrected.
The data is provided in CoNLL-X format and encoded in UTF-8.
One line presents information for one word.
An empty line indicates the end of a sentence.
Each line contains 10 columns separated by tabs.
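The layout described above can be read with a few lines of Python. The following is a minimal sketch, not a tool shipped with the corpus. It assumes the ten columns follow the standard CoNLL-X order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), which this description does not spell out. The names Token, parse_conllx and SAMPLE are hypothetical, and the sample rows are invented for illustration rather than taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One word line, assuming the standard CoNLL-X column order."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # 0 conventionally marks the root in CoNLL-X
    deprel: str
    phead: str
    pdeprel: str


def parse_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                  cols[5], int(cols[6]), cols[7], cols[8], cols[9])
        )
    # The last sentence may not be followed by an empty line.
    if sentence:
        yield sentence


# Invented rows for illustration only; tags and labels are placeholders.
SAMPLE = (
    "1\t我\t我\tr\tr\t_\t2\tSBV\t_\t_\n"
    "2\t爱\t爱\tv\tv\t_\t0\tHED\t_\t_\n"
    "3\t北京\t北京\tns\tns\t_\t2\tVOB\t_\t_\n"
    "\n"
    "1\t好\t好\ta\ta\t_\t0\tHED\t_\t_\n"
)

sentences = list(parse_conllx(SAMPLE.splitlines()))
```

To read an actual data file, pass an open file object instead of the sample string, opening it with encoding="utf-8" to match the encoding stated above.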
Samples
Please follow this link for a sample of the data.
Updates
None at this time.
Content Copyright
Portions © 1992-1996 People's Daily, © 2012 Harbin Institute of
Technology, Research Center for Social Computing and Information Retrieval,
© 2012 Trustees of the University of Pennsylvania