Introduction
This file contains documentation on the Korean English Treebank Annotations,
Linguistic Data Consortium (LDC) catalog number LDC2002T26 and ISBN
1-58563-236-8.
This corpus consists of 33 texts originally written in Korean and
translated into English for the purpose of language training in a military
setting. The conversations are not authentic dialogues but were
constructed for pedagogical purposes. The texts were made available for linguistic research by the
Defense Language Institute (DLI). They were delivered on paper to the
Institute for Research in Cognitive Science (IRCS) at the University of
Pennsylvania, where they were converted to digital form using the KSC
5601 character set encoding (also known as KS X 1001 Wansung).
Both the Korean and English texts are presented with complete
Treebank annotation, which was done manually at IRCS and includes
syntactic constituent bracketing and part-of-speech (POS) tagging.
Further documentation about the parsing and POS specifications used in
these annotations can be found on the Korean NLP web site.
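For readers working with the Korean files on a modern system, the
following is a minimal Python sketch for decoding the KSC 5601 text.
It assumes the files use the EUC-KR byte serialization of KS X 1001,
which Python exposes as the "euc_kr" codec; the function names and any
file paths passed to them are hypothetical, not part of this release.

```python
from pathlib import Path


def read_ksc5601(path):
    """Read a file as text, assuming EUC-KR (KS X 1001 Wansung) bytes."""
    return Path(path).read_text(encoding="euc_kr")


def convert_to_utf8(src, dst):
    """Write a UTF-8 copy of an EUC-KR file; the source is left untouched."""
    Path(dst).write_text(read_ksc5601(src), encoding="utf-8")
```

If decoding raises UnicodeDecodeError, the assumption about the byte
encoding does not hold for that file and should be checked first.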
Data
There are 66 data files: 33 for Korean and 33 for English. The text
files mostly contain sets of question-and-answer sentences. For each
sentence, the full, unannotated text is presented first, on a single
line that begins with a semicolon character (";"). The first token on
such a line (the string preceding the first space character on the
line) is a sentence-identifier tag shared by the English and Korean
versions of the sentence, so the two versions can be matched. The
parsed and POS-tagged annotation of the sentence follows on subsequent
lines.
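The layout described above can be read with a short script. The
following Python sketch is illustrative only: it assumes the sentence
identifier is attached directly to the leading semicolon (any leading
";" characters are stripped from the first token), and that every line
up to the next semicolon line belongs to the annotation. The function
names and the sample lines are invented for this example and are not
taken from the corpus.

```python
def parse_treebank_lines(lines):
    """Return a list of (sentence_id, raw_sentence, annotation) tuples."""
    entries = []
    sent_id = raw = None
    annotation = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(";"):
            # A semicolon line starts a new sentence; store the previous one.
            if sent_id is not None:
                entries.append((sent_id, raw, "\n".join(annotation)))
            first_token, _, rest = line.partition(" ")
            sent_id = first_token.lstrip(";")  # assumption: ID follows ";"
            raw = rest.strip()
            annotation = []
        elif sent_id is not None and line.strip():
            # Non-empty lines after a semicolon line are annotation.
            annotation.append(line)
    if sent_id is not None:
        entries.append((sent_id, raw, "\n".join(annotation)))
    return entries


def pair_by_id(korean_entries, english_entries):
    """Match Korean and English raw sentences on their shared identifier."""
    english = {sid: raw for sid, raw, _ in english_entries}
    return [(sid, raw, english[sid])
            for sid, raw, _ in korean_entries if sid in english]


# Invented sample lines (hypothetical IDs and bracketing, not corpus data).
sample = [
    ";S001 Where is the office?\n",
    "(S (WHADVP Where) (VP is (NP the office)) ?)\n",
    ";S002 It is upstairs.\n",
    "(S (NP It)\n",
    "   (VP is (ADVP upstairs)) .)\n",
]
entries = parse_treebank_lines(sample)
```

Keeping the annotation as an unparsed string leaves the choice of
bracket parser open; the actual tag set and bracketing conventions are
defined in the documentation referenced in the Introduction.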
Updates
There are no updates at this time.
Content Copyright
Portions (c) 2001-2002 CoGenTex, Inc., Trustees of the University
of Pennsylvania