Introduction
Prague Czech-English Dependency Treebank 1.0 (PCEDT) was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T25, ISBN 1-58563-321-6.
This corpus was developed at the Center
for Computational Linguistics in cooperation with the Institute of Formal and Applied
Linguistics.
PCEDT 1.0 is a corpus of Czech-English parallel resources suitable for experiments in
machine translation, with a special emphasis on dependency-based
(structural) translation. Evaluation data are provided for
Czech-to-English systems.
Data
The core part of PCEDT 1.0 is a Czech translation of
21,600 English sentences from the Wall Street Journal, which are part of the
Penn Treebank corpus. Sentences of the Czech translation were automatically morphologically
annotated and parsed into two levels (analytical and tectogrammatical)
of dependency structures introduced in the theory of Functional
Generative Description and closely related to the Prague Dependency Treebank project.
The original English sentences were transformed from the Penn Treebank
phrase-structure trees into dependency representations.
A held-out (development and evaluation) set of 515 sentence pairs was selected and
manually annotated at the tectogrammatical level in both Czech and English;
for the purposes of quantitative evaluation, this set was
retranslated from Czech into English by four different translation
companies.
PCEDT 1.0 also contains a parallel Czech-English
corpus of plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel
sentences, and a large monolingual corpus of Czech (2.4 million sentences).
The included Czech-English translation dictionary consists of 46,150 translation
pairs in its lemmatized version and 496,673 pairs of word forms,
where for each entry-translation pair all corresponding word form
pairs have been generated. Also included is an English-Czech dictionary
provided by Milan Svoboda under the GNU/FDL license; this dictionary
contains multi-word translations in 115,929 translation pairs.
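The word-form expansion of the Czech-English dictionary described above can be illustrated with a minimal sketch. It assumes the simplest reading, in which every inflected form of the Czech lemma is paired with every inflected form of its English translation; the actual procedure may have restricted pairs to morphologically matching forms. The inflection tables and the function name expand_pair are hypothetical and are not part of PCEDT.

```python
from itertools import product

# Hypothetical toy inflection tables, for illustration only;
# they are not taken from the PCEDT dictionary files.
CZECH_FORMS = {"kniha": ["kniha", "knihy", "knize", "knihu"]}
ENGLISH_FORMS = {"book": ["book", "books"]}


def expand_pair(cz_lemma, en_lemma, cz_forms, en_forms):
    """Return every (Czech form, English form) pair for one lemma pair.

    Assumption: a full cross product of the two form lists. A lemma
    missing from a table falls back to the lemma itself as its only form.
    """
    cz = cz_forms.get(cz_lemma, [cz_lemma])
    en = en_forms.get(en_lemma, [en_lemma])
    return list(product(cz, en))


pairs = expand_pair("kniha", "book", CZECH_FORMS, ENGLISH_FORMS)
for cz_form, en_form in pairs:
    print(cz_form, en_form)
```

Under this assumption one lemma pair yields several word-form pairs, which is consistent with the form-level dictionary being much larger than the lemmatized one.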
The next version of PCEDT is intended to include a translation of the whole
Wall Street Journal part of the Penn Treebank, as well as
reference retranslations for Czech. As the manual for tectogrammatical
annotation of English is developed, the proportion of manually annotated data will increase.
Sponsorship
PCEDT 1.0 has been supported by the following grants and projects:
Updates
There are no updates available at this time.
Content Copyright
Portions © 2004 Trustees of the University of Pennsylvania,
© 1988-1989 Wall Street Journal, © 1993-1996 Reader's Digest,
© 1991-1995 Lidové noviny, © 2004 Milan Svoboda,
© 2002-2004 Center for Computational Linguistics, Charles
University in Prague