Introduction
The Prague Dependency Treebank Version 1.0:
- Morphologically and syntactically annotated Czech data, 1.8MW
- Czech-English parallel Corpus, aligned, 0.9MW/1MW
- Czech raw texts (newspaper and journals), over 30MW
- Czech NLP tools (morphology, tagging)
- General annotation tools (tree editors, tree viewer)
(Abridged from part of the paper: E. Hajičová, Dependency-Based
Underlying-Structure Tagging of a Very Large Czech Corpus)
Since a group of Czech linguists (Institute of Formal and Applied Linguistics,
Institute of Theoretical and Computational Linguistics) from Charles University
in Prague and Masaryk University in Brno first formulated the Czech National
Corpus project, it has been quite clear to all of us that for the
outcome of our project to have broader relevance and multifaceted usage, we
cannot confine ourselves to a mere compilation of a very large corpus of Czech
texts. We have been aware that in order to make the corpus really useful for
future users -- be they linguists or developers of natural language processing
systems of any kind -- we have to design annotation schemes and develop tools
that will allow us to add as much linguistic information as possible. Having
the advantage of a long and fruitful tradition of theoretical and computational
linguistics and inspired by the research resulting in the Penn Treebank, the
project group decided to build the Prague Dependency Treebank (PDT).
Data
The following three points are characteristic of the theory underlying the
PDT, fully visible at the highest, tectogrammatical level:
(i) Its theoretical background is a dependency-based syntax (handling
the sentence structure as concentrated around the verb and its valency, but
containing a further dimension, namely coordination). Among the reasons for the
choice of a dependency-based syntax, we primarily stress its
relative economy and perspicuous, immediate correspondence to the empirical
data.
(ii) The nodes of the dependency tree (more precisely, of a
multidimensional network) are labeled by complex symbols consisting of
lexical, morphological and syntactic parts. Thus, the label of every node
contains symbols expressing all of the information that is contained in the
grammatical position of the word and is relevant for a semantic
(semantico-pragmatic) interpretation. This makes the output representations, or
the trees of our treebank, useful not only for practical applications such as
parsing but also for inclusion in an integrated theoretical description
encompassing all layers from the outer (phonetic or graphemic) shape of the
sentence to its semantico-pragmatic representation, be it in the form of
truth-conditionally based intensional semantics or in that of a framework
paying more attention to the embedding of the sentence in context.
(iii) The dependency tree is understood as projective. Its
relationships to the morphemic representation of the sentence (a string of
symbols, the order of which corresponds to the surface word order) are handled
by means of specific rules.
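Point (iii) can be made concrete with a small check. The sketch below is illustrative only and is not one of the PDT tools: it assumes a sentence is given as a list of governor positions (a hypothetical encoding chosen for this example, not the FS or CSTS format) and tests the usual projectivity condition, namely that every word lying between a governor and its dependent is itself subordinate to that governor.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if the dependency tree is projective.

    heads[i] is the 1-based position of the governor of word i + 1;
    0 marks the root. This encoding is an assumption made for the example.
    """
    n = len(heads)

    def is_descendant(node: int, ancestor: int) -> bool:
        # Walk up the governor chain from node until we meet ancestor or the root.
        steps = 0
        while node != 0 and steps <= n:  # the step limit guards against cycles
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # the root has no incoming edge to check
        lo, hi = sorted((dep, head))
        # Every word strictly between governor and dependent must depend,
        # directly or transitively, on the governor.
        for between in range(lo + 1, hi):
            if not is_descendant(between, head):
                return False
    return True


# Word 2 is the root; words 1 and 4 depend on it, word 3 depends on word 4.
projective_example = is_projective([2, 0, 4, 2])
# Word 2 depends on word 4 across the root (word 3), which is not under word 4.
nonprojective_example = is_projective([3, 4, 0, 3])
print(projective_example, nonprojective_example)
```

A check of this kind only inspects the tree and the word order; the specific rules that relate a tree to the surface string are a separate matter and are not modeled here.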
Prague Dependency Treebank as a project
The Prague Dependency Treebank (PDT) is a long-term project with two major
phases. In the first phase (1996-2000), the morphological and analytic
(surface-syntactic) layers of annotation were completed and released as
PDT 1.0, together with a preview of the tectogrammatical layer annotation.
During the second phase (2000-2004, Center for Computational Linguistics),
annotation of the tectogrammatical layer will proceed, and PDT 2.0 will be made
available upon its completion.
The PDT is a corpus of Czech, a representative of inflectionally rich,
free word-order languages, annotated on three layers:
- Morphological layer (lowest): full morphological annotation
- Analytic layer (middle): superficial (surface) syntactic annotation using
  dependency syntax, at a level conceptually close to the syntactic annotation
  used in the Penn Treebank
- Tectogrammatical layer (highest): level of linguistic meaning
Text Sources
The electronic text sources have been provided by the Institute of the Czech
National Corpus. The text material contains samples from the following sources:
- Lidové noviny (daily newspapers), 1991, 1994, 1995
- Mladá fronta Dnes (daily newspapers), 1992
- Českomoravský Profit (business weekly), 1994
- Vesmír (scientific magazine), Academia Publishers, 1992, 1993
There is also a parallel Czech-English corpus. Drawn from Reader's Digest
1993-1996, it consists of 450 articles, 53,117 parallel sentences, 1,010,346
English tokens and 877,658 Czech tokens.
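To give a feel for the size figures quoted for the parallel corpus, the following short sketch derives a few averages from them. Only the four counts come from the text; the variable and function names are made up for this example.

```python
# Counts quoted in the text for the Reader's Digest parallel corpus.
ARTICLES = 450
SENTENCES = 53_117
ENGLISH_TOKENS = 1_010_346
CZECH_TOKENS = 877_658


def per_sentence(tokens: int, sentences: int = SENTENCES) -> float:
    """Average number of tokens per parallel sentence."""
    return tokens / sentences


english_per_sentence = per_sentence(ENGLISH_TOKENS)
czech_per_sentence = per_sentence(CZECH_TOKENS)
czech_to_english = CZECH_TOKENS / ENGLISH_TOKENS
sentences_per_article = SENTENCES / ARTICLES

print(f"English tokens per sentence: {english_per_sentence:.1f}")
print(f"Czech tokens per sentence:   {czech_per_sentence:.1f}")
print(f"Czech/English token ratio:   {czech_to_english:.3f}")
print(f"Sentences per article:       {sentences_per_article:.1f}")
```

On these counts the English side averages about 19 tokens per sentence and the Czech side about 16.5, so the Czech text has roughly 87% as many tokens as the English text.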
Internal format of PDT
There are two internal formats employed in PDT: FS and CSTS.
The former is an older format, still heavily used by some treebank tools. The
latter, a more general SGML-based encoding, is meant as the main PDT format (in
the future, it will be followed by an XML version, probably as early as
PDT 2.0). See the description of the FS file format and the documentation of
the CSTS document type definition (csts.dtd).
Prague Dependency Treebank Version 1.0
PDT 0.5 ("half through") was released in 1998 and contains 456,705 tokens
(words and punctuation) in 26,610 sentences. PDT 1.0 contains about three times
as many tokens and sentences as PDT 0.5. It is entirely manually annotated on
the morphological and analytic layers and also includes a preview of
tectogrammatically annotated data.
Future
The Prague Dependency Treebank Version 2.0 will add the
tectogrammatical layer of annotation to PDT 1.0. A preliminary "Version 1.5"
with a reduced amount of data will be available during 2002. The final
data volume will be reached at the end of 2004.
Support
The PDT 1.0 has been supported by the following grants and
projects
The PDT 2.0 will be supported by the project
Updates
There are no updates at this time.
Copyright
Portions Copyright 1993-1996, Reader's Digest
Portions Copyright 1991, 1994, 1995 Lidové noviny daily newspapers
Portions Copyright 1992, Mladá fronta Dnes daily newspapers
Portions Copyright 1994, Českomoravský Profit business weekly
Portions Copyright 1992-1993, Vesmír scientific magazine, Academia Publishers
Portions Copyright 1996-2001, Institute of Formal and Applied Linguistics and
Center for Computational Linguistics, Faculty of Mathematics and Physics,
Charles University