Prague Czech-English Dependency Treebank 2.0
| Item Name: | Prague Czech-English Dependency Treebank 2.0 |
| Authors: | Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, Zdeněk Žabokrtský |
| LDC Catalog No.: | LDC2012T08 |
| ISBN: | 1-58563-616-9 |
| Release Date: | Jun 15, 2012 |
| Data Type: | text |
| Data Source(s): | newswire |
| Application(s): | information extraction, information retrieval, language modeling, language teaching, machine translation, parsing, tagging |
| Language(s): | Czech, English |
| Language ID(s): | ces, eng |
| Distribution: | 1 DVD |
| Member fee: | $0 for 2012 members |
| Non-member Fee: | US $100.00 |
| Reduced-License Fee: | N/A |
| Extra-Copy Fee: | US $100.00 |
| Non-member License: | yes |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Jan Hajič, et al. 2012. Prague Czech-English Dependency Treebank 2.0. Linguistic Data Consortium, Philadelphia. |
Introduction
Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the
Institute of Formal and Applied Linguistics
at Charles University in Prague, Czech Republic.
It is a corpus of Czech-English parallel resources translated, aligned and manually
annotated for dependency structure, semantic labeling, argument structure,
ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0
(LDC2004T25) by adding English newswire texts so that it now contains over two million
words in close to 100,000 sentences.
Data
The principal new material in PCEDT 2.0 is the entire Wall
Street Journal data from Treebank-3 (LDC99T42).
Not included from PCEDT 1.0 are the Reader's Digest material, the Czech
monolingual corpus, and the English-Czech dictionary.
Each section is enhanced with a comprehensive manual linguistic annotation
in the Prague Dependency Treebank style (LDC2006T01,
Prague Dependency Treebank 2.0). The main features of this annotation style
are:
- dependency structure of the content words and coordinating and similar structures
(function words are attached as their attribute values)
- semantic labeling of content words and types of coordinating structures
- argument structure, including an argument structure ("valency") lexicon for both
languages
- ellipsis and anaphora resolution
This annotation style is called tectogrammatical annotation,
and it constitutes the tectogrammatical layer in the corpus.
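The features listed above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class `TectoNode`, its field names, and the label strings are hypothetical and do not reproduce the PCEDT file format or its annotation tools. It shows content words as nodes, function words stored as attribute values, a semantic label on each node, and an anaphora link.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TectoNode:
    """Hypothetical node of a tectogrammatical-style tree (not the PCEDT format)."""
    lemma: str                       # content word represented by this node
    label: str                       # semantic label (illustrative values only)
    function_words: Dict[str, str] = field(default_factory=dict)  # kept as attributes, not as nodes
    children: List["TectoNode"] = field(default_factory=list)
    is_elided: bool = False          # True for a node restored for an ellipsis
    antecedent: Optional["TectoNode"] = None  # anaphora link, if any

    def add(self, child: "TectoNode") -> "TectoNode":
        """Attach a dependent node and return it."""
        self.children.append(child)
        return child


def count_nodes(node: TectoNode) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children)


# Illustrative sentence: "John said he will leave."
root = TectoNode("say", "PRED")
john = root.add(TectoNode("John", "ACT"))
# The auxiliary "will" is an attribute of "leave", not a node of its own.
leave = root.add(TectoNode("leave", "PAT", function_words={"aux": "will"}))
he = leave.add(TectoNode("he", "ACT", antecedent=john))

print(count_nodes(root))   # number of content-word nodes in the tree
print(he.antecedent.lemma) # lemma of the node that "he" refers to
```

In this sketch the five surface words yield four nodes, because the function word is folded into its content word. An elided word would be represented by an extra node with `is_elided=True`.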
Please consult the PCEDT website for more information and documentation.
Samples
Please follow this link for a sample of the data included.
Updates
None at this time.
Content Copyright
Portions © 1987-1989 Dow Jones & Company, Inc., © 2002-2012 Charles University in Prague,
Institute of Formal and Applied Linguistics, © 1999, 2004, 2012 Trustees of the University of Pennsylvania