TOWARDS A FASTER WAY OF CORPUS ANNOTATION: ANNOTATORS ARE SMART, BUT SLOW

Jan Hajic, Barbora Hladka, Petr Pajas
Charles University, Prague
e-mail: {hajic, hladka, pajas}@ufal.mff.cuni.cz

The importance of annotated text corpora is growing steadily. The size of the texts and the richness of the linguistic information to be added to them are two reasons why building a representative annotated corpus takes so much effort, time, and money. By size alone (about 1.8M tokens), the Prague Dependency Treebank version 1.0 (PDT, http://ufal.mff.cuni.cz/pdt) is the second largest annotated corpus available worldwide. We intend to use the experience gained during five years of work on the PDT to formulate a new annotation strategy that requires significantly less manual work, made possible by a larger share of automatic procedures.

Suppose that, given an automatic morphological tagger, we want to annotate additional texts. Suppose further that these texts are annotated manually (and, we assume, correctly) on the syntactic level, and that our goal is to dispense with manual morphological annotation completely, replacing it with the automatic tagger and a strictly knowledge-based post-tagging cross-correction procedure (syntax vs. morphology) that uses the syntactic annotation as additional information for correcting the tagger's output.

The hidden Markov model (HMM) and feature-based (FB) taggers we use are trained on a subset of PDT 1.0 annotated only on the morphological layer (469K tokens); for testing, we use data annotated on both layers (1,255K tokens). The test data are tagged by both taggers, so for each token we have two automatically assigned tags in addition to the manually assigned tag and the syntactic information. The overall accuracy of the two taggers on the test data differs only slightly: 92.62% (FB) vs. 92.75% (HMM).

With the complete set of 'syntax vs. morphology' checking rules, designed independently for the PDT 1.0 post-annotation corrections, we detect 34,294 (2.73%) / 30,962 (2.47%) tokens in the test data whose tags, as assigned by the FB/HMM tagger, break the rules. Since the rules are based on the relationship that must hold between a node and its governing node, at least one of the two automatically assigned tags involved in a violation is wrong.

Preliminary experiments were performed with the rules checking case agreement within prepositional phrases and agreement in case, gender, and number between an adjective and its governing noun, pronoun, adjective, or numeral. By filtering the governor/dependent tag pairs through the possibilities offered by a morphological analyzer, we can correct 3.14/3.17% of the incorrectly tagged tokens (relative, for the HMM/FB tagger), bringing the accuracy to 92.85% (FB) and 92.98% (HMM).

At the workshop we will present a detailed analysis of the errors the taggers produced, a detailed description of the scope of the checking rules, and the post-tagging improvement obtained when ALL of the available checking rules are used.
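
To make the filtering step for adjective agreement more concrete, the following Python sketch shows one way such a cross-correction could work. It is an illustration under stated assumptions, not the procedure actually used for the PDT: the `Tag` structure is a simplified stand-in for the full morphological tags, the names `agrees` and `correct_pair` are hypothetical, the analyzer output is a hand-written toy list, and the policy of correcting only when exactly one agreeing tag pair exists is our own assumption.

```python
from itertools import product
from typing import NamedTuple, Sequence, Tuple


class Tag(NamedTuple):
    """Simplified morphological tag (an assumption, not the full PDT tag)."""
    pos: str      # "N" = noun, "A" = adjective
    gender: str   # "M" = masc. animate, "I" = masc. inanimate, "F", "N"
    number: str   # "S" = singular, "P" = plural
    case: str     # "1" .. "7"


def agrees(governor: Tag, dependent: Tag) -> bool:
    """Checking rule: the adjective agrees with its governor in gender, number, case."""
    return (governor.gender, governor.number, governor.case) == (
        dependent.gender, dependent.number, dependent.case)


def correct_pair(gov_tag: Tag, dep_tag: Tag,
                 gov_options: Sequence[Tag],
                 dep_options: Sequence[Tag]) -> Tuple[Tag, Tag]:
    """Return a corrected (governor, dependent) tag pair.

    If the tagger output already satisfies the rule, it is kept. Otherwise
    the pairs offered by the analyzer are filtered by the rule, and the
    tags are replaced only when exactly one agreeing pair remains
    (hypothetical policy); ambiguous or unrepairable cases are left as is.
    """
    if agrees(gov_tag, dep_tag):
        return gov_tag, dep_tag
    candidates = [(g, d) for g, d in product(gov_options, dep_options)
                  if agrees(g, d)]
    if len(candidates) == 1:
        return candidates[0]
    return gov_tag, dep_tag


# Toy analyzer output for the Czech phrase "nového domu" (illustrative only).
DOMU = [Tag("N", "I", "S", "2"), Tag("N", "I", "S", "3"), Tag("N", "I", "S", "6")]
NOVEHO = [Tag("A", "M", "S", "2"), Tag("A", "M", "S", "4"),
          Tag("A", "I", "S", "2"), Tag("A", "N", "S", "2")]

# Suppose the tagger chose the animate accusative reading for the adjective.
tagger_gov = Tag("N", "I", "S", "2")
tagger_dep = Tag("A", "M", "S", "4")
fixed_gov, fixed_dep = correct_pair(tagger_gov, tagger_dep, DOMU, NOVEHO)
```

In this toy case the only pair permitted by both the analyzer and the agreement rule is the masculine inanimate genitive singular reading, so the adjective's tag is changed while the noun's tag is kept. Leaving the tags untouched when more than one agreeing pair survives is a conservative choice: it avoids replacing one wrong tag with another.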