PENN ARABIC TREEBANK

The Penn Arabic Treebank project started with the assumption that the existing methodological principles and the accumulated experience and 'wisdom' of the previous Penn treebanks could and should be very helpful in jumpstarting the Arabic Treebank. A team of three (Mohamed Maamouri, Ann Bies, and Hubert Jin) started work in November 2001 with the objective of completing the Treebank annotation of an initial 100K corpus of AFP newswire.

Mohamed Maamouri and Ann Bies studied Modern Standard Arabic syntax and the appropriateness and adaptability of the Penn English Treebank methodology to the Arabic language. Surprisingly, the study showed that the existing Penn English Treebank methodology and tags worked satisfactorily for the main Arabic syntactic issues and could therefore be used with few changes.

For the most part, our syntactic/predicate-argument annotation of newswire Arabic follows the bracketing guidelines for the Penn English Treebank where possible. Basic sentence structure, node labels and functional tags, arguments/adjuncts, coordination, and empty categories are treated in essentially the same way. Draft guidelines can be found in guidelines-TB-1-28-03.pdf.

Some points where the Arabic Treebank differs from the Penn English Treebank:

* Arabic subjects are analyzed as VP internal, following the verb. The subject (labeled NP-SBJ) is inside the VP after the verb, and is frequently pro-drop (NP-SBJ * ). If the lexical subject precedes the verb, it is labeled NP-TPC (topicalized) and traced to (NP-SBJ *T* ) following the verb.
* Matrix clause (S) coordination is possible and frequent.
* The function of NP objects of transitive verbs is shown directly as NP-OBJ.
* Co-reference is always shown on the node label, never on the empty category token itself.
* Gapping co-reference is always shown as '=' indexing, for both the template and the subsequent gap filling items.

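The VP-internal subject analysis described above can be checked mechanically on a bracketed tree. The following is a minimal sketch, not part of the Treebank tools: it reads one VP fragment copied from the sample bracketed tree in this README and reports each NP-SBJ that follows the verb inside its VP. The helper names (parse, split_label, vp_internal_subjects) are hypothetical, and recognizing the verb by the substring "VERB" in its POS label is an assumption based only on the tags visible in that sample.

```python
import re

# VP fragment copied from the sample bracketed tree in this README.
SAMPLE_VP = (
    "(VP (PRT (NEG_PART [lA])) (IV3FS+VERB_IMPERFECT [ttsE]) "
    "(NP-SBJ (DET+NOUN [AlmrAkz]) (DET+NOUN+NSUFF_FEM_SG [AlxASp])) "
    "(PP (PREP [l]) (NP (NOUN [{stqbAl]) (POSS_PRON_3MP [hm]))))"
)


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaf tokens such as [lA], *T* or 0 are kept as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def split_label(label):
    """Split a node label such as 'NP-SBJ' into ('NP', ['SBJ'])."""
    category, *tags = label.split("-")
    return category, tags


def vp_internal_subjects(tree):
    """Yield (verb_label, subject_label) for each NP-SBJ that follows the verb in a VP."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if split_label(label)[0] == "VP":
        # Assumption: verb POS labels contain "VERB", as in the README sample.
        verb_idx = next((i for i, (l, _) in enumerate(kids) if "VERB" in l), None)
        for i, (l, _) in enumerate(kids):
            category, tags = split_label(l)
            if category == "NP" and "SBJ" in tags and verb_idx is not None and i > verb_idx:
                yield kids[verb_idx][0], l
    for kid in kids:
        yield from vp_internal_subjects(kid)


if __name__ == "__main__":
    for verb, subject in vp_internal_subjects(parse(SAMPLE_VP)):
        print(verb, "->", subject)
```

The parser keeps node labels intact and only splits off function tags on demand, so co-indexed labels such as WHNP-1 or NP-1 pass through unchanged.
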
CLITICIZATION

The prevalence of cliticization in Arabic sentences of determiners, prepositions, conjunctions, and pronouns led to a necessary difference in tokenization between the POS files and the treebanking files. Clitics that play a role in the syntactic structure are split off into separate tokens (e.g., object pronouns cliticized to verbs, subject pronouns cliticized to complementizers, cliticized prepositions, etc.). Clitics that do not affect the structure are not separated (e.g., determiners).

There is a file containing the rules guiding the separation of the clitics:

    appendix/bin/func-subst2.list

SAMPLE BRACKETED TREE

(S (CONJ [w])
   (VP (IV3MS+VERB_IMPERFECT [yblg])
       (NP-SBJ (NOUN [Edd]) (NP (DET+NOUN+NSUFF_MASC_PL_ACCGEN [Alm$rdyn])))
       (PP-LOC (PREP [fy]) (NP (NOUN [kwntyp]) (NP (NOUN_PROP [lws]) (NOUN_PROP [Anjlys]))))
       (PP (PREP [nHw])
           (NP (NP (QP (NUM [48]) (NOUN [Alf])) (NOUN [$xS]))
               (SBAR (WHNP-1 0)
                     (S (PP-PRD (PREP [byn]) (NP (NP (PRON_3MP [hm])) (NP-1 *T*)))
                        (NP-SBJ (QP (NOUN+NSUFF_FEM_SG [tsEp]) (NOUN [AlAf])) (NOUN [Tfl]))))))
       (SBAR-TMP (CONJ [bynmA])
                 (S (VP (PRT (NEG_PART [lA]))
                        (IV3FS+VERB_IMPERFECT [ttsE])
                        (NP-SBJ (DET+NOUN [AlmrAkz]) (DET+NOUN+NSUFF_FEM_SG [AlxASp]))
                        (PP (PREP [l]) (NP (NOUN [{stqbAl]) (POSS_PRON_3MP [hm]))))
                    (PP (ADV [swY]) (PREP [l]) (NP (NUM [00631]) (NOUN [sryr])))))))
   (PUNC [.]))

TAGLIST

Constituent tags:

    S        sentence
    NP       noun phrase
    VP       verb phrase
    PP       prepositional phrase
    SBAR     S-bar (subordinate clause, complementizer or WH- and sentence)
    SBARQ    S-bar that is a question
    SQ       S that is a question
    NX       noun head in certain complex coordination contexts
    PRN      parenthetical
    PRT      particle
    QP       quantity phrase (multi-word numbers)
    ADJP     adjective phrase
    ADVP     adverb phrase
    FRAG     fragment
    WHNP     WH- noun phrase
    WHPP     WH- prepositional phrase
    WHADJP   WH- adjective phrase
    WHADVP   WH- adverb phrase
    CONJP    conjunction phrase (multi-word conjunction)
    INTJ     interjection
    NAC      Not-A-Constituent (mostly rightward moved conjuncts with conjunction)
    UCP      Unlike-Coordinated-Phrase (dominates coordination of NP and PP, e.g.)
    X        unknown, technical problem, etc.

Function (dash) tags:

    -SBJ     subject
    -OBJ     object (ONLY BRAND NEW TAG FOR ARABIC!)
    -TPC     topicalized
    -PRD     predicate
    -PRP     purpose
    -CLR     CLosely-Related (non dative PP argument of verb, for the most part)
    -LOC     locative
    -DIR     directional
    -MNR     manner
    -TMP     temporal
    -ADV     adverbial (for NP, S and SBAR)
    -LGS     LoGical Subject (NP object of by PP in passives)
    -NOM     nominal (for S and SBAR)
    -DTV     dative
    -VOC     vocative
    -BNF     benefactive
    -EXT     extent
    -CLF     cleft (as in, it-cleft)
    -HLN     headline
    -TTL     title

TREEBANK PARSING ANNOTATORS:

    Wigdan EL MEKKI
    Mohamed MANSOUR
    Tasneem GHANDOUR
    Ichraf AMGHOUZ
    Niama LAADIOUI