ARABIC PART-OF-SPEECH/MORPHOLOGICAL ANALYSIS TAGGING

The Penn Arabic Treebank uses a level of annotation more accurately described as morphological analysis than as part-of-speech tagging. In October 2001, the decision was taken to use Tim Buckwalter's morphological analyzer and main lexicon, which currently contains over 77,800 stem entries representing some 45,000 lexical items.

A DESCRIPTION OF TIM BUCKWALTER'S ARABIC MORPHOLOGICAL ANALYSIS TOOL

The Arabic morphological analysis and part-of-speech tagging were performed with the Buckwalter Arabic Morphological Analyzer, an open-source software package distributed by the LDC. The source code of the program and a full technical description can be downloaded for free from the LDC website:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49

What follows is a brief description of the Arabic morphology analysis algorithm and the structure of lexicon entries.

The Arabic morphology analysis is based on these assumptions:

  1. Arabic words are composed of three elements: prefix, stem, and suffix
  2. Prefix length is 0-4 characters
  3. Stem length is 1 or more characters (no upper limit)
  4. Suffix length is 0-6 characters

Given these rules, an Arabic word can be segmented as follows (using wbAlErbyp as an example):

    Prefix  Stem       Suffix
    ------  ---------  ------
            wbAlErbyp
            wbAlErby   p
            wbAlErb    yp
            wbAlEr     byp
            wbAlE      rbyp
            wbAl       Erbyp
            wbA        lErbyp
    w       bAlErbyp
    w       bAlErby    p
    w       bAlErb     yp
    w       bAlEr      byp
    w       bAlE       rbyp
    w       bAl        Erbyp
    w       bA         lErbyp
    wb      AlErbyp
    wb      AlErby     p
    wb      AlErb      yp
    wb      AlEr       byp
    wb      AlE        rbyp
    wb      Al         Erbyp
    wb      A          lErbyp
    wbA     lErbyp
    wbA     lErby      p
    wbA     lErb       yp
    wbA     lEr        byp
    wbA     lE         rbyp
    wbA     l          Erbyp
    wbAl    Erbyp
    wbAl    Erby       p
    wbAl    Erb        yp
    wbAl    Er         byp
    wbAl    E          rbyp

Arabic dictionary look-up consists of asking, for each segmentation:

  1. does the prefix exist in the lexicon of prefixes?
  2. if so, does the stem exist in the lexicon of stems?
  3. if so, does the suffix exist in the lexicon of suffixes?
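The segmentation rules and the three look-up questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analyzer's actual code: the names `segment` and `lookup` are hypothetical, and the three tiny lexicons are toy stand-ins for the real dictionaries.

```python
from typing import Iterator, List, Set, Tuple

MAX_PREFIX_LEN = 4  # assumption 2: prefix length is 0-4 characters
MAX_SUFFIX_LEN = 6  # assumption 4: suffix length is 0-6 characters

Segmentation = Tuple[str, str, str]  # (prefix, stem, suffix)


def segment(word: str) -> Iterator[Segmentation]:
    """Yield every (prefix, stem, suffix) split allowed by the length rules."""
    n = len(word)
    for p in range(0, min(MAX_PREFIX_LEN, n - 1) + 1):
        # The stem needs at least one character, so the suffix can use
        # at most n - p - 1 of the remaining characters.
        for s in range(0, min(MAX_SUFFIX_LEN, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


def lookup(word: str, prefixes: Set[str], stems: Set[str],
           suffixes: Set[str]) -> List[Segmentation]:
    """Keep the segmentations whose three parts all exist in their lexicons."""
    return [
        (pre, stem, suf)
        for pre, stem, suf in segment(word)
        if pre in prefixes and stem in stems and suf in suffixes
    ]


# Toy lexicons (hypothetical; the real dictionaries are far larger).
TOY_PREFIXES = {"", "w", "wbAl"}
TOY_STEMS = {"Erby", "Erb"}
TOY_SUFFIXES = {"", "p"}

if __name__ == "__main__":
    print(len(list(segment("wbAlErbyp"))))  # 32 candidate segmentations
    print(lookup("wbAlErbyp", TOY_PREFIXES, TOY_STEMS, TOY_SUFFIXES))
```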
Note that the dictionary of prefixes contains not only the individual prefixes (wa-, fa-, li-, Al-, bi-, etc.) but all valid concatenations of these as well (waAl-, biAl-, wabiAl-, etc.). The same applies to the dictionary of suffixes (-ap, -At, -Ani, -athu, -Athum, -Anihi, -tumuwhA, etc.).

Here are some sample entries from the dictionary of prefixes:

    wl    wali    NPref-Li    and + for/to           wa/CONJ+li/PREP+
    ll    lil     NPref-Li    to/for + the           li/PREP+Al/DET+
    wll   walil   NPref-Lil   and + to/for + the     wa/CONJ+li/PREP+Al/DET+
    wbAl  wabiAl  NPref-BiAl  and + with/by the      wa/CONJ+bi/PREP+Al/DET+

The first column contains the actual string that we look up, whereas the second column has the vocalized version of the same string. The third column has the morphological category (whose function is explained further below). The fourth column has the corresponding English glosses, followed by part-of-speech information for the constituent morphemes.

Here are some sample entries from the dictionary of stems (lines beginning with ";; " contain the lemma ID string):

    ;; Earabiy~_1
    Erby  Earabiy~  N/ap   Arab               Earabiy~/NOUN
    Erb   Earab     N      Arabs              Earab/NOUN
    Erby  Earabiy~  N/ap   Arab               Earabiy~/ADJ
    Erb   Earab     N      Arab               Earab/ADJ
    ;; Earabiy~_2
    Erby  Earabiy~  N-ap   Arabic;Arab        Earabiy~/ADJ
    ;; Earabiy~_3
    Erby  Earabiy~  N0     Arabi              Earabiy~/NOUN_PROP
    ;; Earabiy~ap_1
    Erby  Earabiy~  NapAt  Arabic (language)  Earabiy~/NOUN

The following are sample entries from the dictionary of suffixes:

    p   ap   NSuff-ap  [fem.sg.]  +ap/NSUFF_FEM_SG
    Ak  Aka  NSuff-Ah  your two   +A/NSUFF_MASC_DU_NOM+ka/POSS_PRON_2MS
    Ak  Aki  NSuff-Ah  your two   +A/NSUFF_MASC_DU_NOM+ki/POSS_PRON_2FS

If all three word elements (prefix, stem, suffix) are found in their respective lexicons, we then use their respective morphological categories (the string in column 3) to determine whether they are compatible. We ask:

  1. is the morphological category of the prefix compatible with the morphological category of the stem?
     (i.e., is the combination found in the list of compatible prefix-stem morphological categories?)
  2. if so, is the morphological category of the prefix compatible with the morphological category of the suffix?
     (i.e., is the combination found in the list of compatible prefix-suffix morphological categories?)
  3. if so, is the morphological category of the stem compatible with the morphological category of the suffix?
     (i.e., is the combination found in the list of compatible stem-suffix morphological categories?)

If the answer to the last question is "yes", then the morphological analysis is valid.

Example:

INPUT STRING: وصفه
LOOK-UP WORD: wSfh
  SOLUTION 1: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
     (GLOSS): + describe/characterize + he/it it/him
  SOLUTION 2: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
     (GLOSS): + prescribe/give a prescription to + he/it it/him
  SOLUTION 3: (waSofh) [waSof_1] waSof/NOUN+hu/POSS_PRON_3MS
     (GLOSS): + description/portrayal/characterization + its/his
  SOLUTION 4: (waSofh) [waSof_2] waSof/NOUN+hu/POSS_PRON_3MS
     (GLOSS): + characteristic + its/his
  SOLUTION 5: (waSaf~ahu) [Saf~-u_1] wa/CONJ+Saf~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
     (GLOSS): and + arrange/classify + he/it it/him
  SOLUTION 6: (waSaf~h) [Saf~_1] wa/CONJ+Saf~/NOUN+hu/POSS_PRON_3MS
     (GLOSS): and + line/row/class + its/his

Solution #1 was found to be valid because:

  1. All 3 components (null)+wSf+h exist in their respective lexicons (note that there is a literal entry for the null prefix):

         (null)  (null)  Pref-0     (null)
         wSf     waSaf   PV         describe;characterize
         h       ahu     PVSuff-ah  he/it it/him           +a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS

  2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:

     1. "Pref-0 PV" (listed in the table of compatible prefix-stem morphological categories)
     2. "PV PVSuff-ah" (listed in the table of compatible stem-suffix morphological categories)
     3. "Pref-0 PVSuff-ah" (listed in the table of compatible prefix-suffix morphological categories)

Solution #6 was found to be valid because:

  1. All 3 components w+Sf+h exist in their respective lexicons:

         w   wa    Pref-Wa  and             wa/CONJ+
         Sf  Saf~  Ndu      line;row;class
         h   h     NSuff-h  its/his         +hu/POSS_PRON_3MS

  2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:

     1. "Pref-Wa Ndu" (listed in the table of compatible prefix-stem morphological categories)
     2. "Ndu NSuff-h" (listed in the table of compatible stem-suffix morphological categories)
     3. "Pref-Wa NSuff-h" (listed in the table of compatible prefix-suffix morphological categories)

The lexicon of stems used in the morphology analysis contains 83,811 entries and 39,321 lemmas (as of Dec. 20, 2002).

AFP ARABIC POS TAGS

    ABBREV ADJ ADV CONJ DEM_PRON_F DEM_PRON_FD DEM_PRON_FS DEM_PRON_MD
    DEM_PRON_MP DEM_PRON_MS DET EMPHATIC_PARTICLE EXCEPT_PART FUNC_WORD FUT
    INTERJ INTERROG_PART IV1P IV1S IV2D IV2FS IV2MP IV2MS IV3FD IV3FP IV3FS
    IV3MD IV3MP IV3MS IVSUFF_DO:1P IVSUFF_DO:1S IVSUFF_DO:2MP IVSUFF_DO:2MS
    IVSUFF_DO:3D IVSUFF_DO:3FS IVSUFF_DO:3MP IVSUFF_DO:3MS
    IVSUFF_SUBJ:2FS_MOOD:SJ IVSUFF_SUBJ:D_MOOD:I IVSUFF_SUBJ:D_MOOD:SJ
    IVSUFF_SUBJ:FP IVSUFF_SUBJ:MP_MOOD:I IVSUFF_SUBJ:MP_MOOD:SJ NEG_PART
    NO_FUNC NON_ALPHABETIC NON_ARABIC NOUN NOUN_PROP NSUFF_FEM_DU_ACCGEN
    NSUFF_FEM_DU_ACCGEN_POSS NSUFF_FEM_DU_NOM NSUFF_FEM_DU_NOM_POSS
    NSUFF_FEM_PL NSUFF_FEM_SG NSUFF_MASC_DU_ACCGEN NSUFF_MASC_DU_ACCGEN_POSS
    NSUFF_MASC_DU_NOM NSUFF_MASC_DU_NOM_POSS NSUFF_MASC_PL_ACCGEN
    NSUFF_MASC_PL_ACCGEN_POSS NSUFF_MASC_PL_NOM NSUFF_MASC_PL_NOM_POSS
    NSUFF_MASC_SG_ACC_INDEF NUM NUMERIC_COMMA PART POSS_PRON_1P POSS_PRON_1S
    POSS_PRON_2FS POSS_PRON_2MP POSS_PRON_2MS POSS_PRON_3D POSS_PRON_3FP
    POSS_PRON_3FS POSS_PRON_3MP POSS_PRON_3MS PREP PRON_1P PRON_1S PRON_2FS
    PRON_2MP PRON_2MS PRON_3D PRON_3FP PRON_3FS PRON_3MP PRON_3MS PUNC
    PVSUFF_DO:1P PVSUFF_DO:1S PVSUFF_DO:3D PVSUFF_DO:3FS PVSUFF_DO:3MP
    PVSUFF_DO:3MS
    PVSUFF_SUBJ:1P PVSUFF_SUBJ:1S PVSUFF_SUBJ:2FS PVSUFF_SUBJ:2MP
    PVSUFF_SUBJ:3FD PVSUFF_SUBJ:3FP PVSUFF_SUBJ:3FS PVSUFF_SUBJ:3MD
    PVSUFF_SUBJ:3MP PVSUFF_SUBJ:3MS REL_PRON REL_ADV RESULT_CLAUSE_PARTICLE
    SUBJUNC VERB_IMPERFECT VERB_PERFECT VERB_PASSIVE

AFP POS COVERAGE STATISTICS

The AFP Corpus contains 140,265 tokens, of which 16,455 are punctuation, numbers, and Latin strings, and 123,810 are Arabic word tokens.

    Punctuation, Numbers, Latin strings     16,455
    Arabic Word Tokens                     123,810
    TOTAL                                  140,265

Of the 123,810 Arabic word tokens, 112,215 (90.63%) were provided with an accurate morphological analysis and POS tag. For the remaining 11,595 (9.37%), the analysis was judged to be inaccurate and the token was flagged with a Comment describing the nature of the inaccuracy.

    Accurately parsed Arabic Word Tokens   112,215    90.63%
    Commented Arabic Word Tokens            11,595     9.37%
    TOTAL                                  123,810   100.00%

Of the 11,595 Comments, the most frequently identified problems are the inaccurate parsing of proper names (28.47%) and the improper tagging of adjectives (18.30%). A large group of Comments (29.16%) could not be interpreted automatically (via scripting languages such as Perl) and was classified as Miscellaneous.

ARABIC POS QUALITY CONTROL COMPARISON, 6-26-02

Five files with a total of 853 words (and a varying number of POS choices per word) were each tagged independently by five annotators for a quality control comparison of POS annotators. Of the 853 words, 128 show some disagreement. All five annotators agreed on 85% of the words; the pairwise agreement rate (between any two annotators) is at least 92.2%. There are 82 words on which four annotators agreed and only one disagreed. Of those, 55 are cases in which "no selection" was chosen from among the POS choices because one annotator's definition of a good-enough match differed from all of the others'.
The annotators have since reached agreement on which cases are truly "no selection", and thus the rate of this disagreement should fall markedly in future POS files, raising the rate of overall agreement. In addition, we plan to revise the same five files to create a gold standard, which in the future may be used to evaluate and guide new annotators during their training period.

AFP POS ANNOTATORS

Current annotators:

    Wigdan EL MEKKI
    Mohamed MANSOUR
    Zohra BENTAOUIT
    Rachida FATHALLAH
    Dalel ZAKHARY
    Tasneem GHANDOUR
    Ichraf AMGHOUZ
    Niama LAADIOUI

Past annotators:

    Fatima EL HIMYANI
    Alexa FIRAT
    Sarah TLILI
    Gordon WITTY
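
ILLUSTRATIVE SKETCH OF THE COMPATIBILITY CHECK

Returning to the analyzer description, the three compatibility questions amount to membership tests on lists of category pairs. The following is a minimal Python sketch, not the analyzer's actual code: the `Entry` type, the `is_valid` function, and the three set names are hypothetical, and the tables hold only the six category pairs quoted in the wSfh example. A False result here therefore reflects the toy tables only and says nothing about the full compatibility tables.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """First three columns of a lexicon entry, as described in the text."""
    lookup: str      # column 1: the string that is looked up
    vocalized: str   # column 2: the vocalized form
    category: str    # column 3: the morphological category


# Toy compatibility tables: only the pairs quoted for solutions #1 and #6.
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def is_valid(prefix: Entry, stem: Entry, suffix: Entry) -> bool:
    """Ask the three questions in the order given in the text."""
    return (
        (prefix.category, stem.category) in PREFIX_STEM
        and (prefix.category, suffix.category) in PREFIX_SUFFIX
        and (stem.category, suffix.category) in STEM_SUFFIX
    )


# Entries copied from the example; the null prefix is modelled as "".
NULL_PREFIX = Entry("", "", "Pref-0")
WA_PREFIX = Entry("w", "wa", "Pref-Wa")
VERB_STEM = Entry("wSf", "waSaf", "PV")
NOUN_STEM = Entry("Sf", "Saf~", "Ndu")
VERB_SUFFIX = Entry("h", "ahu", "PVSuff-ah")
NOUN_SUFFIX = Entry("h", "h", "NSuff-h")
```

With these entries, `is_valid(NULL_PREFIX, VERB_STEM, VERB_SUFFIX)` and `is_valid(WA_PREFIX, NOUN_STEM, NOUN_SUFFIX)` correspond to solutions #1 and #6 and both return True, while a mixed combination such as `is_valid(WA_PREFIX, NOUN_STEM, VERB_SUFFIX)` returns False because the pair "Pref-Wa PVSuff-ah" is absent from the toy prefix-suffix table.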