ARABIC PART-OF-SPEECH/MORPHOLOGICAL ANALYSIS TAGGING

The Penn Arabic Treebank uses a level of annotation more accurately described as morphological analysis than as part-of-speech tagging. In October 2001, the decision was taken to use Tim Buckwalter's morphological analyzer and main lexicon, which currently contains over 77,800 stem entries representing some 45,000 lexical items.

A DESCRIPTION OF TIM BUCKWALTER'S ARABIC MORPHOLOGICAL ANALYSIS TOOL

The Arabic morphological analysis and part-of-speech tagging were performed with the Buckwalter Arabic Morphological Analyzer, an open-source software package distributed by the LDC. The source code of the program and a full technical description can be downloaded for free from the LDC website: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49.

What follows is a brief description of the Arabic morphological analysis algorithm and the structure of lexicon entries.

The Arabic morphological analysis is based on these assumptions:

1. Arabic words are composed of three elements: prefix, stem, and suffix
2. Prefix length is 0-4 characters
3. Stem length is 1 or more characters (no upper limit)
4. Suffix length is 0-6 characters

Given these rules, an Arabic word can be segmented as follows (using wbAlErbyp as an example):

Prefix  Stem       Suffix
        wbAlErbyp
        wbAlErby   p
        wbAlErb    yp
        wbAlEr     byp
        wbAlE      rbyp
        wbAl       Erbyp
        wbA        lErbyp
w       bAlErbyp
w       bAlErby    p
w       bAlErb     yp
w       bAlEr      byp
w       bAlE       rbyp
w       bAl        Erbyp
w       bA         lErbyp
wb      AlErbyp
wb      AlErby     p
wb      AlErb      yp
wb      AlEr       byp
wb      AlE        rbyp
wb      Al         Erbyp
wb      A          lErbyp
wbA     lErbyp
wbA     lErby      p
wbA     lErb       yp
wbA     lEr        byp
wbA     lE         rbyp
wbA     l          Erbyp
wbAl    Erbyp
wbAl    Erby       p
wbAl    Erb        yp
wbAl    Er         byp
wbAl    E          rbyp

Arabic dictionary look-up consists of asking, for each segmentation:

1. does the prefix exist in the lexicon of prefixes?
2. if so, does the stem exist in the lexicon of stems?
3. if so, does the suffix exist in the lexicon of suffixes?
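
The segmentation rules above can be sketched in a few lines of Python. This is only an illustration of the length constraints stated in this section, not the analyzer's actual source code; the function name segment and its parameters are hypothetical.

```python
def segment(word, max_prefix=4, max_suffix=6):
    """Return every (prefix, stem, suffix) split allowed by the length rules.

    Assumed rules (from the text): prefix 0-4 characters, suffix 0-6
    characters, stem at least 1 character. Names here are illustrative.
    """
    splits = []
    # The prefix may not consume the whole word: the stem needs 1+ characters.
    for p_len in range(min(max_prefix, len(word) - 1) + 1):
        # The suffix grows from 0 upward, so the longest stem comes first.
        for s_len in range(min(max_suffix, len(word) - p_len - 1) + 1):
            stem_end = len(word) - s_len
            splits.append((word[:p_len], word[p_len:stem_end], word[stem_end:]))
    return splits


for prefix, stem, suffix in segment("wbAlErbyp"):
    print(f"{prefix:<6}{stem:<11}{suffix}")
```

For wbAlErbyp this enumerates the same 32 candidate segmentations listed above, in the same order. Each candidate would then be passed to the three dictionary look-ups.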
Note that the dictionary of prefixes contains not only the individual prefixes (wa-, fa-, li-, Al-, bi-, etc.) but all valid concatenations of these as well (waAl-, biAl-, wabiAl-, etc.). The same applies to the dictionary of suffixes (-ap, -At, -Ani, -athu, -Athum, -Anihi, -tumuwhA, etc.).

Here are some sample entries from the dictionary of prefixes:

wl    wali    NPref-Li    and + for/to          wa/CONJ+li/PREP+
ll    lil     NPref-Li    to/for + the          li/PREP+Al/DET+
wll   walil   NPref-Lil   and + to/for + the    wa/CONJ+li/PREP+Al/DET+
wbAl  wabiAl  NPref-BiAl  and + with/by the     wa/CONJ+bi/PREP+Al/DET+

The first column contains the actual string that we look up, whereas the second column has the vocalized version of the same string. The third column has the morphological category (whose function is explained further below). The fourth column has the corresponding English gloss, followed by part-of-speech information for the constituent morphemes.

Here are some sample entries from the dictionary of stems (lines beginning with ";; " contain the lemma ID string):

;; Earabiy~_1
Erby  Earabiy~  N/ap   Arab               Earabiy~/NOUN
Erb   Earab     N      Arabs              Earab/NOUN
Erby  Earabiy~  N/ap   Arab               Earabiy~/ADJ
Erb   Earab     N      Arab               Earab/ADJ
;; Earabiy~_2
Erby  Earabiy~  N-ap   Arabic;Arab        Earabiy~/ADJ
;; Earabiy~_3
Erby  Earabiy~  N0     Arabi              Earabiy~/NOUN_PROP
;; Earabiy~ap_1
Erby  Earabiy~  NapAt  Arabic (language)  Earabiy~/NOUN

The following are sample entries from the dictionary of suffixes:

p   ap   NSuff-ap  [fem.sg.]  +ap/NSUFF_FEM_SG
Ak  Aka  NSuff-Ah  your two   +A/NSUFF_MASC_DU_NOM+ka/POSS_PRON_2MS
Ak  Aki  NSuff-Ah  your two   +A/NSUFF_MASC_DU_NOM+ki/POSS_PRON_2FS

If all three word elements (prefix, stem, suffix) are found in their respective lexicons, we then use their respective morphological categories (the string in column 3) to determine whether they are compatible. We ask:

1. is the morphological category of the prefix compatible with the morphological category of the stem?
(i.e., is the combination found in the list of compatible prefix-stem morphological categories?)
2. if so, is the morphological category of the prefix compatible with the morphological category of the suffix?
(i.e., is the combination found in the list of compatible prefix-suffix morphological categories?)
3. if so, is the morphological category of the stem compatible with the morphological category of the suffix?
(i.e., is the combination found in the list of compatible stem-suffix morphological categories?)

If the answer to the last question is "yes", then the morphological analysis is valid.

Example:

INPUT STRING: وصفه
LOOK-UP WORD: wSfh
  SOLUTION 1: (waSafahu) [waSaf-i_1] waSaf/PV+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
     (GLOSS): describe/characterize + he/it [verb] + it/him
  SOLUTION 2: (waSafahu) [waSaf-i_1] waSaf/PV+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
     (GLOSS): prescribe/give a prescription to + he/it [verb] + it/him
  SOLUTION 3: (waSofuhu) [waSof_1] waSof/NOUN+u/CASE_DEF_NOM+hu/POSS_PRON_3MS
     (GLOSS): description/portrayal/characterization + [def.nom] + its/his
  SOLUTION 4: (waSofahu) [waSof_1] waSof/NOUN+a/CASE_DEF_ACC+hu/POSS_PRON_3MS
     (GLOSS): description/portrayal/characterization + [def.acc.] + its/his
  SOLUTION 5: (waSofihi) [waSof_1] waSof/NOUN+i/CASE_DEF_GEN+hi/POSS_PRON_3MS
     (GLOSS): description/portrayal/characterization + [def.gen.] + its/his
  .............
  SOLUTION 12: (waSaf~ahu) [Saf~-u_1] wa/CONJ+Saf~/PV+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
     (GLOSS): and + arrange/classify + he/it [verb] + it/him
  SOLUTION 13: (waSaf~uhu) [Saf~_1] wa/CONJ+Saf~/NOUN+u/CASE_DEF_NOM+hu/POSS_PRON_3MS
     (GLOSS): and + line/row/class + [def.nom] + its/his

Solution #1 was found to be valid because:

1. All 3 components (null)+wSf+h exist in their respective lexicons (note that there is a literal entry for the null prefix):

   (null)  (null)  Pref-0     (null)
   wSf     waSaf   PV         describe;characterize
   h       ahu     PVSuff-ah  he/it it/him           +a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS

2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:

   1. "Pref-0 PV" (listed in the table of compatible prefix-stem morphological categories)
   2. "PV PVSuff-ah" (listed in the table of compatible stem-suffix morphological categories)
   3. "Pref-0 PVSuff-ah" (listed in the table of compatible prefix-suffix morphological categories)

Solution #13 was found to be valid because:

1. All 3 components w+Sf+h exist in their respective lexicons:

   w   wa    Pref-Wa  and             wa/CONJ+
   Sf  Saf~  Ndu      line;row;class
   h   h     NSuff-h  its/his         +hu/POSS_PRON_3MS

2. The morphological categories of all 3 components are listed as compatible pairs in the relevant compatibility tables:

   1. "Pref-Wa Ndu" (listed in the table of compatible prefix-stem morphological categories)
   2. "Ndu NSuff-h" (listed in the table of compatible stem-suffix morphological categories)
   3. "Pref-Wa NSuff-h" (listed in the table of compatible prefix-suffix morphological categories)

The lexicon of stems used in the morphological analysis contains 79,416 entries and 40,457 lemmas.
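
To make the look-up and compatibility steps concrete, here is a minimal Python sketch that reproduces the reasoning behind Solutions #1 and #13 for wSfh. It is not the analyzer's actual implementation. The dictionaries hold only the handful of entries and category pairs quoted in this section, the empty string is assumed to stand in for the literal null-prefix entry, and all names (PREFIXES, STEMS, SUFFIXES, analyze, and so on) are hypothetical.

```python
# Tiny illustrative subsets of the three lexicons: look-up string -> categories.
# Assumption: the empty string stands in for the literal (null) prefix entry.
PREFIXES = {"": ["Pref-0"], "w": ["Pref-Wa"]}
STEMS = {"wSf": ["PV"], "Sf": ["Ndu"]}
SUFFIXES = {"h": ["PVSuff-ah", "NSuff-h"]}

# Compatibility tables, reduced to the six pairs quoted in the example.
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def analyze(word, max_prefix=4, max_suffix=6):
    """Return (prefix, stem, suffix) category triples for every valid analysis."""
    solutions = []
    # Enumerate segmentations: prefix 0-4, suffix 0-6, stem 1+ characters.
    for p_len in range(min(max_prefix, len(word) - 1) + 1):
        for s_len in range(min(max_suffix, len(word) - p_len - 1) + 1):
            stem_end = len(word) - s_len
            prefix = word[:p_len]
            stem = word[p_len:stem_end]
            suffix = word[stem_end:]
            # Dictionary look-up: all three parts must have an entry.
            for p_cat in PREFIXES.get(prefix, []):
                for st_cat in STEMS.get(stem, []):
                    for sf_cat in SUFFIXES.get(suffix, []):
                        # Compatibility checks, in the order given in the text.
                        if ((p_cat, st_cat) in PREFIX_STEM
                                and (p_cat, sf_cat) in PREFIX_SUFFIX
                                and (st_cat, sf_cat) in STEM_SUFFIX):
                            solutions.append((p_cat, st_cat, sf_cat))
    return solutions


for solution in analyze("wSfh"):
    print(" + ".join(solution))
```

With these toy tables the sketch accepts exactly two analyses: Pref-0 + PV + PVSuff-ah (the pattern of Solution #1) and Pref-Wa + Ndu + NSuff-h (the pattern of Solution #13). Mixed combinations such as Pref-0 with NSuff-h are rejected because the pair is absent from the prefix-suffix table.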