ARABIC PART-OF-SPEECH/MORPHOLOGICAL ANALYSIS TAGGING
The Penn Arabic Treebank uses a level of annotation more accurately
described as morphological analysis than as part-of-speech tagging. In
October 2001, the decision was taken to use Tim Buckwalter's morphological
analyzer and main lexicon, which currently contains over 77,800 stem
entries representing some 45,000 lexical items.
A DESCRIPTION OF TIM BUCKWALTER'S ARABIC MORPHOLOGICAL ANALYSIS TOOL
The Arabic morphological analysis and part-of-speech tagging were performed
with the Buckwalter Arabic Morphological Analyzer, an open-source software
package distributed by the LDC. The source code of the program and a full
technical description can be downloaded for free from the LDC website:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49.
What follows is a brief description of the Arabic morphology analysis
algorithm and the structure of lexicon entries.
The Arabic morphology analysis is based on these assumptions:
1. Arabic words are composed of three elements: prefix, stem, and suffix.
2. Prefix length is 0-4 characters.
3. Stem length is 1 or more characters (no upper limit).
4. Suffix length is 0-6 characters.
Given these rules, an Arabic word can be segmented as follows (using
wbAlErbyp as an example):
Prefix    Stem         Suffix
(null)    wbAlErbyp    (null)
(null)    wbAlErby     p
(null)    wbAlErb      yp
(null)    wbAlEr       byp
(null)    wbAlE        rbyp
(null)    wbAl         Erbyp
(null)    wbA          lErbyp
w         bAlErbyp     (null)
w         bAlErby      p
w         bAlErb       yp
w         bAlEr        byp
w         bAlE         rbyp
w         bAl          Erbyp
w         bA           lErbyp
wb        AlErbyp      (null)
wb        AlErby       p
wb        AlErb        yp
wb        AlEr         byp
wb        AlE          rbyp
wb        Al           Erbyp
wb        A            lErbyp
wbA       lErbyp       (null)
wbA       lErby        p
wbA       lErb         yp
wbA       lEr          byp
wbA       lE           rbyp
wbA       l            Erbyp
wbAl      Erbyp        (null)
wbAl      Erby         p
wbAl      Erb          yp
wbAl      Er           byp
wbAl      E            rbyp
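
The enumeration above can be reproduced with a few lines of Python. This is
only a sketch of the length rules stated in this section, not the analyzer's
own code; the function name segment and its parameter names are chosen here
for illustration.

```python
def segment(word, max_prefix=4, max_suffix=6):
    """Return every (prefix, stem, suffix) split allowed by the length rules."""
    segmentations = []
    n = len(word)
    # The prefix may be empty, but at least one character must remain for the stem.
    for p in range(0, min(max_prefix, n - 1) + 1):
        remaining = n - p
        # The suffix may be empty and must also leave one character for the stem.
        for s in range(0, min(max_suffix, remaining - 1) + 1):
            prefix = word[:p]
            stem = word[p:n - s]
            suffix = word[n - s:]
            segmentations.append((prefix, stem, suffix))
    return segmentations


segs = segment("wbAlErbyp")
print(len(segs))  # 32
```

The 32 tuples come out in the same order as the rows listed above, with the
empty string standing for a null prefix or suffix.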
Arabic dictionary look-up consists of asking, for each segmentation:
1. does the prefix exist in the lexicon of prefixes?
2. if so, does the stem exist in the lexicon of stems?
3. if so, does the suffix exist in the lexicon of suffixes?
Note that the dictionary of prefixes contains not only the individual
prefixes (wa-, fa-, li-, Al-, bi-, etc.) but all valid concatenations of
these as well (waAl-, biAl-, wabiAl-, etc.). The same applies to the
dictionary of suffixes (-ap, -At, -Ani, -athu, -Athum, -Anihi, -tumuwhA,
etc.).
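
As a rough illustration of the three look-up questions, the sketch below
filters a few of the candidate segmentations of wbAlErbyp. The three sets are
toy stand-ins for the real dictionaries: they hold only a handful of
unvocalized look-up strings, and the empty-string suffix entry is an
assumption made for this example. Because concatenated prefixes are stored as
whole strings, "wbAl" is found by a single look-up.

```python
# Toy lexicons (hypothetical subsets, not the distributed dictionaries).
PREFIXES = {"", "w", "b", "Al", "wbAl"}   # "" is the literal null-prefix entry
STEMS = {"Erby", "Erb"}
SUFFIXES = {"", "p"}                      # "" as a null suffix is assumed here


def passes_lookup(prefix, stem, suffix):
    """Ask the three questions in order, stopping at the first 'no'."""
    if prefix not in PREFIXES:
        return False
    if stem not in STEMS:
        return False
    return suffix in SUFFIXES


candidates = [
    ("", "wbAlErbyp", ""),
    ("w", "bAlErby", "p"),
    ("wbAl", "Erb", "yp"),
    ("wbAl", "Erby", "p"),
]
survivors = [seg for seg in candidates if passes_lookup(*seg)]
print(survivors)  # [('wbAl', 'Erby', 'p')]
```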
Here are some sample entries from the dictionary of prefixes:
wl      wali      NPref-Li      and + for/to          wa/CONJ+li/PREP+
ll      lil       NPref-Li      to/for + the          li/PREP+Al/DET+
wll     walil     NPref-Lil     and + to/for + the    wa/CONJ+li/PREP+Al/DET+
wbAl    wabiAl    NPref-BiAl    and + with/by the     wa/CONJ+bi/PREP+Al/DET+
The first column contains the actual string that we look up, whereas the
second column has the vocalized version of the same string. The third
column has the morphological category (whose function is explained further
below). The fourth column has the corresponding English glosses and
contains part-of-speech information for the constituent morphemes.
Here are some sample entries from the dictionary of stems (lines beginning
with ";; " contain the lemma ID string):
;; Earabiy~_1
Erby    Earabiy~    N/ap     Arab                 Earabiy~/NOUN
Erb     Earab       N        Arabs                Earab/NOUN
Erby    Earabiy~    N/ap     Arab                 Earabiy~/ADJ
Erb     Earab       N        Arab                 Earab/ADJ
;; Earabiy~_2
Erby    Earabiy~    N-ap     Arabic;Arab          Earabiy~/ADJ
;; Earabiy~_3
Erby    Earabiy~    N0       Arabi                Earabiy~/NOUN_PROP
;; Earabiy~ap_1
Erby    Earabiy~    NapAt    Arabic (language)    Earabiy~/NOUN
The following are sample entries from the dictionary of suffixes:
p     ap     NSuff-ap    [fem.sg.]    +ap/NSUFF_FEM_SG
Ak    Aka    NSuff-Ah    your two     +A/NSUFF_MASC_DU_NOM+ka/POSS_PRON_2MS
Ak    Aki    NSuff-Ah    your two     +A/NSUFF_MASC_DU_NOM+ki/POSS_PRON_2FS
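
Entries of this shape can be read into a table keyed by the look-up string.
The sketch below is not the analyzer's loader: it assumes the four columns are
separated by tab characters and that a ";; " line sets the lemma ID for the
entries that follow it. The names Entry and load_dictionary are hypothetical.
One look-up string may map to several entries, as with Erby or Ak above.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    lookup: str                # column 1: string that is looked up
    vocalized: str             # column 2: vocalized form
    category: str              # column 3: morphological category
    gloss_and_pos: str         # column 4: English gloss and POS information
    lemma_id: Optional[str]    # from the preceding ";; " line, if any


def load_dictionary(lines):
    """Group entries by look-up string (tab-separated columns are assumed)."""
    table = defaultdict(list)
    lemma_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(";; "):
            lemma_id = line[3:].strip()
            continue
        fields = line.split("\t", 3)
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        table[fields[0]].append(Entry(*fields, lemma_id))
    return table


SAMPLE = [
    ";; Earabiy~_1",
    "Erby\tEarabiy~\tN/ap\tArab Earabiy~/NOUN",
    "Erb\tEarab\tN\tArabs Earab/NOUN",
    ";; Earabiy~_2",
    "Erby\tEarabiy~\tN-ap\tArabic;Arab Earabiy~/ADJ",
]
stems = load_dictionary(SAMPLE)
print([e.category for e in stems["Erby"]])  # ['N/ap', 'N-ap']
```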
If all three word elements (prefix, stem, suffix) are found in their
respective lexicons, we then use their respective morphological categories
(the string in column 3) to determine whether they are compatible. We ask:
1. is the morphological category of the prefix compatible with the
morphological category of the stem? (i.e., is the combination found in the
list of compatible prefix-stem morphological categories?)
2. if so, is the morphological category of the prefix compatible with the
morphological category of the suffix? (i.e., is the combination found in
the list of compatible prefix-suffix morphological categories?)
3. if so, is the morphological category of the stem compatible with the
morphological category of the suffix? (i.e., is the combination found in
the list of compatible stem-suffix morphological categories?)
If the answer to all three questions is "yes", then the morphological
analysis is valid.
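
The three compatibility questions amount to three membership tests. In the
sketch below the tables are modeled as sets of category pairs; they contain
only the pairs cited in the worked example in this section, so a pair missing
from them is not necessarily missing from the real tables. The function name
is_valid is chosen for illustration.

```python
# Toy compatibility tables holding only the pairs cited in this section.
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def is_valid(prefix_cat, stem_cat, suffix_cat):
    """Answer the three questions in order; 'and' stops at the first 'no'."""
    return (
        (prefix_cat, stem_cat) in PREFIX_STEM
        and (prefix_cat, suffix_cat) in PREFIX_SUFFIX
        and (stem_cat, suffix_cat) in STEM_SUFFIX
    )


print(is_valid("Pref-0", "PV", "PVSuff-ah"))  # True
print(is_valid("Pref-0", "PV", "NSuff-h"))    # False
```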
Example:
INPUT STRING: وصفه
LOOK-UP WORD: wSfh
SOLUTION 1: (waSafahu) [waSaf-i_1] waSaf/PV+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): describe/characterize + he/it [verb] + it/him
SOLUTION 2: (waSafahu) [waSaf-i_1] waSaf/PV+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): prescribe/give a prescription to + he/it [verb] + it/him
SOLUTION 3: (waSofuhu) [waSof_1] waSof/NOUN+u/CASE_DEF_NOM+hu/POSS_PRON_3MS
(GLOSS): description/portrayal/characterization + [def.nom.] + its/his
SOLUTION 4: (waSofahu) [waSof_1] waSof/NOUN+a/CASE_DEF_ACC+hu/POSS_PRON_3MS
(GLOSS): description/portrayal/characterization + [def.acc.] + its/his
SOLUTION 5: (waSofihi) [waSof_1] waSof/NOUN+i/CASE_DEF_GEN+hi/POSS_PRON_3MS
(GLOSS): description/portrayal/characterization + [def.gen.] + its/his
.............
SOLUTION 12: (waSaf~ahu) [Saf~-u_1] wa/CONJ+Saf~/PV+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): and + arrange/classify + he/it [verb] + it/him
SOLUTION 13: (waSaf~uhu) [Saf~_1] wa/CONJ+Saf~/NOUN+u/CASE_DEF_NOM+hu/POSS_PRON_3MS
(GLOSS): and + line/row/class + [def.nom.] + its/his
Solution #1 was found to be valid because:
1. All 3 components (null)+wSf+h exist in their respective lexicons (note
that there is a literal entry for the null prefix):
(null)    (null)    Pref-0       (null)
wSf       waSaf     PV           describe;characterize
h         ahu       PVSuff-ah    he/it it/him             +a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
2. The morphological categories of all 3 components are listed as
compatible pairs in the relevant compatibility tables:
   1. "Pref-0 PV" (listed in the table of compatible prefix-stem
      morphological categories)
   2. "PV PVSuff-ah" (listed in the table of compatible stem-suffix
      morphological categories)
   3. "Pref-0 PVSuff-ah" (listed in the table of compatible prefix-suffix
      morphological categories)
Solution #13 was found to be valid because:
1. All 3 components w+Sf+h exist in their respective lexicons:
w     wa      Pref-Wa    and               wa/CONJ+
Sf    Saf~    Ndu        line;row;class
h     hu      NSuff-h    its/his           +hu/POSS_PRON_3MS
2. The morphological categories of all 3 components are listed as
compatible pairs in the relevant compatibility tables:
   1. "Pref-Wa Ndu" (listed in the table of compatible prefix-stem
      morphological categories)
   2. "Ndu NSuff-h" (listed in the table of compatible stem-suffix
      morphological categories)
   3. "Pref-Wa NSuff-h" (listed in the table of compatible prefix-suffix
      morphological categories)
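
To tie the steps together, the sketch below runs segmentation, dictionary
look-up, and the compatibility check on wSfh. It is a simplified illustration
and not the distributed analyzer: the lexicons map look-up strings to category
names only, and both they and the compatibility sets contain just the entries
cited for Solutions #1 and #13. The function name analyze is hypothetical.

```python
# Toy data limited to the entries cited for Solutions #1 and #13.
PREFIXES = {"": ["Pref-0"], "w": ["Pref-Wa"]}
STEMS = {"wSf": ["PV"], "Sf": ["Ndu"]}
SUFFIXES = {"h": ["PVSuff-ah", "NSuff-h"]}

PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def analyze(word, max_prefix=4, max_suffix=6):
    """Return (prefix, stem, suffix, categories) for every valid analysis."""
    solutions = []
    n = len(word)
    for p in range(0, min(max_prefix, n - 1) + 1):
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            prefix, stem, suffix = word[:p], word[p:n - s], word[n - s:]
            # A look-up string may carry several entries; try every combination.
            for pc in PREFIXES.get(prefix, []):
                for sc in STEMS.get(stem, []):
                    for xc in SUFFIXES.get(suffix, []):
                        if ((pc, sc) in PREFIX_STEM
                                and (pc, xc) in PREFIX_SUFFIX
                                and (sc, xc) in STEM_SUFFIX):
                            solutions.append((prefix, stem, suffix, (pc, sc, xc)))
    return solutions


for solution in analyze("wSfh"):
    print(solution)
# ('', 'wSf', 'h', ('Pref-0', 'PV', 'PVSuff-ah'))
# ('w', 'Sf', 'h', ('Pref-Wa', 'Ndu', 'NSuff-h'))
```

The suffix string h is found under two categories, and the compatibility
tables decide which one pairs with each prefix and stem.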
The lexicon of stems used in the morphology analysis contains 79,416
entries and 40,457 lemmas.