Introduction
Buckwalter Arabic Morphological Analyzer Version 1.0 was produced by the
Linguistic Data Consortium (LDC), catalog number LDC2002L49 and ISBN
1-58563-257-0. The analyzer is used for morphological analysis and
POS tagging of Arabic text.
Data
The data consists primarily of three Arabic-English lexicon files:
prefixes (299 entries), suffixes (618 entries), and stems (82,158 entries
representing 38,600 lemmas). The lexicons are supplemented by three
morphological compatibility tables used for controlling prefix-stem
combinations (1,648 entries), stem-suffix combinations (1,285 entries), and
prefix-suffix combinations (598 entries). The actual code for morphology
analysis and POS tagging is contained in a Perl script. The documentation
consists of a readme file with a description of the lexicon files, the
morphological compatibility tables, the morphology analysis algorithm, a
summary of stem morphological categories, and a table with the author's
Arabic transliteration system.
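
To illustrate how three lexicons and three compatibility tables can work together, here is a minimal Python sketch. It is not the Perl script shipped with the release. The lexicon entries, the category names (Pref-0, Pref-Al, N, Suff-0, Suff-Poss), and the function names are all hypothetical and chosen only for illustration; the real files are far larger and their formats are described in the readme. The sketch splits a transliterated word into every possible prefix, stem, and suffix, looks each piece up, and keeps an analysis only when all three category pairs are permitted.

```python
from itertools import product

# Toy lexicons: surface form -> list of (category, English gloss).
# Every entry and category name below is invented for illustration.
PREFIXES = {"": [("Pref-0", "")], "Al": [("Pref-Al", "the")]}
STEMS = {"ktAb": [("N", "book")]}
SUFFIXES = {"": [("Suff-0", "")], "h": [("Suff-Poss", "his")]}

# Toy compatibility tables: sets of permitted category pairs.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-Al", "N")}
STEM_SUFFIX = {("N", "Suff-0"), ("N", "Suff-Poss")}
# The definite article and a possessive suffix are not allowed together.
PREFIX_SUFFIX = {
    ("Pref-0", "Suff-0"),
    ("Pref-0", "Suff-Poss"),
    ("Pref-Al", "Suff-0"),
}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for i in range(n):                  # prefix is word[:i]
        for j in range(i + 1, n + 1):   # stem is word[i:j], suffix is word[j:]
            yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return (prefix, stem, suffix, gloss) for each compatible analysis."""
    results = []
    for pre, stem, suf in segmentations(word):
        if pre not in PREFIXES or stem not in STEMS or suf not in SUFFIXES:
            continue
        for (pc, pg), (sc, sg), (xc, xg) in product(
            PREFIXES[pre], STEMS[stem], SUFFIXES[suf]
        ):
            # All three pairwise checks must pass.
            if (
                (pc, sc) in PREFIX_STEM
                and (sc, xc) in STEM_SUFFIX
                and (pc, xc) in PREFIX_SUFFIX
            ):
                gloss = "+".join(g for g in (pg, sg, xg) if g)
                results.append((pre, stem, suf, gloss))
    return results


if __name__ == "__main__":
    for w in ("AlktAb", "ktAbh", "AlktAbh"):
        print(w, analyze(w))
```

In this toy setup, "AlktAb" and "ktAbh" each receive one analysis, while "AlktAbh" receives none: its prefix and suffix are each compatible with the stem, but the prefix-suffix table rejects the pair. This shows why a third table is useful in addition to the prefix-stem and stem-suffix tables.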
Updates
Six files were named in the data with a different letter case than in the
documentation and the script, which caused the analyzer to crash on
case-sensitive systems. This problem has been fixed, and the corrected
version of the analyzer is now available for download.
Content Copyright
Portions © 2002 QAMUS LLC (www.qamus.org),
© 2002 Trustees of the University of Pennsylvania
The Linguistic Data Consortium is releasing this software under the
GNU General Public License; organizations interested in licensing the lexicon
and/or morphological analyzer for commercial use should contact:
QAMUS LLC
448 South 48th St.
Philadelphia, PA 19143
ATTN: Tim Buckwalter
email: info@qamus.org
Note
This corpus is distributed free of charge as a web download; to obtain the data, submit a request to ldc@ldc.upenn.edu. There is a $100 charge if the data is requested on CD-ROM.