Introduction
The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1
was developed by researchers at LDC. SAMA 3.1 is based on, and
updates, the Buckwalter Arabic Morphological Analyzer (BAMA) 2.0
(LDC2004L02), which was developed by Tim Buckwalter. Although this
is the first public release of SAMA, it is numbered 3.1 to reflect
its continuity with previous BAMA releases.
SAMA 3.1 is a software tool for the morphological analysis of
Standard Arabic. SAMA 3.1 considers each Arabic word token in all
possible 'prefix-stem-suffix' segmentations, and lists all
known/possible annotation solutions, with assignment of all
diacritic marks, morpheme boundaries (separating clitics and
inflectional morphemes from stems), and all Part-of-Speech (POS)
labels and glosses for each morpheme segment. The generated output
may then be reviewed by users, and the most appropriate annotation
selected from among several choices.
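The segmentation step described above can be illustrated with a short Python sketch. It is not the SAMA.pm implementation (which is written in Perl). The function name and the length limits of 4 characters for prefixes and 6 for suffixes are assumptions made for illustration, and the input is taken to be in Buckwalter transliteration.

```python
def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split of `word` with a non-empty stem.

    max_prefix and max_suffix are illustrative limits, not values taken
    from the SAMA 3.1 documentation.
    """
    n = len(word)
    # The prefix may be empty, but must leave at least one character for the stem.
    for p in range(0, min(max_prefix, n - 1) + 1):
        # The suffix may be empty, and must also leave at least one stem character.
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


if __name__ == "__main__":
    # "wktAb" is a hypothetical input token in Buckwalter transliteration.
    for prefix, stem, suffix in segmentations("wktAb"):
        print(repr(prefix), repr(stem), repr(suffix))
```

Each candidate split would then be checked against the lexicons, so most of the enumerated segmentations are expected to be discarded.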
The software layer of SAMA 3.1 relies on a data layer that consists
primarily of three Arabic-English lexicon files: prefixes (1328
entries), suffixes (945 entries), and stems (79318 entries
representing 40654 lemmas). The lexicons are supplemented by three
morphological compatibility tables used for controlling prefix-stem
combinations (2497 entries), stem-suffix combinations (1632
entries), and prefix-suffix combinations (1180 entries).
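As a rough illustration of how three lexicons and three compatibility tables can work together, the Python sketch below accepts a candidate segmentation only if all three pairwise category checks pass. The entries, category names, and glosses are hypothetical toy data, not excerpts from the SAMA 3.1 files, and the data structures are an assumption about one reasonable layout.

```python
from itertools import product

# Toy lexicons: surface form -> list of (category, gloss).
# All entries are hypothetical, written in Buckwalter transliteration.
PREFIXES = {"": [("Pref-0", "")], "w": [("Pref-Wa", "and")]}
STEMS = {"ktAb": [("N", "book")]}
SUFFIXES = {"": [("Suff-0", "")], "At": [("NSuff-At", "[fem.pl.]")]}

# Toy compatibility tables: sets of permitted category pairs.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-Wa", "N")}
STEM_SUFFIX = {("N", "Suff-0")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-Wa", "Suff-0")}


def analyze(prefix, stem, suffix):
    """Return the glosses of all lexicon entry combinations that are compatible."""
    solutions = []
    for (pc, pg), (sc, sg), (xc, xg) in product(
        PREFIXES.get(prefix, []), STEMS.get(stem, []), SUFFIXES.get(suffix, [])
    ):
        # A solution must pass all three pairwise compatibility checks.
        if (
            (pc, sc) in PREFIX_STEM
            and (sc, xc) in STEM_SUFFIX
            and (pc, xc) in PREFIX_SUFFIX
        ):
            solutions.append(" + ".join(g for g in (pg, sg, xg) if g))
    return solutions
```

In this toy data, `analyze("w", "ktAb", "")` yields one solution, while `analyze("", "ktAb", "At")` yields none because the stem-suffix table does not license that pair.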
Differences since BAMA 2.0
The input format, output format, and data layer of SAMA 3.1 were
designed to be backward compatible with BAMA. Incremental changes to
the data layer in SAMA have resulted in:
- increased lexicon coverage in the dictionary files
- important changes and additions to the inventory of POS
tags
- more possible solutions generated for numerous word forms
Data-layer changes are summarized in more detail in the
"table_updates*.txt" documentation files included in the
corpus documentation.
The software implementation has been updated to offer more
input/output, installation, and configuration options, and to
integrate more smoothly with other Perl tools and services.
The structure of the dictionary and morphotactic tables has
remained the same (the tables provided with SAMA 3.1 differ from the
BAMA 2.0 tables only in size and content, not in format). Logical
separation between the software layer and data layer allows the new
software tools to be used with previous versions of the tables
(instructions are provided with the software documentation).
The basic logic that implements the segmentation and analysis
look-up for Arabic words is essentially unchanged since BAMA 2.0.
The perldoc documentation for the "SAMA.pm" Perl module
gives a full account of the tokenization logic.
The data layer is now accessed through Berkeley DB, with
result-caching enabled by default, leading to improved performance.
Various utility scripts have also been added to the software package
to facilitate more flexible interaction with tools and data.
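The general idea of a disk-backed lookup table with result caching can be sketched in Python as follows. This is only an analogy: it uses the standard-library `dbm.dumb` module as a stand-in for Berkeley DB, and the class name, method names, and read counter are hypothetical and do not correspond to anything in SAMA.pm.

```python
import dbm.dumb
import os
import tempfile


class CachedLexicon:
    """Key-value store on disk with an in-memory cache of earlier look-ups."""

    def __init__(self, path):
        self._db = dbm.dumb.open(path, "c")  # stand-in for a Berkeley DB file
        self._cache = {}
        self.disk_reads = 0  # counts look-ups that had to go to disk

    def put(self, key, value):
        self._db[key.encode("utf-8")] = value.encode("utf-8")
        self._cache.pop(key, None)  # drop any stale cached result

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.disk_reads += 1
        raw = self._db.get(key.encode("utf-8"))
        value = raw.decode("utf-8") if raw is not None else None
        self._cache[key] = value  # misses are cached as None as well
        return value

    def close(self):
        self._db.close()


def demo():
    """Look the same key up twice; only the first look-up reads from disk."""
    with tempfile.TemporaryDirectory() as tmp:
        lexicon = CachedLexicon(os.path.join(tmp, "stems"))
        lexicon.put("ktAb", "book")
        first = lexicon.get("ktAb")
        second = lexicon.get("ktAb")
        reads = lexicon.disk_reads
        lexicon.close()
        return first, second, reads
```

Because running text repeats many word forms, caching of this kind can avoid most repeated look-ups, which is one plausible source of the performance gain mentioned above.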
UTF-8 is now the default input/output and internal character
encoding, with automatic conversion of different input encodings
(cp1256, iso-8859-6, and Buckwalter transliteration are also
accepted). With this change, the use of UTF-8 as input is now fully
supported, eliminating a range of problems that would result from
having to convert to cp1256 for analysis. Full details about
input/output options are provided in the "SAMA.pm"
documentation.
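The effect of normalizing the accepted input encodings to a single internal representation can be sketched in Python. The function name and the `"buckwalter"` encoding label are hypothetical. The mapping covers the core Buckwalter transliteration table only, and the table shipped in Encode::Buckwalter may include further characters.

```python
# Core Buckwalter symbols, listed in the order of the Unicode code points they map to.
_BW_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 .. U+063A
_BW_REST = "_fqklmnhwYyFNKaui~o"            # U+0640 .. U+0652

BUCKWALTER_TO_ARABIC = {ch: chr(0x0621 + i) for i, ch in enumerate(_BW_LETTERS)}
BUCKWALTER_TO_ARABIC.update({ch: chr(0x0640 + i) for i, ch in enumerate(_BW_REST)})
BUCKWALTER_TO_ARABIC.update({"`": "\u0670", "{": "\u0671"})


def to_unicode_text(data, encoding="utf-8"):
    """Convert input in one of the accepted encodings to a Unicode string.

    `encoding` may be "utf-8", "cp1256", "iso-8859-6", or the hypothetical
    label "buckwalter" for transliterated ASCII input.
    """
    if encoding == "buckwalter":
        text = data.decode("ascii") if isinstance(data, bytes) else data
        # Characters outside the table (spaces, digits) pass through unchanged.
        return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)
    return data.decode(encoding)


if __name__ == "__main__":
    word = to_unicode_text("ktAb", "buckwalter")
    # The same word arrives identically from a legacy 8-bit encoding.
    print(word == to_unicode_text(word.encode("cp1256"), "cp1256"))
```

Once every input is converted this way, the rest of a pipeline only needs to handle one representation, which matches the motivation given above for making UTF-8 the default.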
Further details on changes in software options and implementation
may be found in the perldoc software tool documentation, and in the
Changes*.txt documentation files.
Dependencies
There are two dependencies for installing and using SAMA 3.1: the
DB_File.pm module (available from CPAN), and Encode::Buckwalter
(included with the SAMA 3.1 distribution). The DB_File module in
turn requires that the Berkeley DB libraries be present.
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
There are no updates available at this time.
Content Copyright
Portions © 2002-2004 QAMUS LLC, © 2002-2010 Trustees of
the University of Pennsylvania