Introduction
This file contains documentation on the Buckwalter Arabic Morphological Analyzer Version 2.0, Linguistic
Data Consortium (LDC) catalog number LDC2004L02 and ISBN 1-58563-311-9.
Note: This release, unlike Version 1, is available only to LDC members. To find out how to join, please consult our FAQ. Additional licensing terms apply. To examine the license, please follow the Member License Online link above. You will also be presented with this license upon download and asked to accept it. You must accept the terms in order for the download to proceed.
Data
The data consists primarily of three Arabic-English lexicon files: prefixes (548 entries),
suffixes (906 entries), and stems (78,839 entries representing 40,219 lemmas). The
lexicons are supplemented by three morphological compatibility tables used for
controlling prefix-stem combinations (2,435 entries), stem-suffix combinations (1,612
entries), and prefix-suffix combinations (1,138 entries). The actual code for
morphological analysis and POS tagging is contained in a Perl script (AraMorph.pl).
A sample input file (infile.txt) and the corresponding output file (outfile.xml) are provided.
The documentation consists of a readme file with a description of the three lexicon
files, the three morphological compatibility tables, the morphology analysis
algorithm, and a table with the author's Arabic transliteration system.
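The way three lexicons and three compatibility tables can work together is easier to see in a small example. The following is a minimal sketch in Python, not the code in AraMorph.pl. It assumes that each lexicon entry carries a morphological category and that a prefix-stem-suffix split is accepted only when all three category pairs appear in the corresponding tables. Every entry, category name (P_NONE, V_STEM, and so on), and the analyze function itself is hypothetical and invented for illustration; none of it is taken from the released data files.

```python
from typing import Dict, List, Set, Tuple

# Toy lexicons: surface form -> morphological categories.
# All forms and category names below are invented for illustration.
PREFIXES: Dict[str, List[str]] = {"": ["P_NONE"], "wa": ["P_CONJ"]}
STEMS: Dict[str, List[str]] = {"katab": ["V_STEM"], "kitAb": ["N_STEM"]}
SUFFIXES: Dict[str, List[str]] = {"": ["X_NONE"], "at": ["X_VERB"], "u": ["X_NOUN"]}

# Toy compatibility tables: sets of category pairs allowed to combine.
PREFIX_STEM: Set[Tuple[str, str]] = {
    ("P_NONE", "V_STEM"), ("P_NONE", "N_STEM"),
    ("P_CONJ", "V_STEM"), ("P_CONJ", "N_STEM"),
}
STEM_SUFFIX: Set[Tuple[str, str]] = {
    ("V_STEM", "X_NONE"), ("V_STEM", "X_VERB"),
    ("N_STEM", "X_NONE"), ("N_STEM", "X_NOUN"),
}
PREFIX_SUFFIX: Set[Tuple[str, str]] = {
    ("P_NONE", "X_NONE"), ("P_NONE", "X_VERB"), ("P_NONE", "X_NOUN"),
    ("P_CONJ", "X_NONE"), ("P_CONJ", "X_VERB"), ("P_CONJ", "X_NOUN"),
}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every (prefix, stem, suffix) split that passes all three checks."""
    results: List[Tuple[str, str, str]] = []
    n = len(word)
    for i in range(n):                 # prefix is word[:i], possibly empty
        for j in range(i + 1, n + 1):  # stem is word[i:j], never empty
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            for p_cat in PREFIXES.get(prefix, []):
                for s_cat in STEMS.get(stem, []):
                    for x_cat in SUFFIXES.get(suffix, []):
                        # Accept only if all three pairwise checks succeed.
                        if ((p_cat, s_cat) in PREFIX_STEM
                                and (s_cat, x_cat) in STEM_SUFFIX
                                and (p_cat, x_cat) in PREFIX_SUFFIX):
                            results.append((prefix, stem, suffix))
    return results


if __name__ == "__main__":
    for w in ("wakatabat", "kitAbu", "katabu"):
        print(w, analyze(w))
```

With this toy data, "wakatabat" splits as wa + katab + at and "kitAbu" as kitAb + u, while "katabu" yields no analysis because the verb-stem and noun-suffix categories are not paired in the stem-suffix table.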
Samples
To see an example of the analyzer's output, please examine this sample.
Availability
The release is available to 2004 and 2006 members via download here. Copies may also be requested on CD for an additional fee of US$150.
Copyright
Portions © 2002-2004 QAMUS LLC (www.qamus.org), © 2002-2004 Trustees of the University of Pennsylvania