Coverage Statistics

The ATB3 (Arabic Treebank: Part 3 v1.0) Corpus contains 340,281 tokens, of which 47,246 are punctuation, numbers, and Latin strings, and 293,035 are Arabic word tokens.

Punctuation, Numbers, Latin strings     47,246
Arabic Word Tokens                     293,035
----------------------------------------------
TOTAL                                  340,281

Of the 293,035 Arabic word tokens, 290,842 (99.25%) were provided with an accurate morphological analysis and POS tag by the Buckwalter Arabic Morphological Analyzer. The remaining 2,193 (0.75%) were judged to be incorrectly analyzed and were flagged with a Comment describing the nature of the inaccuracy. (Note that 237 of these 2,193 tokens turned out to be typos in the original text, so the Analyzer's overall accuracy rate is actually 99.33%.)

Accurately parsed Arabic Word Tokens  290,842    99.25%
Commented Arabic Word Tokens            2,193     0.75%
-------------------------------------------------------
TOTAL                                 293,035   100.00%

The Annotator Comments show that most of the words that were not provided with a correct morphological analysis were words missing from the Analyzer lexicon (see table below).

============================================
ANNOTATOR COMMENTS ON ITEMS WITH NO SOLUTION
============================================
COMMENT                        COUNT       %
--------------------------------------------
NOT_IN_LEXICON (MISC)            796   36.30
NOT_IN_LEXICON ADJ               317   14.46
NOT_IN_LEXICON NOUN              279   12.72
PASSIVE_FORM                     251   11.45
TYPO                             237   10.81
DIALECTAL_FORM                    70    3.19
NOT_IN_LEXICON PLURAL             56    2.55
IMPERATIVE                        49    2.23
NOT_IN_LEXICON FOREIGN WORD       44    2.01
NOT_IN_LEXICON VERB               24    1.09
GRAMMAR_PROBLEM                   13    0.59
NOT_IN_LEXICON ADV                13    0.59
NOUN_SHOULD_BE_ADJ                11    0.50
ABBREV                            10    0.46
INTERR_PARTICLE                   10    0.46
NOT_IN_LEXICON DUAL                9    0.41
ADJ_SHOULD_BE_NOUN                 4    0.18
--------------------------------------------
TOTAL                           2193  100.00
============================================
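
The figures above can be rechecked with a short script. The following is a minimal sketch, not part of the corpus tooling: the counts are copied from the tables, and all names (`pct`, `typo_adjusted_accuracy`, `COMMENT_COUNTS`, and so on) are hypothetical. The text does not say how the 99.33% figure was derived, so the sketch assumes the 237 typos are removed from the denominator. Counting them as non-errors instead, (290,842 + 237) / 293,035, rounds to the same value.

```python
# Counts copied from the coverage statistics; all names are illustrative.
TOTAL_TOKENS = 340_281
NON_ARABIC = 47_246   # punctuation, numbers, Latin strings
ARABIC = 293_035      # Arabic word tokens
ACCURATE = 290_842    # tokens given an accurate analysis and POS tag
COMMENTED = 2_193     # tokens flagged with an annotator Comment

# Annotator Comment categories and their token counts.
COMMENT_COUNTS = {
    "NOT_IN_LEXICON (MISC)": 796,
    "NOT_IN_LEXICON ADJ": 317,
    "NOT_IN_LEXICON NOUN": 279,
    "PASSIVE_FORM": 251,
    "TYPO": 237,
    "DIALECTAL_FORM": 70,
    "NOT_IN_LEXICON PLURAL": 56,
    "IMPERATIVE": 49,
    "NOT_IN_LEXICON FOREIGN WORD": 44,
    "NOT_IN_LEXICON VERB": 24,
    "GRAMMAR_PROBLEM": 13,
    "NOT_IN_LEXICON ADV": 13,
    "NOUN_SHOULD_BE_ADJ": 11,
    "ABBREV": 10,
    "INTERR_PARTICLE": 10,
    "NOT_IN_LEXICON DUAL": 9,
    "ADJ_SHOULD_BE_NOUN": 4,
}


def pct(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def typo_adjusted_accuracy(accurate, arabic, typos):
    """Accuracy with source-text typos dropped from the denominator.

    This reading of the 99.33% figure is an assumption, not stated in the text.
    """
    return pct(accurate, arabic - typos)


def comment_percentages(counts):
    """Map each Comment category to its share of all commented tokens."""
    total = sum(counts.values())
    return {label: pct(n, total) for label, n in counts.items()}


if __name__ == "__main__":
    # The table totals should be internally consistent.
    assert NON_ARABIC + ARABIC == TOTAL_TOKENS
    assert ACCURATE + COMMENTED == ARABIC
    assert sum(COMMENT_COUNTS.values()) == COMMENTED

    print("accurate %:", pct(ACCURATE, ARABIC))
    print("commented %:", pct(COMMENTED, ARABIC))
    print("typo-adjusted accuracy %:",
          typo_adjusted_accuracy(ACCURATE, ARABIC, COMMENT_COUNTS["TYPO"]))
    for label, share in comment_percentages(COMMENT_COUNTS).items():
        print(f"{label:<30}{share:>8.2f}")
```

Keeping the category counts in a single dictionary means the per-category percentages and the 2,193 total are both derived from the same data, so a transcription error in any one row shows up as a failed total check.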