Coverage Statistics

The ATB4 (Arabic Treebank: Part 4 v1.0) corpus contains 161,914 tokens, of which 15,423 are punctuation, numbers, and Latin strings, and 146,491 are Arabic word tokens.

Punctuation, Numbers, Latin strings       15,423
Arabic Word Tokens                       146,491
------------------------------------------------
TOTAL                                    161,914

Of the 146,491 Arabic word tokens, 145,043 (99.01%) were provided with an accurate morphological analysis and POS tag by the Buckwalter Arabic Morphological Analyzer. The remaining 1,448 (0.99%) Arabic word tokens were judged to be incorrectly analyzed, and 736 cases were provided with a Comment describing the nature of the inaccuracy.

Accurately parsed Arabic Word Tokens  145,043   99.01%
Commented Arabic Word Tokens            1,448    0.99%
------------------------------------------------------
TOTAL                                 146,491  100.00%

The 736 Annotator Comments show that the words not provided with a correct morphological analysis were most often typos (137) and passive verb forms (136), followed by words that should have been tagged as adjectives (81) and segmentation problems (71). In 129 cases no specific comment was given (see table below).

============================================
ANNOTATOR COMMENTS ON ITEMS WITH NO SOLUTION
============================================
TYPO                                     137
PASSIVE_FORM                             136
SHOULD_BE_ADJ                             81
SEGMENTATION PROBLEM                      71
SHOULD_BE_NOUN                            56
IMPERATIVE                                44
PLURAL                                    23
GRAMMAR_PROBLEM                           17
ABBREVIATION                              15
MISSING_HAMZA_PROBLEM                     15
DIALECTAL_FORM                            10
A_NAME                                     2
MISC (no comment)                        129
--------------------------------------------
TOTAL                                    736
============================================
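
The counts reported above can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the figures are copied from this section, while the variable names and the pct helper are illustrative assumptions and not part of the corpus distribution or of the Buckwalter analyzer.

```python
# Figures copied from the coverage tables in this section.
# All names below are hypothetical and chosen for illustration only.
non_arabic_tokens = 15_423   # punctuation, numbers, Latin strings
arabic_tokens = 146_491      # Arabic word tokens
accurate = 145_043           # tokens with an accurate analysis and POS tag
flagged = 1_448              # tokens judged incorrectly analyzed

comment_counts = {
    "TYPO": 137,
    "PASSIVE_FORM": 136,
    "SHOULD_BE_ADJ": 81,
    "SEGMENTATION PROBLEM": 71,
    "SHOULD_BE_NOUN": 56,
    "IMPERATIVE": 44,
    "PLURAL": 23,
    "GRAMMAR_PROBLEM": 17,
    "ABBREVIATION": 15,
    "MISSING_HAMZA_PROBLEM": 15,
    "DIALECTAL_FORM": 10,
    "A_NAME": 2,
    "MISC (no comment)": 129,
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


total_tokens = non_arabic_tokens + arabic_tokens

# The two most frequent comment categories, highest count first.
top_two = sorted(comment_counts.items(), key=lambda kv: kv[1], reverse=True)[:2]

print("total tokens:", total_tokens)
print("accurate %:", pct(accurate, arabic_tokens))
print("flagged %:", pct(flagged, arabic_tokens))
print("comment total:", sum(comment_counts.values()))
print("top two categories:", top_two)
```

Running the sketch reproduces the totals of 161,914 tokens and 736 comments and the 99.01% / 0.99% split, and lists TYPO and PASSIVE_FORM as the two largest comment categories.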