COVERAGE STATISTICS

The ATB1v3.0 (Arabic Treebank: Part 1 v3.0) Corpus contains 145,386
tokens, of which 21,591 are punctuation, numbers, and Latin strings,
and 123,795 are Arabic word tokens.

Punctuation, Numbers, Latin strings       21,591
Arabic Word Tokens                       123,795
------------------------------------------------
TOTAL                                    145,386

Of the 123,795 Arabic word tokens, 233 were found to be typos in the
original text. Of the remaining 123,562 Arabic word tokens, 123,192
(99.70%) were provided with an accurate morphological analysis and POS
tag by the Buckwalter Arabic Morphological Analyzer, and 370 (0.30%)
were judged to be incorrectly analyzed, and were flagged with a Comment
describing the nature of the inaccuracy.

Accurately parsed Arabic Word Tokens     123,192      99.70%
Commented Arabic Word Tokens                 370       0.30%
------------------------------------------------------------
TOTAL                                    123,562     100.00%

The Annotator Comments show that most of the words that were not
provided with a correct morphological analysis were words missing in
the Analyzer lexicon, passive verbs, and words with missing hamza
(which could be considered a form of typo).

============================================
ANNOTATOR COMMENTS ON ITEMS WITH NO SOLUTION
============================================
NOT_IN_LEXICON (MISC)        204       55.14
PASSIVE_FORM                  76       20.54
MISSING_HAMZA_PROBLEM         67       18.11
NOUN_SHOULD_BE_ADJ             9        2.43
GRAMMAR_PROBLEM                7        1.89
DIALECTAL_FORM                 5        1.35
ADJ_SHOULD_BE_NOUN             2        0.54
--------------------------------------------
TOTAL                        370      100.00
============================================
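
The totals and percentages reported in this section can be cross-checked
with simple arithmetic. The following is a minimal Python sketch and is
not part of the corpus distribution. The variable names and the pct()
helper are illustrative assumptions; the counts are copied from the
figures given in this section.

```python
# Counts copied from the coverage statistics in this section.
TOTAL_TOKENS = 145_386
NON_ARABIC_TOKENS = 21_591   # punctuation, numbers, Latin strings
ARABIC_TOKENS = 123_795
TYPO_TOKENS = 233

# Annotator comment categories and their counts.
COMMENT_COUNTS = {
    "NOT_IN_LEXICON (MISC)": 204,
    "PASSIVE_FORM": 76,
    "MISSING_HAMZA_PROBLEM": 67,
    "NOUN_SHOULD_BE_ADJ": 9,
    "GRAMMAR_PROBLEM": 7,
    "DIALECTAL_FORM": 5,
    "ADJ_SHOULD_BE_NOUN": 2,
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Arabic word tokens left after setting aside typos in the original text.
evaluated_tokens = ARABIC_TOKENS - TYPO_TOKENS

# Tokens flagged with a Comment, and tokens analyzed accurately.
commented_tokens = sum(COMMENT_COUNTS.values())
accurate_tokens = evaluated_tokens - commented_tokens

# Share of each comment category among all commented tokens.
comment_shares = {
    name: pct(count, commented_tokens)
    for name, count in COMMENT_COUNTS.items()
}

if __name__ == "__main__":
    print("evaluated:", evaluated_tokens)
    print("accurate: ", accurate_tokens, pct(accurate_tokens, evaluated_tokens))
    print("commented:", commented_tokens, pct(commented_tokens, evaluated_tokens))
    for name, share in comment_shares.items():
        print(f"{name:<24}{COMMENT_COUNTS[name]:>8}{share:>12.2f}")
```

Running the script prints the recomputed figures in the same column
order as the Annotator Comments table, so any discrepancy with the
reported values is easy to spot.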