Introduction
Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2004T02 and ISBN 1-58563-282-1.
This publication is the second part of a 1,000,000-word Arabic Treebank corpus designed to support language research and the development of language technology for Modern Standard Arabic.
Part one was released in 2003 as Arabic Treebank: Part 1 v 2.0, with source data extracted from Agence France-Presse stories. The current corpus, Arabic Treebank: Part 2 v 2.0, consists of stories from Al-Hayat distributed by Ummah.
Data
This corpus includes 501 stories from the Ummah Arabic News Text, stored as
501 files with one story per file. The files contain a total of 144,199 words
(counting non-Arabic tokens such as numbers and punctuation). New annotation
features include complete vocalization (including case endings), lemma IDs, and
more specific POS tags for verbs and particles.
The corpus contains 125,698 Arabic-only word tokens (prior to the
separation of clitics), of which 124,740 (99.24%) were provided with an
acceptable morphological analysis and POS tag by the morphological parser,
and 958 (0.76%) were items that the morphological parser failed to analyze
correctly.
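The coverage figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the corpus distribution; the variable names and the `percentage` helper are illustrative assumptions, and only the three token counts come from the description above.

```python
# Token counts as reported in the corpus description above.
arabic_tokens = 125_698   # Arabic-only word tokens, before clitic separation
analyzed = 124_740        # tokens given an acceptable analysis and POS tag
failed = 958              # tokens the parser failed to analyze correctly


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals.

    Hypothetical helper for illustration only.
    """
    return round(100 * part / whole, 2)


# The two categories should account for every Arabic-only token.
total_matches = (analyzed + failed == arabic_tokens)
analyzed_pct = percentage(analyzed, arabic_tokens)
failed_pct = percentage(failed, arabic_tokens)

print(total_matches, analyzed_pct, failed_pct)
```

The two categories sum to the stated total, and the rounded shares agree with the reported 99.24% and 0.76%.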
Updates
There are no updates available at this time.
Content Copyright
Portions © 2002 Ummah Press, © 2004 Trustees of the University of Pennsylvania