Introduction
To support the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content extraction
(topic extraction and/or grammar extraction), cross-lingual information
retrieval, information detection, and other forms of linguistic research on
Modern Standard Arabic in general, the LDC was sponsored to develop an
Arabic Treebank of 1,000,000 words. This corpus is a re-release of part
one of that project, with the addition in Version 3.0 of improved
morphological/part-of-speech annotation (including full vocalization and
case endings).
Data
The project targets the description of a written Modern Standard Arabic
corpus from the Agence France Presse (AFP) newswire archives for
July-November 2000 (files dated 2000/7/15 to 2000/11/15). This corpus
includes 734 stories representing 145,386 words (166,068 tokens after
clitic segmentation in the Treebank; the number of Arabic tokens is
123,796). For this work, annotators must be native speakers of Arabic,
and they must understand enough linguistics to check morphosyntactic
analysis and build syntactic structures.
Samples
To see an example of this corpus, please examine the following samples:
Content Copyright
Portions © 2000 Agence France Presse, Portions © 2005 Trustees of the University of Pennsylvania