Introduction
Arabic Treebank: Part 1 (ATB1) v 4.1, Linguistic Data Consortium
(LDC) catalog number LDC2010T13 and ISBN 1-58563-566-9, was developed
at LDC. It consists of 734 newswire stories from Agence France Presse
(AFP) with part-of-speech (POS), morphology, gloss and syntactic
treebank annotation in accordance with the Penn Arabic Treebank
(PATB) Guidelines developed in 2008 and 2009. This
release represents a significant revision of LDC's previous ATB1
publications: Arabic Treebank: Part 1 v 2.0 (LDC2003T06) and
Arabic Treebank: Part 1 v 3.0 (POS with full vocalization +
syntactic analysis) (LDC2005T02).
The ongoing PATB project supports research in Arabic-language natural
language processing and human language technology development. The
methodology and work leading to the release of this publication are
described in detail in the documentation accompanying this corpus and
in two research papers: "Enhancing the Arabic Treebank: A
Collaborative Effort toward New Annotation Guidelines" and
"Consistent and Flexible Integration of Morphological Annotation in
the Arabic Treebank."
Data
ATB1 v 4.1 contains a total of 145,386 tokens before clitics are
split, and 167,280 tokens after clitics are separated for the
treebank annotation.
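As a rough illustration of what the two figures above imply, the short Python sketch below computes how many additional tokens clitic separation produces and the corresponding percentage increase. The function and constant names are hypothetical and are not part of the corpus or its tooling; only the two token counts come from this description.

```python
# Token counts quoted in the Data section of this description.
TOKENS_BEFORE_SPLIT = 145_386  # source tokens, clitics still attached
TOKENS_AFTER_SPLIT = 167_280   # tree tokens, clitics separated


def clitic_split_stats(before: int, after: int) -> tuple[int, float]:
    """Return (extra tokens, percent increase) produced by clitic separation.

    Hypothetical helper for illustration only.
    """
    extra = after - before
    return extra, 100.0 * extra / before


extra, pct = clitic_split_stats(TOKENS_BEFORE_SPLIT, TOKENS_AFTER_SPLIT)
print(f"{extra} additional tokens ({pct:.1f}% increase)")
```

In other words, separating clitics adds 21,894 tokens, roughly 15% more than the unsplit count.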
Sponsorship
This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or the
policy of the Government, and no official endorsement should be
inferred.
Samples
For an example of the data in this corpus, please review the
sample file.
Updates
No updates have been issued as of this time.
Content Copyright
Portions © 2000 Agence France Presse, © 2003, 2005, 2010
Trustees of the University of Pennsylvania
Contact: ldc@ldc.upenn.edu
© 2010 Linguistic Data Consortium, Trustees of the
University of Pennsylvania. All Rights Reserved.