| Application(s): | cross-lingual information retrieval, information extraction, information retrieval, language modeling, language teaching, machine translation, parsing, tagging |
Introduction
The Prague Arabic Dependency Treebank (PADT) consists of
multi-level linguistic annotations of Modern Standard Arabic and also provides a variety of unique software
implementations designed for general use in Natural Language Processing (NLP).
The PADT project is an open-ended activity of
the Center for Computational Linguistics, the Institute of
Formal and Applied Linguistics, and the Institute of Comparative
Linguistics, Charles University in Prague. It consists
in multi-level annotation of Arabic language resources in
the light of the theory of Functional Generative Description.
The project is a younger sibling of the Prague Dependency Treebank
for Czech, and is maintained in
co-operation with the Linguistic Data Consortium, University
of Pennsylvania, which releases non-annotated corpora of
Arabic newswire and develops the independent Penn Arabic
Treebank.
Data Survey
The corpus of PADT 1.0 consists of morphologically
and analytically annotated newswire texts of Modern Standard
Arabic, which originate from the Arabic
Gigaword and the plain data of Penn Arabic
Treebank, Part 1 and Penn Arabic
Treebank, Part 2.
The PADT 1.0 distribution comprises over
113,500 tokens of data annotated analytically and provided
with disambiguated morphological information. In addition,
the release includes complete MorphoTrees annotations
of more than 148,000 tokens, 49,000
of which have also received analytical processing. The
contents are further divided into data sets as indicated in the table below.
| Data Set | Tokens [A] | Tokens [M] | Tokens/Para | Tokens/Doc | Original Data Provider | News Period | Related Corpora |
|---|---|---|---|---|---|---|---|
| AFP | 13,000 | N/A | 34.6 [N/A] | 260 [N/A] | Agence France Presse | July 2000 | Penn ATB Part 1 |
| UMH | 38,500 | N/A | 43.6 [N/A] | 290 [N/A] | Ummah Press Service | Spring 2002 | Penn ATB Part 2 |
| XIN | 13,500 | N/A | 31.2 [N/A] | 155 [N/A] | Xinhua News Agency | May 2003 | Arabic Gigaword |
| ALH | 10,000 | 73,500 | 47.0 [47.8] | 405 [405] | Al Hayat News Agency | September 2001 | Arabic Gigaword |
| ANN | 12,500 | 25,500 | 60.3 [50.3] | 740 [630] | An Nahar News Agency | November 2002 | Arabic Gigaword |
| XIA | 26,500 | 49,500 | 29.7 [25.9] | 235 [205] | Xinhua News Agency | May 2003 | Arabic Gigaword |
In the table, the token counts give the number of syntactic units annotated [A] analytically and [M] within MorphoTrees.
The next two columns give approximate ratios of tokens per paragraph and tokens per document for the analytical annotation, with the corresponding MorphoTrees values in square brackets.
The selected documents may cover only a few days of the stated news period.
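As a quick consistency check, the rounded per-set token counts in the table can be summed and compared with the totals quoted in the text. The sketch below is only illustrative: the variable names are hypothetical, the figures are the rounded table values rather than exact corpus counts, and treating the analytical tokens of ALH, ANN and XIA as the analytically processed part of the MorphoTrees data is an assumption that happens to match the 49,000 figure.

```python
# Rounded figures copied from the table; None stands for N/A.
# The names below are illustrative and not part of the PADT distribution.
DATA_SETS = {
    # name: (tokens annotated analytically [A], tokens in MorphoTrees [M])
    "AFP": (13_000, None),
    "UMH": (38_500, None),
    "XIN": (13_500, None),
    "ALH": (10_000, 73_500),
    "ANN": (12_500, 25_500),
    "XIA": (26_500, 49_500),
}

# Total of all analytically annotated tokens.
analytical_total = sum(a for a, _ in DATA_SETS.values())

# Total of all tokens annotated within MorphoTrees.
morphotrees_total = sum(m for _, m in DATA_SETS.values() if m is not None)

# Assumption: analytical tokens in the sets that also carry MorphoTrees.
both_total = sum(a for a, m in DATA_SETS.values() if m is not None)

print(analytical_total)   # 114000
print(morphotrees_total)  # 148500
print(both_total)         # 49000
```

Because the table values are rounded, the sums agree with the quoted totals (over 113,500 and more than 148,000 tokens) only approximately.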
Samples
Preview of paragraph morphology tree.
New analytical rendering style.
Support
PADT 1.0 was supported by the Ministry of Education of the Czech Republic, projects
LN00A063 and MSM113200006, and by the Grant Agency of the Czech Republic, project
405/02/0823.
Updates
Updates or bug fixes may be available in the LDC catalog entry
for this corpus, or at the PADT website.
Your questions and suggestions are welcome at
padt (at) ckl (dot) mff (dot) cuni (dot) cz. |