Introduction
This file contains documentation on the English-Arabic Parallel Treebank v 1.0, Linguistic
Data Consortium (LDC) catalog number LDC2006T10, ISBN 1-58563-387-9.
This release of the English-Arabic Treebank consists of 52,238 words in 224
files of individual Agence France Presse (AFP) news stories (corresponding
to approximately the first 50K words of the Arabic Treebank: Part 1 v 3.0
-- LDC Catalog No.: LDC2005T02, ISBN: 1-58563-330-5). The English
translation was provided by LDC, and was part-of-speech tagged and
treebanked for this project.
Data
The guidelines followed for both part-of-speech and treebank annotation
are essentially Penn Treebank II style, with two notable differences:
- POS: tokenization of hyphenated items ("New York-based" has been
replaced by "New York - based" for example), and the addition of HYPH
and AFX tags necessitated by this change in tokenization
- TreeBank: the addition of the node label NML for sub-NP nominal
constituents (replacing NX and most NP-internal NAC)
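The hyphen tokenization change can be illustrated with a minimal sketch. The helper names split_hyphens and tag_hyphens are hypothetical and are not part of the corpus or of any LDC tool. The rule used here (every hyphen between two word characters becomes its own token) is an assumption that only approximates the guidelines, which may treat some hyphenated items differently. The AFX tag and all other part-of-speech tags are left out.

```python
import re

# Hypothetical helpers for illustration only; not shipped with the corpus.

def split_hyphens(text):
    """Split text into tokens, making each word-internal hyphen its own token."""
    # Assumption: any hyphen flanked by word characters is separated out.
    spaced = re.sub(r"(?<=\w)-(?=\w)", " - ", text)
    return spaced.split()

def tag_hyphens(tokens):
    """Pair each token with HYPH if it is a stand-alone hyphen, else None."""
    return [(tok, "HYPH" if tok == "-" else None) for tok in tokens]

tokens = split_hyphens("a New York-based company")
print(tokens)
print(tag_hyphens(tokens))
```

Under these assumptions, "New York-based" comes out as the three tokens "York", "-", "based" after "New", matching the "New York - based" form described above, and only the hyphen token receives a tag.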
Samples
For an example of the data in this corpus, please review this text sample.
Copyright
Portions © 2000 Agence France Presse, © 2006 Trustees of the University of Pennsylvania