


Arabic Treebank: Part 2 v 2.0

Item Name: Arabic Treebank: Part 2 v 2.0
Authors: Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Hubert Jin
LDC Catalog No.: LDC2004T02
ISBN: 1-58563-282-1
Release Date: Jan 30, 2004
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): automatic content extraction, cross-lingual information retrieval, information detection, natural language processing
Language(s): Modern Standard Arabic
Language ID(s): arb
Distribution: 1 CD
Member fee: $0 for 2004 members
Non-member Fee: US $4000.00
Reduced-License Fee: US $2000.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Mohamed Maamouri, et al. 2004. Arabic Treebank: Part 2 v 2.0. Linguistic Data Consortium, Philadelphia.

Introduction

Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2004T02 and its ISBN is 1-58563-282-1.

This publication is the second part of a 1,000,000-word Arabic Treebank corpus designed to support language research and the development of language technology for Modern Standard Arabic. Part one was released in 2003 as Arabic Treebank: Part 1 v 2.0, with source data drawn from Agence France-Presse stories. The current corpus, Arabic Treebank: Part 2 v 2.0, consists of stories from Al-Hayat distributed by Ummah.

Data

This corpus includes 501 stories from the Ummah Arabic News Text, stored one story per file. The 501 files contain a total of 144,199 words (counting non-Arabic tokens such as numbers and punctuation). New annotation features include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.

The corpus contains 125,698 Arabic-only word tokens (prior to the separation of clitics), of which 124,740 (99.24%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 958 (0.76%) were items that the morphological parser failed to analyze correctly.
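The coverage figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names and the `pct` helper are illustrative assumptions and are not part of the corpus or its tools. Only the three token counts come from the description above.

```python
# Token counts quoted in the corpus description (Arabic-only word tokens,
# counted before clitic separation).
total_tokens = 125_698
analyzed = 124_740   # given an acceptable morphological analysis and POS tag
failed = 958         # not analyzed correctly by the morphological parser

# The two categories should account for every Arabic-only token.
assert analyzed + failed == total_tokens


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimal places."""
    return round(100 * part / whole, 2)


# Hypothetical names for the two reported rates.
coverage = pct(analyzed, total_tokens)
failure_rate = pct(failed, total_tokens)

print(f"analyzed: {coverage}%  failed: {failure_rate}%")
```

The two rates agree with the 99.24% and 0.76% quoted in the description, and the two counts sum to the stated total.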

Updates

There are no updates available at this time.

Content Copyright

Portions © 2001-2002 Ummah Press, © 2004 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.