Introduction
Arabic-Dialect/English Parallel Text was developed by Raytheon BBN Technologies
(BBN), LDC and Sakhr Software and contains approximately 3.5 million tokens of
Arabic dialect sentences and their English translations.
Data
The data in this corpus consists of Arabic web text as follows:
1. Filtered automatically from large Arabic text corpora harvested from the
web by LDC. The LDC corpora consisted largely of weblogs and online user groups
and amounted to around 350 million Arabic words. Documents that contained a
large percentage of non-Arabic or Modern Standard Arabic (MSA) words were eliminated.
A list of dialect words was manually selected by culling through the Levantine
Fisher (LDC2005S07, LDC2005T03, LDC2007S02 and LDC2007T04) and Egyptian CALLHOME
(LDC97S45, LDC2002S37, LDC97T19 and LDC2002T38) speech corpora
distributed by LDC. That list was then used to retain documents that contained
a certain number of matches. The resulting subset of the web corpora contained
around four million words. Documents were automatically segmented into passages
using formatting information from the raw data.
2. Manually harvested by Sakhr Software from Arabic dialect web sites.
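The automatic filtering described in item 1 can be illustrated with a minimal sketch. It is not the code used to build the corpus: the function name, the whitespace tokenization, the MSA word set and both thresholds are assumptions made for illustration, since the release notes give neither the exact percentages nor the required number of matches.

```python
import re

# Tokens made up entirely of characters in the basic Arabic Unicode block.
ARABIC_TOKEN = re.compile(r"^[\u0600-\u06FF]+$")


def keep_document(text, dialect_words, msa_words,
                  max_rejected_ratio=0.7, min_dialect_matches=2):
    """Return True if a document passes both filtering steps.

    All parameters and default thresholds are hypothetical.
    """
    tokens = text.split()  # naive whitespace tokenization (assumption)
    if not tokens:
        return False

    # Step 1: eliminate documents dominated by non-Arabic or MSA words.
    rejected = sum(
        1 for t in tokens
        if not ARABIC_TOKEN.match(t) or t in msa_words
    )
    if rejected / len(tokens) > max_rejected_ratio:
        return False

    # Step 2: retain documents with enough matches against the dialect list.
    matches = sum(1 for t in tokens if t in dialect_words)
    return matches >= min_dialect_matches


# Toy word lists, invented for the example.
dialect_words = {"ايش", "مش"}
msa_words = {"ماذا", "ليس"}

print(keep_document("ايش هذا مش كويس", dialect_words, msa_words))
print(keep_document("this is an english page with ايش", dialect_words, msa_words))
```

With these toy lists, the first document is kept and the second is dropped because most of its tokens are non-Arabic.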
Dialect classification and sentence segmentation, as needed, and translation
into English were performed by BBN through Amazon's
Mechanical Turk. Arabic annotators from Mechanical Turk classified filtered
passages as being either MSA or one of four regional dialects: Egyptian, Levantine,
Gulf/Iraqi or Maghrebi. An additional "General" dialect option was
allowed for ambiguous passages. The classification was applied to whole passages
rather than individual sentences. Only the passages labeled Levantine and Egyptian
were further processed. The segmented Levantine and Egyptian sentences were
then translated. Annotators were instructed to translate completely and accurately
and to transliterate Arabic names. They were also provided with examples. All
segments of a passage were presented in the same translation task to provide
context.
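The selection and grouping steps described above can be sketched as follows. This is only an illustration under stated assumptions: the label strings, the passage identifiers and the `build_translation_tasks` helper are hypothetical and do not reflect the actual Mechanical Turk setup used by BBN.

```python
from collections import defaultdict

# Only passages with these labels were processed further, per the description.
TRANSLATED_DIALECTS = {"Levantine", "Egyptian"}


def build_translation_tasks(passage_labels, segments):
    """Group sentence segments into one translation task per passage.

    passage_labels: dict mapping a passage id to its passage-level label.
    segments: iterable of (passage_id, sentence) pairs in document order.
    Both structures are hypothetical.
    """
    tasks = defaultdict(list)
    for passage_id, sentence in segments:
        # The label applies to the whole passage, not to single sentences.
        if passage_labels.get(passage_id) in TRANSLATED_DIALECTS:
            tasks[passage_id].append(sentence)
    return dict(tasks)


labels = {"p1": "Egyptian", "p2": "MSA", "p3": "Levantine", "p4": "General"}
segments = [("p1", "s1"), ("p2", "s2"), ("p1", "s3"), ("p3", "s4"), ("p4", "s5")]
print(build_translation_tasks(labels, segments))
```

Keeping all segments of a passage in one task mirrors the stated goal of giving translators context.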
Samples
Please follow this link
for a sample of the data in this release.
Updates
None at this time.
Content Copyright
Portions © 2012 Raytheon BBN Technologies, © 2012 Sakhr Software,
© 1997, 2002, 2005-2009, 2012 Trustees of the University of Pennsylvania