Introduction
MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase 1 Training Set contains all training data created by the Linguistic Data Consortium (LDC) to support
Phase 1 of the DARPA MADCAT Program. The material in this release consists of handwritten
Arabic documents, scanned at high resolution and annotated for the physical
coordinates of each line and token. Digital transcripts and English translations
of each document are also provided, with the various content and annotation
layers integrated in a single MADCAT XML output.
The goal of the MADCAT program is to automatically convert foreign text images
into English transcripts. MADCAT Phase 1 data was collected by LDC from Arabic source
documents in three genres: newswire, weblog and newsgroup text. Arabic-speaking
scribes copied documents by hand, following specific instructions on writing
style (fast, normal, careful), writing implement (pen, pencil) and paper (lined,
unlined). Prior to assignment, source documents were processed to optimize their
appearance for the handwriting task, which resulted in some original source
documents being broken into multiple pages for handwriting. Each resulting
handwritten page was assigned to up to five independent scribes, using different
writing conditions.
The handwritten, transcribed documents were checked for quality and
completeness, then each page was scanned at a high resolution (600 dpi, greyscale)
to create a digital version of the handwritten document. The scanned images
were then annotated to indicate the physical coordinates of each line and token.
Explicit reading order was also labeled, along with any errors produced by the
scribes when copying the text.
The final step was to produce a unified data format that takes multiple
data streams and generates a single XML output file containing all
required information. The resulting XML file has four distinct components: a text
layer that consists of the source text, tokenization and sentence
segmentation; an image layer that consists of bounding boxes; a scribe
demographic layer that consists of scribe ID and partition (train/test);
and a document metadata layer.
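Because the element names used inside a MADCAT XML file are not listed here, one cautious way to get oriented is to count the element tags in a file before writing a real parser. The sketch below is a schema-agnostic illustration using only the Python standard library. The function name tag_inventory and the toy element names (doc, layer, item) are hypothetical and are not taken from the MADCAT format.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tag_inventory(source):
    """Count element tags in an XML document.

    `source` is a file path or a binary file-like object.
    No assumption is made about which tags exist in the file.
    """
    counts = Counter()
    for elem in ET.parse(source).getroot().iter():
        # Drop a leading '{namespace}' prefix if the parser added one.
        tag = elem.tag.rsplit("}", 1)[-1]
        counts[tag] += 1
    return counts


# Toy document with made-up element names, standing in for a real file.
toy = io.BytesIO(b"<doc><layer><item/><item/></layer></doc>")
for tag, n in sorted(tag_inventory(toy).items()):
    print(tag, n)
```

Running the same function on an actual .madcat.xml file would show which elements carry the text, image, scribe and metadata layers described above.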
Data
This release includes 9693 annotation files in MADCAT XML format
(.madcat.xml) along with their corresponding scanned image files in TIFF
format.
Files are named as follows:
- galeID_page#_scribeID.{tif|madcat.xml}
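The naming pattern above can be split back into its three fields when pairing images with annotation files. The following is a minimal sketch, not an official tool: it assumes that the page and scribe fields contain no underscores (the galeID may), so it splits from the right. The names MadcatName and parse_madcat_name, and the example filename, are hypothetical.

```python
from typing import NamedTuple


class MadcatName(NamedTuple):
    gale_id: str
    page: str
    scribe_id: str
    kind: str  # "annotation" for .madcat.xml, "image" for .tif


# Extensions stated in the naming pattern, mapped to a file kind.
SUFFIXES = {".madcat.xml": "annotation", ".tif": "image"}


def parse_madcat_name(filename: str) -> MadcatName:
    """Split 'galeID_page#_scribeID.{tif|madcat.xml}' into its fields."""
    for suffix, kind in SUFFIXES.items():
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        raise ValueError(f"unrecognized extension: {filename!r}")

    # Assumption: only the galeID may contain underscores, so the last
    # two underscore-separated fields are the page and the scribe ID.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected name layout: {filename!r}")
    gale_id, page, scribe_id = parts
    return MadcatName(gale_id, page, scribe_id, kind)


# Hypothetical filename, used only to illustrate the split.
print(parse_madcat_name("DOC_0001_001_scribe7.madcat.xml"))
```

Grouping parsed names by (gale_id, page) would then collect the versions of one page written by different scribes.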
Samples
Please follow the links for
image and XML samples.
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR0011-08-1-004 and GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
None at this time.
Content Copyright
Portions © 2007 Al-Ahram, Al Hayat, Al Quds - Al Arabi, Asharq Al-Awsat,
An Nahar, Assabah, © 2007-2010, 2012 Trustees of the University of Pennsylvania