BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts was produced by the Linguistic Data
Consortium (LDC) under catalog number LDC2005S08 and ISBN 1-58563-296-1.
This corpus consists of transcribed, spontaneous speech, recorded from
subjects speaking in Levantine colloquial Arabic. Levantine Arabic is
the dialect of Arabic spoken by ordinary people in Lebanon, Jordan,
Syria, and Palestine. It is significantly different from Modern
Standard Arabic (MSA), in that it is a spoken rather than a written
language. It includes different word pronunciations, and even
different words, from Modern Standard Arabic, the written and
"official" form of Arabic.
The corpus was developed with funding from the Defense Advanced
Research Projects Agency (DARPA), as part of the Babylon program. The
Babylon program is intended to advance the state of the art in
speech-to-speech translation systems, both by creating new technology
and by developing systems for field use. BBN
was funded under Babylon to develop a limited English/Arabic
refugee/medical speech translation system for a handheld computer, and
collected this corpus as part of its work. The corpus would be useful
for anyone attempting to do speech recognition in Levantine colloquial
Arabic, including for speech translation and spoken dialog systems.
An audio sample and transcription are provided as an example of this corpus.
Portions © 2003 BBNT Solutions LLC, © 2004 Trustees of the University of Pennsylvania