


BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts

Item Name: BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
Authors: BBN Technologies (with American University of Beirut a subcontractor): John Makhoul, Bushra Zawaydeh, Frederick Choi, and David Stallard
LDC Catalog No.: LDC2005S08
ISBN: 1-58563-296-1
Release Date: Jan 15, 2005
Data Type: speech
Sample Rate: 16000 Hz
Sampling Format: pcm
Data Source(s): microphone speech
Project(s): DARPA-CSR, EARS, GALE
Application(s): machine translation, speech recognition, spoken dialogue systems
Language(s): Levantine Arabic, North Levantine Arabic, South Levantine Arabic
Language ID(s): AJP, APC
Distribution: 2 DVD
Member fee: $0 for 2005 members
Non-member Fee: US $2500.00
Reduced-License Fee: US $1250.00
Extra-Copy Fee: US $400.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: BBN Technologies (with American University of Beirut a subcontractor): John Makhoul, et al.
2005
BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
Linguistic Data Consortium, Philadelphia

Introduction

BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts is released by the Linguistic Data Consortium (LDC) under catalog number LDC2005S08 and ISBN 1-58563-296-1.

This corpus consists of transcribed, spontaneous speech recorded from subjects speaking Levantine colloquial Arabic. Levantine Arabic is the dialect of Arabic spoken by ordinary people in Lebanon, Jordan, Syria, and Palestine. It differs significantly from Modern Standard Arabic (MSA), the written and "official" form of Arabic, in that it is a spoken rather than a written language. It has different word pronunciations, and even different words, from MSA.

The corpus was developed with funding from the Defense Advanced Research Projects Agency (DARPA), as part of the Babylon program. The Babylon program is intended to advance the state of the art in speech-to-speech translation systems, both by creating new technology and by developing systems for field use. More information on the Babylon program may be found at this site. BBN was funded under Babylon to develop a limited English/Arabic refugee/medical speech translation system for a handheld computer, and collected this corpus as part of that work. The corpus should be useful to anyone working on speech recognition for Levantine colloquial Arabic, including for speech translation and spoken dialog systems.

Samples

An audio sample and transcription are provided as an example of this corpus.

Copyright

Portions © 2003 BBNT Solutions LLC, © 2004 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.