ARL Urdu Speech Database, Training Data

Item Name: ARL Urdu Speech Database, Training Data
Authors: Appen Pty Ltd, Sydney, Australia
LDC Catalog No.: LDC2007S03
ISBN: 1-58563-412-3
Release Date: Feb 20, 2007
Data Type: speech
Sample Rate: 22050 Hz
Sampling Format: pcm
Data Source(s): microphone speech
Language(s): Urdu
Language ID(s): urd
Distribution: 8 DVD, Web Download
Member fee: $0 for 2007 members
Non-member Fee: US $4000.00
Reduced-License Fee: US $2000.00
Extra-Copy Fee: US $1600.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Appen Pty Ltd, Sydney, Australia. 2007. ARL Urdu Speech Database, Training Data. Linguistic Data Consortium, Philadelphia.

Introduction

This file contains documentation for ARL Urdu Speech Database, Training Data, Linguistic Data Consortium (LDC) catalog number LDC2007S03 and ISBN 1-58563-412-3.

The recordings in this release were collected by Appen Pty Ltd, Sydney, Australia in 2006. The U.S. Army Research Laboratory (ARL) provided this corpus to the LDC for distribution.

Urdu is an Indo-Aryan language spoken throughout South Asia that developed under the Mughal Empire and Delhi Sultanate between 1200 AD and 1800 AD. It has Persian, Turkish and Arabic influences, but is in fact a dialect of Hindustani. The word "Urdu" refers to the standardized register of Hindustani, but there are many non-standard idiolects as well. Urdu is the twentieth most spoken language in the world. It is the native language of over 60 million people, the official language of Pakistan, and one of India's national languages. Urdu is also spoken in Afghanistan.

The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India. The distribution of speaker dialects is as follows:

Accent                Number of Speakers
South Sindh           29
North Sindh           30
South Punjab          27
North Punjab          29
Capital Area          29
North West Regions    30
Baluchistan           26

The database is divided into two parts: a training set containing approximately 80% of the data and a test set comprising approximately 20% of the data. This release contains the training set, which is approximately 80% of the complete dataset (training and test combined).

Data

Each speaker was presented with 400 prompts to read: sentences, place names, and person names. Two microphones set at different distances from the speaker were used for the recordings. The recorded speech was stored as raw-format files, with the headers stored in separate directories.
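Because the audio is stored as raw PCM without in-file headers, common audio tools usually need the sample parameters supplied by hand. The following is a minimal sketch, not part of the corpus documentation, that wraps raw samples in a WAV container. The 22050 Hz rate comes from the catalog entry above; the 16-bit sample width, single channel, and little-endian byte order are assumptions that should be checked against the corpus header files, and the function names are hypothetical.

```python
import io
import wave

SAMPLE_RATE_HZ = 22050  # from the catalog entry ("Sample Rate: 22050 Hz")


def raw_pcm_to_wav_bytes(raw, sample_rate=SAMPLE_RATE_HZ, sample_width=2, channels=1):
    """Wrap headerless PCM samples in a WAV container and return the bytes.

    sample_width=2 (16-bit) and channels=1 (mono) are ASSUMPTIONS, not
    values stated by the corpus; the WAV format also implies little-endian
    samples. Verify all three against the separate header files.
    """
    frame_size = sample_width * channels
    if len(raw) % frame_size != 0:
        raise ValueError("raw data length is not a whole number of frames")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_out:
        wav_out.setnchannels(channels)
        wav_out.setsampwidth(sample_width)
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(raw)
    return buf.getvalue()


def duration_seconds(raw, sample_rate=SAMPLE_RATE_HZ, sample_width=2, channels=1):
    """Duration of a raw PCM buffer under the same assumed parameters."""
    return len(raw) / (sample_rate * sample_width * channels)
```

To convert a file on disk, read it in binary mode and write the returned bytes to a new `.wav` file. If the converted audio sounds like noise, the assumed sample width or byte order is probably wrong.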

Each recording has a corresponding label file containing the transcription of the utterance. The transcriptions are encoded in UTF-8. Punctuation was omitted and numbers were written out in full.
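The transcription conventions above (UTF-8, no punctuation, numbers spelled out) can be spot-checked after decoding a label file. The sketch below is illustrative only: it assumes a label file holds plain transcript text, which the documentation here does not confirm, and the function names are hypothetical. It uses Unicode general categories to flag punctuation and decimal digits, including Urdu ones such as "۔" and "۳".

```python
import unicodedata


def decode_label(data):
    """Decode label-file bytes as UTF-8 and trim surrounding whitespace.

    ASSUMPTION: the file contains only the transcript text.
    """
    return data.decode("utf-8").strip()


def find_unexpected_chars(transcript):
    """Return, in order, characters that are punctuation or decimal digits.

    Unicode categories starting with "P" are punctuation; "Nd" is a
    decimal digit in any script.
    """
    unexpected = []
    for ch in transcript:
        category = unicodedata.category(ch)
        if category.startswith("P") or category == "Nd":
            unexpected.append(ch)
    return unexpected
```

An empty result means the transcript is consistent with the stated conventions. A non-empty result may only indicate that the label file carries extra fields beyond the transcript.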

Update

Earlier versions were missing the content list file. This is now available as a download. Please contact the LDC membership office to receive instructions for download.

Samples

For an example of the data in this corpus, please listen to this audio sample (.wav format).

Content Copyright

Portions © 2006 U.S. Army Research Laboratory, © 2007 Trustees of the University of Pennsylvania


Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.