Introduction
This file contains documentation for the LDC Spoken Language Sampler, Linguistic
Data Consortium catalog number LDC2008S08 and ISBN 1-585630-495-6.
The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes
a wide and growing assortment of resources for researchers, engineers and educators
whose work is concerned with human languages. Historically, most linguistic
resources were not generally available to interested researchers but were restricted
to single laboratories or to a limited number of users. Inspired by the success
of selected readily available and well-known data sets, such as the Brown University
text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale
corpus development and sharing of resources. As of 2008, LDC is a growing consortium
of more than 100 companies, universities, and government members
and has distributed over 50,000 corpora to a global audience. With the support
of its members, LDC is able to provide critical services to the language research
community. These services include: maintaining the data archives, producing
and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating
intellectual property agreements with potential information providers and would-be
members, and maintaining relations with other like-minded groups around the
world. Resources available from LDC (http://www.ldc.upenn.edu) include speech, text
and video data and lexicons in multiple languages, as well as software tools
to facilitate the use of corpus materials.
Data
The LDC Spoken Language Sampler provides speech, transcript and lexicon
samples that illustrate the variety and breadth of the resources available
from the LDC Publication Catalog. The samples differ from the published
corpora in the following ways:
- most excerpts are truncated to be much shorter than the original files,
typically one minute and thirty seconds of speech;
- signal amplitude has been adjusted where necessary to normalize playback
volume;
- some corpora are published in compressed form, but all samples here are
uncompressed;
- LDC typically uses NIST SPHERE file format for audio data, but the audio
files in this sampler have been converted to MS-WAV/audio (RIFF) file format
for compatibility with typical browser audio utilities.
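Because the samples are distributed as RIFF/WAV files, their basic properties can be checked with standard tools. The following is a minimal sketch using Python's standard `wave` module; the function name `wav_summary` and the example file name are illustrative assumptions, not part of the sampler. The `wave` module reads linear PCM data, so a file stored in another encoding would raise `wave.Error`.

```python
import wave


def wav_summary(path):
    """Return basic properties of a linear-PCM RIFF/WAV file as a dict."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration in seconds is the frame count divided by the frame rate.
            "duration_s": frames / rate if rate else 0.0,
        }


# Hypothetical usage; "sample.wav" is a placeholder, not a real sampler file name:
# print(wav_summary("sample.wav"))
```

A sample that follows the description above should report a duration of roughly 90 seconds or less.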
The sampler includes samples from the following corpora and lexicons. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.
| Corpus or lexicon | Description |
|---|---|
| An English Dictionary of the Tamil Verb | This dictionary contains translations for over 6000 English verbs and defines over 9000 Tamil verbs. Entries include the English word, the Tamil equivalent in transliteration and Tamil script, and audio examples in Spoken Tamil pronunciation. |
| CALLFRIEND Farsi | A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi. |
| CALLFRIEND Tamil | A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil. |
| CALLHOME Japanese | A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts. |
| CALLHOME Spanish | A corpus of 120 unscripted telephone conversations between native Spanish speakers and a corpus of associated transcripts. |
| CSLU Kids' Speech | Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10. |
| Fisher Levantine Arabic | A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities. |
| Grassfields Bantu Fieldwork: Dschang Tone Paradigms | Tone paradigms from Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon. |
| Gulf Arabic Conversational Telephone Speech | Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions. |
| Korean Telephone Speech | Collection of 100 telephone conversations between native Korean speakers and their transcriptions. |
| Mawukakan Lexicon | The first publication of an ongoing project aiming to build an electronic dictionary of four Mandekan languages (Eastern Manding languages of the Mande Group of the Niger-Congo family). |
| Nationwide Speech Project | A database of speech representing current regional accents and dialects of the United States. |
| NIST Pilot Meeting Speech | Collects speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place. |
| West Point Russian Speech | Utterances of sentences in Russian from 1,891 native and non-native speakers. |
How to Obtain
The LDC Spoken Language Sampler may be downloaded freely. The sampler is distributed as a gzip-compressed (GNU zip) tar file, which most compression utilities will readily extract.
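For readers who prefer a script to a desktop utility, the extraction step can be sketched with Python's standard `tarfile` module. This is only an illustration: the function name `extract_sampler` and the archive and directory names in the usage comment are assumptions, since the actual download file name is not given here. The sketch also checks that no archive member would be written outside the destination directory before extracting.

```python
import os
import tarfile


def extract_sampler(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            # Refuse members whose resolved path would leave dest_root.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest_root)
    return [member.name for member in members]


# Hypothetical usage; both names below are placeholders:
# names = extract_sampler("sampler.tar.gz", "sampler_files")
# print(len(names), "entries extracted")
```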
Content Copyright
Portions © 2008 Trustees of the University of Pennsylvania