Asian Spoken Language Sampler
| Item Name: | Asian Spoken Language Sampler |
| Authors: | Linguistic Data Consortium |
| LDC Catalog No.: | LDC2010S07 |
| ISBN: | 1-58563-559-6 |
| Data Source(s): | microphone speech, telephone speech |
| Language(s): | Cantonese, Farsi, Gulf Arabic, Hindi, Japanese, Korean, Levantine Arabic, Mandarin Chinese, Russian, Tamil, Urdu, Vietnamese |
| Language ID(s): | afb, ajp, apc, cmn, fas, hin, jpn, kor, rus, tam, urd, vie, yue |
| Distribution: | Web Download |
| Member fee: | $0 for 2010 members |
| Non-member Fee: | US $0.00 |
| Reduced-License Fee: | US $0.00 |
| Extra-Copy Fee: | N/A |
| Licensing Instructions: | Subscription Members, Standard Members, Non-Members |
| Citation: | Linguistic Data Consortium. 2010. Asian Spoken Language Sampler. Linguistic Data Consortium, Philadelphia. |
Introduction
The Linguistic Data Consortium (LDC) at the University of Pennsylvania
distributes a wide and growing assortment of resources for researchers,
engineers and educators whose work is concerned with human languages.
Historically, most linguistic resources were not generally available to
interested researchers but were restricted to single laboratories or to
a limited number of users. Inspired by the success of selected, readily
available and well-known data sets, such as the Brown University text
corpus, LDC was founded in 1992 to provide a new mechanism for
large-scale corpus development and sharing of resources. With the support of its members, LDC is
able to provide critical services to the language research community.
These services include: maintaining the data archives, producing and
distributing data via media (DVD-ROM or CD-ROM) or web downloads,
negotiating intellectual property agreements with data providers and
maintaining relations with other like-minded groups around the world.
Resources available from LDC (http://www.ldc.upenn.edu) include speech,
text and video data and lexicons in multiple languages, as well as
software tools to facilitate the use of corpus materials. For a
complete view of LDC's publications, a searchable catalog is available
at http://www.ldc.upenn.edu/Catalog/.
Data
The Asian Spoken Language Sampler provides speech and transcript samples
drawn from a range of corpora and is designed to illustrate the variety and
breadth of the speech-related resources available from LDC's Catalog. Further information about each data set can be obtained by clicking the links in the table below.
The sample files provided in this release have been modified in various ways
relative to the original data as published by LDC:
- most excerpts are truncated to be much shorter than the original
  files; excerpt duration is typically one minute and thirty seconds
- signal amplitude has been adjusted where necessary to normalize
  playback volume
- some corpora are published in compressed form, but all samples here
  are uncompressed
- LDC frequently uses NIST SPHERE file format for audio data, but the
  audio files in this sampler have been converted to MS-WAV/audio
  (RIFF) file format for compatibility with typical browser audio
  utilities.
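Because the samples are plain RIFF/WAV files, their basic properties can be checked with standard tools. The following is a minimal sketch using Python's standard `wave` module. The helper names `describe_wav` and `make_demo_wav` are illustrative and not part of the sampler, and the synthetic 8 kHz, 16-bit mono signal is only an assumed stand-in for a real sample file; the actual sample rates and encodings in the sampler are not stated here.

```python
import io
import math
import struct
import wave


def describe_wav(source):
    """Return basic properties of a RIFF/WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def make_demo_wav(seconds=1.0, rate=8000):
    """Build a synthetic 16-bit mono WAV in memory (assumed stand-in for a sample)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        n = int(seconds * rate)
        # 440 Hz sine tone at moderate amplitude
        samples = (int(8000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(n))
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(describe_wav(make_demo_wav()))
```

Passing the path of an extracted sample file to `describe_wav` instead of the in-memory buffer should report that file's channel count, sample rate and duration, which for most excerpts is expected to be around ninety seconds.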
| 2005 NIST Language Recognition Evaluation | The goal of the NIST Language Recognition Evaluation is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. |
| 2007 NIST Language Recognition Evaluation Test Set | The most significant differences between previous NIST evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. |
| ARL Urdu Speech Database, Training Data | The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India. |
| CALLFRIEND Farsi | A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi. |
| CALLFRIEND Tamil | A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil. |
| CALLFRIEND Vietnamese | A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Vietnamese. |
| CALLHOME Japanese | A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts. |
| CALLHOME Mandarin Chinese Speech | The CALLHOME Mandarin Chinese corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. |
| JEIDA/JCSD-Channel 0 Mono Syllables | This collection consists of high-fidelity recordings of 150 native speakers of Japanese. Each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones. |
| Korean Telephone Conversations Speech and Transcripts | This publication consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. All 100 conversations have been transcribed. |
| Mandarin Affective Speech | Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advanced Computing and System Laboratory, Zhejiang University. The speech was recorded by prompting speakers to express different emotional states in response to stimuli. |
| Russian through Switched Telephone Network (RuSTeN) | The purpose of the project was to develop software for automatic identification of speakers based on voice samples acquired through telephone channels. |
| TDT4 Multilingual Broadcast News Speech Corpus | This release contains the complete set of American English, Modern Standard Arabic and Mandarin Chinese broadcast news audio used in the 2002 and 2003 Topic Detection and Tracking technology evaluations. |
| West Point Korean Speech | West Point Korean Speech is a database of digital recordings of spoken Korean. The prompt scripts were created from 20,000 distinct sentences, along with a subset of prompts designed to elicit free response answers to questions for use in domain-specific translation systems. |
| Fisher Levantine Arabic | A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities. |
| Gulf Arabic Conversational Telephone Speech | Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions. |
How to Obtain
The Asian Spoken Language Sampler may be downloaded free of charge. The
sampler is distributed as a gzipped (GNU zip) tar file, which most
compression utilities can extract.
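For readers who prefer to unpack the archive from a script, the following is a minimal sketch using Python's standard `tarfile` module. The function name `extract_sampler` and the archive file name in the commented usage are hypothetical, since the actual download name is not given here. The path check is a general precaution for any downloaded archive, not something specific to this release.

```python
import tarfile
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Extract a gzipped tar archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        # Refuse entries that would be written outside the destination directory.
        for member in members:
            target = (root / member.name).resolve()
            if not target.is_relative_to(root):
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(root)
    return sorted(member.name for member in members)


# Hypothetical usage; replace the file name with the one actually downloaded:
# names = extract_sampler("asian_sampler.tgz", "asian_sampler")
# print(len(names), "entries extracted")
```

On a command line, `tar -xzf` followed by the archive name performs the same extraction.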
Content Copyright
Portions © 2010 Trustees of the University of Pennsylvania