Introduction
The Audiovisual Database of Spoken American English,
Linguistic Data Consortium (LDC) catalog number LDC2009V01 and ISBN 1-58563-496-4,
was developed at Butler University, Indianapolis, IN in 2007 for use by a
variety of researchers to evaluate speech production and speech recognition.
It contains approximately seven hours of audiovisual recordings of fourteen
American English speakers producing syllables, word lists and sentences used
in both academic and clinical settings.
All talkers were from the North Midland dialect region -- roughly defined as
Indianapolis and north within the state of Indiana -- and had lived in that
region for the majority of the time from birth to 18 years of age. Each participant
read 238 different words and 166 different sentences. The words and sentences
were drawn from the following sources:
- Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)
- Northwestern University Auditory Test No. 6 (Lists I-IV)
- Vowels in /hVd/ context (separate words)
- Texas Instruments/Massachusetts Institute of Technology (TIMIT) sentences
The CID Everyday Sentences were created in the 1950s from a sample developed
by the Armed Forces National Research Committee on Hearing and Bio-Acoustics.
They are considered to represent everyday American speech and have the following
characteristics: the vocabulary is appropriate to adults; the words appear with
high frequency in one or more of the well-known word counts of the English language;
proper names and proper nouns are not used; common non-slang idioms and contractions
are used freely; phonetic loading and "tongue-twisting" are avoided;
redundancy is high; the level of abstraction is low; and grammatical structure
varies freely.
Northwestern University Auditory Test No. 6 is a phonemically-balanced set
of monosyllabic English words used clinically to test speech perception in adults
with hearing loss.
The /hVd/ vowel list was created to elicit all of the vowel sounds of American
English.
The TIMIT sentences are a subset (34 sentences) of the 2,342 phonetically rich
sentences read by speakers in the TIMIT Acoustic-Phonetic Continuous Speech
Corpus (LDC93S1). TIMIT was designed to
provide speech data for the acquisition of acoustic-phonetic knowledge and for
the development and evaluation of automatic speech recognition systems. TIMIT
speakers were from eight dialect regions of the United States.
The Audiovisual Database of Spoken American English will be of interest in
various disciplines: to linguists for studies of phonetics, phonology, and prosody
of American English; to speech scientists for investigations of motor speech
production and auditory-visual speech perception; to engineers and computer
scientists for investigations of machine audio-visual speech recognition (AVSR);
and to speech and hearing scientists for clinical purposes, such as the examination
and improvement of speech perception by listeners with hearing loss.
Data
Participants were recorded individually during a single session. A participant
first completed a statement of informed consent and a questionnaire to gather
biographical data and then was asked by the experimenter to mark his or her
Indiana hometown on a state map. The experimenter and participant then moved
to a small, sound-treated studio where the participant was seated in front of
three navy blue baffles. A laptop computer was elevated to eye-level on a speaker
stand and placed approximately 50-60 cm in front of the participant. Prompts
were presented to the participant in a Microsoft PowerPoint presentation. The
experimenter was seated directly next to the participant, but outside the camera
angle, and advanced the PowerPoint slides at a comfortable pace.
Participants were recorded with a Panasonic DVC-80 digital video camera to
miniDV digital video cassette tapes. All participants wore a Sennheiser MKE-2060
directional/cardioid lapel microphone throughout the recordings.
Each speaker produced a total of 94 segmented files, which were exported from
Final Cut Express as QuickTime (.mov) files and then saved in the appropriately
marked folder. If a speaker mispronounced a sentence or word during the recording
process, the mispronunciations were edited out of the segments to be archived.
The remaining parts of the recording, including the correct repetition of each
prompt, were then sequenced together to create a continuous and complete segment.
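The stated figure of 94 segmented files per speaker gives a simple way to check a local copy of the corpus for completeness. The following is a minimal sketch only: it assumes a hypothetical layout in which each speaker has one folder directly under the corpus root and the segments are .mov files somewhere beneath it. The function names and the folder layout are illustrative assumptions, not part of the corpus documentation.

```python
from pathlib import Path

# Stated in the corpus description: 94 segmented files per speaker.
EXPECTED_FILES_PER_SPEAKER = 94


def count_segments(corpus_root):
    """Return {speaker_folder_name: number of .mov files found beneath it}.

    Assumes (hypothetically) one folder per speaker directly under corpus_root.
    """
    counts = {}
    for speaker_dir in sorted(Path(corpus_root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # ignore stray files at the top level
        counts[speaker_dir.name] = sum(
            1
            for path in speaker_dir.rglob("*")
            if path.is_file() and path.suffix.lower() == ".mov"
        )
    return counts


def incomplete_speakers(counts, expected=EXPECTED_FILES_PER_SPEAKER):
    """Return only the speakers whose file count differs from the expected value."""
    return {name: n for name, n in counts.items() if n != expected}
```

Calling `incomplete_speakers(count_segments(root))` on a complete copy laid out this way would return an empty dictionary; any entries it does return name the speaker folders to inspect.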
The fourteen participants were between 19 and 61 years of age (with a mean
age of 30 years) and were native speakers of American English.
Samples
For an example of the data in this corpus, please view this video sample (QuickTime, .mov).
Content Copyright
Portions © 2007 Butler University, © 1993, 2009 Trustees of the University
of Pennsylvania