Introduction
This lexicon contains pronunciations captured in individual
audio files for 53,602 of the most common words in English.
Data
50,892 words were chosen from LDC's CALLHOME American English Lexicon
on the basis of their frequency in the data that were used in creating
the 1994 CSR Language Model Text Corpus ("CSR-III Text Corpus,"
LDC95T6). The sources for the language model include Wall Street
Journal (1987-1994), Associated Press (1989-1991), and San Jose
Mercury News (1991), all taken from the three CD-ROM volumes of TIPSTER
(LDC93T3A). To extend the coverage of common words that happen not to
occur in the LDC corpora sampled, an additional 2,922 words
(i.e., compounds, companies, places, languages, and numerals) were added
from other sources.
Each word was read by the speaker in a quiet recording studio, using a
Sennheiser HMD 410 microphone and a Sony DAT recorder. The recordings
were downsampled to 16 kHz for storage on disk, and the individual
lexical utterances were segmented into separate waveform files with a
consistent margin of silence on both sides of each word.
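The segmentation step described above can be illustrated with a minimal
sketch. It is not the procedure actually used to build the lexicon: the
function name trim_with_margin, the amplitude threshold, and the margin
length are all hypothetical values chosen for illustration. Only the
16 kHz sample rate comes from the text.

```python
SAMPLE_RATE_HZ = 16_000  # storage rate stated in the documentation


def trim_with_margin(samples, threshold=500, margin_s=0.1, rate=SAMPLE_RATE_HZ):
    """Cut a recording to its spoken region plus a fixed silent margin.

    `threshold` and `margin_s` are illustrative assumptions, not values
    taken from the lexicon documentation.
    """
    margin = int(round(margin_s * rate))  # margin length in samples

    # Indices of samples loud enough to count as speech.
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return []  # nothing above the threshold, so no word was found

    first, last = voiced[0], voiced[-1]
    core = list(samples[first:last + 1])

    # Pad with digital silence so every output has the same margin on
    # both sides, however much silence the original take contained.
    return [0] * margin + core + [0] * margin
```

A simple amplitude threshold is the crudest possible endpoint detector; a
real pipeline would more likely use short-time energy or hand-checked
boundaries, but the idea of a fixed margin on both sides is the same.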
The CD-ROMs were created using the ISO-9660 Level 2 data format, along
with Rock Ridge extensions. All common computer operating systems
should be able to read the full-length file names.
Updates
There are no updates at this time.