American English Spoken Lexicon

Item Name: American English Spoken Lexicon
Authors: Amanda Hallie Seidl-Friedman, Masato Kobayashi, and Christopher Cieri
LDC Catalog No.: LDC99L23
ISBN: 1-58563-156-6
Data Type: lexicon
Data Source(s): microphone speech
Project(s): EARS, GALE
Application(s): speech recognition
Language(s): English
Language ID(s): ENG
Distribution: 4 CD
Member fee: $0 for 1999 members
Non-member Fee: US$500.00
Reduced-License Fee: US$500.00
Extra-Copy Fee: US$500.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Amanda Hallie Seidl-Friedman, Masato Kobayashi, and Christopher Cieri. 1999. American English Spoken Lexicon. Linguistic Data Consortium, Philadelphia.

Introduction

This lexicon contains pronunciations captured in individual audio files for 53,602 of the most common words in English.

Data

50,892 words were chosen from LDC's CALLHOME American English Lexicon on the basis of their frequency in the data that were used in creating the 1994 CSR Language Model Text Corpus ("CSR-III Text Corpus," LDC95T6). The sources for the language model include the Wall Street Journal (1987-1994), Associated Press (1989-1991), and San Jose Mercury News (1991), all taken from the three CD-ROM volumes of TIPSTER (LDC93T3A). To extend the coverage to common words that happen not to occur in the LDC corpora sampled, an additional 2,922 words (i.e., compounds, companies, places, languages, and numerals) were added from other sources.
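The frequency-based selection described above can be illustrated with a minimal sketch. The catalog entry does not document the actual procedure, so the function name `select_frequent_words`, the tokenization rule, and the cutoff parameter `limit` are all assumptions made for illustration only.

```python
import re
from collections import Counter


def select_frequent_words(corpus_lines, lexicon_words, limit):
    """Return up to `limit` lexicon words, most frequent in the corpus first.

    Hypothetical helper: the tokenizer (lowercased runs of letters and
    apostrophes) and the cutoff are illustrative, not LDC's method.
    """
    lexicon = {word.lower() for word in lexicon_words}
    counts = Counter()
    for line in corpus_lines:
        for token in re.findall(r"[a-z']+", line.lower()):
            # Only count tokens that already have a lexicon entry.
            if token in lexicon:
                counts[token] += 1
    return [word for word, _ in counts.most_common(limit)]


corpus = ["the cat sat", "the cat", "the dog"]
lexicon = ["the", "cat", "dog", "zebra"]
selected = select_frequent_words(corpus, lexicon, limit=2)
```

Words that never occur in the sampled text (here `zebra`) are never selected by this approach, which is the kind of gap a supplementary word list from other sources would fill.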

Each word was read by the speaker in a quiet recording studio, using a Sennheiser HMD 410 microphone and a Sony DAT recorder. The recordings were downsampled to 16 kHz for storage on disk, and the individual lexical utterances were segmented into separate waveform files with a consistent margin of silence on both sides of each word.
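As a rough illustration of segmenting an utterance with a fixed margin of silence on both sides, the sketch below trims a list of integer samples around the region that exceeds an amplitude threshold. The segmentation method actually used for this corpus is not described here; the function name `trim_with_margin`, the threshold, and the margin length are hypothetical values chosen for the example.

```python
def trim_with_margin(samples, sample_rate=16000, margin_s=0.1, threshold=500):
    """Keep the speech region plus `margin_s` seconds of context on each side.

    Hypothetical helper: a simple amplitude threshold stands in for whatever
    segmentation procedure was really used.
    """
    # Indices of samples loud enough to be treated as speech.
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    margin = int(margin_s * sample_rate)
    # Clamp to the recording bounds; the margin is shorter if too little
    # silence was recorded around the word.
    start = max(0, loud[0] - margin)
    end = min(len(samples), loud[-1] + 1 + margin)
    return samples[start:end]


# Toy signal: 10 silent samples, 3 loud samples, 10 silent samples.
signal = [0] * 10 + [1000, -1000, 1000] + [0] * 10
word = trim_with_margin(signal, sample_rate=10, margin_s=0.2, threshold=500)
```

With a 10 Hz toy sample rate and a 0.2 second margin, two silent samples are kept on each side of the three loud ones.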

The CD-ROMs were created using the ISO-9660 Level 2 data format, along with Rock Ridge extensions. All common computer operating systems should be able to read the full-length file names.

Updates

There are no updates at this time.

Copyright


Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.