John F. Pitrelli, Cynthia Fong, Hong C. Leung
April 20, 1995
PhoneBook is a phonetically-rich, isolated-word, telephone-speech database, created because of (1) the lack of available large-vocabulary isolated-word data, (2) anticipated continued importance of isolated-word and keyword-spotting technology to speech-recognition-based applications over the telephone, and (3) findings that continuous-speech training data is inferior to isolated-word training for isolated-word recognition.
The goal of PhoneBook is to serve as a large database of American English word utterances incorporating all phonemes in as many segmental/stress contexts as are likely to produce coarticulatory variations, while also spanning a variety of talkers and telephone transmission characteristics. We anticipate that it will be useful in ways analogous to TIMIT/NTIMIT.
The core section of PhoneBook consists of a total of 93,667 isolated-word utterances, totalling 23 hours of speech. This breaks down to 7979 distinct words, each said by an average of 11.7 talkers, with 1358 talkers each saying up to 76 words. All data were collected in 8-bit mu-law digital form directly from a T1 telephone line. Talkers were adult native speakers of American English chosen to be demographically representative of the U.S.
Given the large set of talkers being recruited for the PhoneBook database, it made sense to exploit the opportunity to collect additional utterances. We chose spontaneous numerical utterances because of widespread interest in them and because research into spontaneous-speech effects needs very large numbers of talkers. We restricted ourselves to just three spontaneous digit sequences and one money amount, as the lists for the core of PhoneBook had been designed to approach the limit of reasonable duration for a caller's session. As a result, PhoneBook contains a total of 5081 spontaneous utterances.
The following outline summarizes the development of PhoneBook, as well as the remainder of this document.
The design of the phonetically-rich word list is central to PhoneBook. The goal was to make the word list as compact as possible while
We began with the largest machine-readable dictionary sources available, primarily the 99,000-word CMU dictionary. We eliminated entries with multiple words or with foreign phonemes, and reconciled the remaining entries' phonemic representations into a common 42-phoneme inventory:
    i  bEAt      Y  bIte      n  Neat      D  THy
    I  bIt       O  bOY       m  Meet      p  Pea
    e  bAIt      W  bOUt      G  siNG      t  Tea
    E  bEt       R  bIRd      h  Heat      k  Key
    @  bAt       x  sofA      s  See       b  Bee
    a  bOb       X  buttER    S  SHe       d  Day
    c  bOUGHt    L  bottLE    f  Fee       g  Geese
    o  bOAt      l  Let       T  THigh     C  CHurCH
    ^  bUt       w  Wet       z  Zoo       J  JuDGe
    u  bOOt      r  Red       Z  meaSure
    U  bOOk      y  Yet       v  Van

Vowels are marked with three levels of lexical stress -- "1" for primary, "2" for secondary, and no marking for reduced. These symbols, as well as "#" for word boundary, will be used in this document.
B. Word filtering
The purpose of the large dictionary was to be a source from which to draw words for a reading task in which obtaining a particular pronunciation from a naive speaker was of key importance. This purpose presumably diverges from the goals of dictionary makers; consequently, several categories of problematic words had to be filtered out:
C. Phonemic/stress context enumeration criteria
Our phoneme sequence enumeration is based both on common practice in the speech recognition community and on acoustic-phonetic and articulatory knowledge. Most speech recognition research approximates phonemic contexts using triphones as equivalence classes, implicitly assuming that the primary coarticulatory effects on a phoneme are related to only the one preceding and one following phoneme. Some coarticulatory effects, however, reach beyond adjacent phonemes or are otherwise not covered by traditional triphone inventories. For example, the /u/ in "strewn" can cause some degree of anticipatory rounding throughout the /str/ sequence. Furthermore, lexical stress influences the acoustics of both vowels and consonants, but is typically not accounted for completely and consistently by triphone enumeration. Finally, the position within the syllable of a phoneme can affect the phoneme's articulation and acoustics.
For these reasons we enumerate phonemic contexts in terms of a simple syllable template, defining three "syllable parts" -- the "onset", consisting of all consonants preceding the vowel; the "nucleus", consisting of the vowel, a mark indicating one of three stress levels, and one postvocalic liquid if any; and the "coda", consisting of any remaining postvocalic consonants. However, to avoid reliance on the often-ill-defined syllable boundary within a sequence of consonants, we treat a coda-onset sequence as a single "part", "consonant sequence", with no loss of specificity in each enumerated context.
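The segmentation into syllable parts can be illustrated with a short Python sketch. It is not the software used to build PhoneBook, and it rests on several assumptions that the text does not state: the stress digit is written immediately before its vowel (as in the header example "st1exG"), the vowel set is the one inferred from the key words in the phoneme table, the liquids are /l/ and /r/, and a postvocalic liquid joins the nucleus only when it is not directly followed by another vowel. The function name and the part labels "C" and "N" are hypothetical.

```python
# Illustrative sketch only; symbol classes below are assumptions.
VOWELS = set("iIeE@aco^uUYOWRxXL")  # assumed nuclei, inferred from the key words
LIQUIDS = set("lr")                  # assumed liquid set
STRESS = set("12")                   # "1" primary, "2" secondary


def syllable_parts(trans):
    """Split a transcription into ('C', consonant sequence) and ('N', nucleus) parts."""
    parts = []
    cons = ""
    i = 0
    n = len(trans)
    while i < n:
        ch = trans[i]
        stress = ""
        if ch in STRESS:
            # Assumption: a stress digit always precedes a vowel.
            if i + 1 >= n or trans[i + 1] not in VOWELS:
                raise ValueError("stress mark not followed by a vowel")
            stress = ch
            i += 1
            ch = trans[i]
        if ch in VOWELS:
            if cons:
                parts.append(("C", cons))
                cons = ""
            nucleus = stress + ch
            i += 1
            # Attach one postvocalic liquid unless a vowel follows it (assumption).
            if i < n and trans[i] in LIQUIDS:
                nxt = trans[i + 1] if i + 1 < n else ""
                if nxt not in VOWELS and nxt not in STRESS:
                    nucleus += trans[i]
                    i += 1
            parts.append(("N", nucleus))
        else:
            cons += ch
            i += 1
    if cons:
        parts.append(("C", cons))
    return parts


staying_parts = syllable_parts("st1exG")
```

Because consonants between two nuclei are kept together as one "C" part, the sketch never has to decide where a syllable boundary falls inside a consonant cluster.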
Our inventory of phonemes-in-context contains all distinct sequences of two such syllable parts found in a dictionary. We also include a triphone inventory, due to the widespread interest in triphones in the recognition community. Stress marks were not considered to be part of a phone for purposes of determining the triphone inventory, though some stress distinctions are implicitly represented in the vowel set, such as /I/ vs. /|/, while others are not, such as stressed /i/ vs. unstressed /i/. Beginning- and end-of-word were considered to be syllable parts and phonemes for purposes of developing the two-syllable-part-sequence and triphone inventories, respectively. Henceforth we refer to two-syllable-part sequences and triphones in aggregate as "contexts".
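As a hedged illustration of the two kinds of context, the Python sketch below enumerates triphones and two-syllable-part sequences for one word. The label format ("BC-", "CN-", "NN-", "NC-", "CB-", "T-") follows the phon_reason example given with the file headers in this document; the function names are hypothetical, and the syllable parts are supplied by hand rather than derived automatically.

```python
def triphones(trans):
    """Triphones over the stress-free phoneme string, with '#' as word boundary."""
    phones = "#" + "".join(ch for ch in trans if ch not in "12") + "#"
    return [phones[i:i + 3] for i in range(len(phones) - 2)]


def two_part_sequences(parts):
    """Adjacent pairs of syllable parts, with word boundaries as 'B' parts."""
    padded = [("B", "")] + list(parts) + [("B", "")]
    return [
        f"{kind1}{kind2}-{text1}{text2}"
        for (kind1, text1), (kind2, text2) in zip(padded, padded[1:])
    ]


# Hand-supplied parts for "staying", transcribed st1exG in the header example.
parts = [("C", "st"), ("N", "1e"), ("N", "x"), ("C", "G")]
contexts = two_part_sequences(parts) + ["T-" + t for t in triphones("st1exG")]
phon_reason = ",".join(contexts)
```

For this word the sketch reproduces the phon_reason string shown in the isolated-word header example.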
Working with the candidate word list, context enumeration yielded 10,839 triphones and 9458 two-syllable-part sequences.
D. Word-list-generation algorithm
We then employed a greedy algorithm to extract a final word list as compact as possible from the candidate word list, while still covering our entire inventory of contexts. Our algorithm, which is similar to one described by Kassel (ICSLP 1994, pp. 1827-1830), is:
The software which generates the word list from the candidate list is included in this distribution.
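For orientation only, the Python sketch below shows a generic greedy set-cover heuristic of the kind described. It is not the distributed software and does not claim to reproduce the exact steps of our algorithm; the function name, the tie-breaking rule (alphabetical order) and the toy data are all assumptions.

```python
def greedy_word_list(candidates):
    """candidates maps each word to the set of contexts it contains.

    Repeatedly pick the word covering the most still-uncovered contexts
    until every context found in the candidate list is covered.
    """
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # sorted() makes ties deterministic: the alphabetically first word wins.
        best = max(sorted(candidates), key=lambda w: len(candidates[w] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Toy candidate list with made-up context labels.
toy = {
    "alpha": {"T-#al", "T-alf", "T-lfx"},
    "beta": {"T-lfx", "T-#be"},
    "gamma": {"T-#be"},
    "delta": {"T-#al"},
}
selected = greedy_word_list(toy)
```

A greedy choice does not guarantee the smallest possible list, but it is simple and usually produces a compact one.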
A PC-based recording platform was developed in our lab for this and other similar data collections. The platform terminates a T1 trunk, enabling collection on up to 24 channels simultaneously. All data were digitally recorded by the platform in the T1 line's 8-bit mu-law format, with no analog conversion.
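Users who need linear samples can expand the mu-law bytes themselves. The Python sketch below follows the standard G.711 mu-law expansion; it is not part of the PhoneBook distribution, and the function name is hypothetical.

```python
def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code (0..255) to a signed linear sample."""
    u = ~byte & 0xFF                 # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode(data):
    """Expand a bytes object of mu-law samples to a list of integers."""
    return [ulaw_to_linear(b) for b in data]


samples = decode(bytes([0xFF, 0x7F, 0x00, 0x80]))
```

The expanded values span roughly 14 bits plus sign, so they fit comfortably in 16-bit integers.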
Four channels of a T1 span were used for PhoneBook. Talkers were given a toll-free number connected to a hunt group for these channels. Talkers used their own telephone handsets to call our system. The system was available for calling 21 hours a day (6 AM to 3 AM Eastern Time), 7 days a week.
The 7979 words were divided into 106 disjoint lists, with 29 containing 76 words each and the remaining 77 containing 75 each. These lists were padded to 79 read words by adding three or four isolated read digits (not included with PhoneBook) onto the beginning of each list, so that talkers could adjust to the flow of the session before reaching the words of interest. Callers were given three seconds for each of these first 79 items.
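The list arithmetic above can be checked in a few lines of Python; the variable names are illustrative only.

```python
lists_with_76 = 29
lists_with_75 = 77

total_lists = lists_with_76 + lists_with_75            # number of disjoint lists
total_words = lists_with_76 * 76 + lists_with_75 * 75  # distinct words overall

# Three leading digits on 76-word lists, four on 75-word lists.
padded_lengths = {76 + 3, 75 + 4}
```

Both kinds of list therefore reach the same padded length, so every caller reads the same number of timed items.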
The 80th item was a 9-digit number printed on the list; this number identified which of the 106 lists was being read, and it was formatted like a Social Security number (NNN-NN-NNNN). Callers were given seven seconds for this item. The remaining prompts were for spontaneous speech. The first seven were for information about the caller (utterances not included with PhoneBook), the next three elicited digit sequences, and the final one requested a money amount. The digit utterances are a ZIP code, intended to be five digits but could be nine, for which talkers were given four seconds; a telephone number, intended to be seven digits but could be eight, 10 or 11, for which five seconds were allowed; and a telephone number including an area code, intended to be 10 or 11 digits, for which eight seconds were allowed. Eight seconds were also provided for the money amount. All told, the call lasted approximately eight minutes. Following are the exact initial greeting, the 91 prompts, and the sign-off.
Thank you for calling the NYNEX Speech Database System. For each item on your list you will hear the item number followed by a beep. After the beep, please wait briefly and then say the item.
Please say item 1.
Please say item 2.
Item 3.
Item 4.
Item 5.
Item 6.
7.
8.
(etc., through 80.)
What is your age?
Are you male or female?
What states and foreign countries did you live in before age 21?
What is your first language?
What languages were spoken at home while you were growing up?
If you have any speech or hearing problems, please mention them now.
Please say your panel ID number.
Please say a ZIP code.
Please say a phone number.
Please say a phone number that includes an area code.
Please say an amount of money.
This concludes your session. Thank you for participating. Good-bye.
Talkers attempting to call off hours heard the message "Thank you for calling the NYNEX Speech Database System. Our system is operational 21 hours a day but not between 3 AM and 6 AM Eastern Time. Please call back during our operational hours. Thank you. Good-bye."
We contracted with The NPD Group, Inc., a marketing research firm which maintains a diverse panel of households in the 50 United States, to provide us a demographically-balanced sample of adult American talkers. Our specification was that they send out lists in such a way that the demographic variations in response rate would generate for us a set of at least 1060 talkers (ten per list), which is, in order of priority:
Five mailings totaling 7683 letters were sent out as follows:

    4/ 8/94    1272
    4/29/94    2650
    7/ 1/94    2035
    8/12/94     990
    9/ 2/94     736

While we enjoyed a 44% response rate, higher than anticipated, we suffered a 59% talker rejection rate, far in excess of what others have found in the past, owing to the difficulty of the material and the greater scrutiny applied to the utterances -- verifying a particular pronunciation for each word rather than merely any accepted one. Thus, we needed this many letters to reach 1358 callers and 93,667 isolated-word utterances. We went beyond our initial goal in an attempt to obtain at least 10 utterances of as many words as possible.
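As a rough consistency check on these figures, the Python sketch below sums the mailing counts and applies the rounded percentages. It assumes that the response rate is the fraction of letters that led to a call and that the rejection rate is the fraction of those callers who were rejected; neither definition is spelled out above.

```python
mailings = {
    "4/8/94": 1272,
    "4/29/94": 2650,
    "7/1/94": 2035,
    "8/12/94": 990,
    "9/2/94": 736,
}

letters = sum(mailings.values())
response_rate = 0.44   # rounded figure from the text
rejection_rate = 0.59  # rounded figure from the text

# Estimated accepted talkers under the assumptions stated above.
estimated_accepted = letters * response_rate * (1 - rejection_rate)
accepted_per_letter = 1358 / letters
```

With rounded percentages the estimate lands within a few dozen of the 1358 talkers actually obtained, or a little under one accepted talker for every five letters.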
For read isolated words, any silence exceeding 300 ms on either side of the segment the transcribers marked as speech was eliminated. Spontaneous-speech files were not trimmed.
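The trimming rule can be expressed as a small Python sketch. This is an illustration of the rule as stated, not the tool actually used; the function name and its return convention are hypothetical.

```python
def trim_bounds(speech_start, speech_end, total_duration, max_pad=0.3):
    """Return (cut_start, cut_end, new_begin_time, new_end_time) in seconds.

    At most max_pad seconds of silence are kept on each side of the speech
    segment; shorter silences are kept whole.
    """
    cut_start = max(0.0, speech_start - max_pad)
    cut_end = min(total_duration, speech_end + max_pad)
    # Speech boundaries re-expressed relative to the trimmed waveform.
    return cut_start, cut_end, speech_start - cut_start, speech_end - cut_start


# A 3-second recording whose speech runs from 1.0 s to 1.872 s.
cut_start, cut_end, begin_time, end_time = trim_bounds(1.0, 1.872, 3.0)
```

In this example the trimmed file keeps 0.3 s of silence before and after the speech, so the speech begins 0.3 s into the trimmed waveform.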
Filenames for the isolated words are of the format:
u<xxxx>_<gender>_<orthography>.wav
where <xxxx> is a four-digit unique identification number for the talker, <gender> is "m" or "f", and <orthography> is the orthography of the word, in all lower-case, with apostrophes replaced by "+"s.
Filenames for the spontaneous utterances are of the format:
u<xxxx>_<gender>_<type>.wav
where <xxxx> and <gender> are as above, and <type> is "zip_code", "phone_num", "long_dist" or "money".
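A minimal Python sketch for splitting either kind of filename into its fields follows. The function name, the returned keys and the example filenames are hypothetical. Note one caveat: a label such as "money" could in principle be either a spontaneous-utterance type or a word from the read lists, so the sketch only flags a file as possibly spontaneous.

```python
import re

FILENAME_RE = re.compile(r"^u(?P<talker>\d{4})_(?P<gender>[mf])_(?P<label>.+)\.wav$")
SPONTANEOUS_TYPES = {"zip_code", "phone_num", "long_dist", "money"}


def parse_filename(name):
    """Split a PhoneBook-style filename into talker ID, gender and label."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized filename: {name!r}")
    label = match.group("label")
    return {
        "talker": match.group("talker"),
        "gender": match.group("gender"),
        "label": label,
        # Apostrophes are stored as "+" in isolated-word filenames.
        "word": label.replace("+", "'"),
        "maybe_spontaneous": label in SPONTANEOUS_TYPES,
    }


word_file = parse_filename("u0123_f_don+t.wav")       # made-up example name
spont_file = parse_filename("u0123_f_zip_code.wav")   # made-up example name
```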
Each utterance file has a 1024-byte SPHERE header formatted as follows (comments within [] are not part of header):
Isolated words:
database_id PHONEBOOK
sample_rate 8000
list_number 602 [all even numbers from 602 to 812]
orthography staying
phon_trans st1exG
[See section 1C. The following entry indicates a
word-boundary/consonant-sequence sequence /st/, a
consonant-sequence/syllable-nucleus sequence /ste/ with the /e/ having primary
stress, a nucleus/nucleus sequence primary-stressed /e/ followed by schwa, a
nucleus/consonant-sequence sequence schwa followed by
/G/, a consonant-sequence/word-boundary sequence /G/, a "triphone" consisting
of word-initial /st/, triphones /ste/, /tex/, /exG/, and a "triphone"
consisting of word-final /xG/. This line can reach ~200 characters for a
long word.]
phon_reason BC-st,CN-st1e,NN-1ex,NC-xG,CB-G,T-#st,T-ste,T-tex,T-exG,T-xG#
sample_count 11779
sample_n_bytes 1 [1 byte per sample]
channel_count 1 [recorded on channel 1 of recording platform]
sample_coding ulaw [mu-law representation of waveform]
sample_byte_format 1
sample_sig_bits 8 [8 significant bits in a sample]
sample_checksum 17834
database_version 1.0
begin_time 0.300000 [speech begins 0.3 sec. into the provided waveform]
end_time 1.172000 [speech ends 1.172 sec. into the provided waveform]
Spontaneous speech:
database_id PHONEBOOK
database_version 1.0
sample_count 64000
sample_n_bytes 1
channel_count 1
sample_byte_format 1
sample_rate 8000
sample_coding ulaw
sample_checksum 34600
orthography (sigh)_an_amount_of_money_(sigh)_(lgh)_(ext)
begin_time 0.548750
end_time 1.924750
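The header listings above can be read into a dictionary with a short Python sketch. It parses only the simplified "name value" form printed in this document, one field per line, with any bracketed comment removed; the on-disk SPHERE header normally also carries a type code for each field, which this sketch does not handle. The function name is hypothetical.

```python
import re


def parse_header_listing(text):
    """Parse 'name value' lines, dropping trailing '[...]' comments."""
    fields = {}
    for raw in text.splitlines():
        # Cut a comment only when '[' follows whitespace, so that bracketed
        # fragments inside an orthography such as twen[ty] are preserved.
        line = re.split(r"\s+\[", raw, maxsplit=1)[0].strip()
        if not line or line.startswith("["):
            continue
        name, _, value = line.partition(" ")
        fields[name] = value.strip()
    return fields


EXAMPLE = """\
database_id PHONEBOOK
sample_rate 8000
orthography staying
phon_trans st1exG
sample_count 11779
sample_coding ulaw [mu-law representation of waveform]
begin_time 0.300000
end_time 1.172000
"""

header = parse_header_listing(EXAMPLE)
total_seconds = int(header["sample_count"]) / int(header["sample_rate"])
speech_seconds = float(header["end_time"]) - float(header["begin_time"])
```

For the isolated-word example the file lasts about 1.47 seconds, of which roughly 0.87 seconds is marked as speech, leaving about 0.3 seconds of silence on each side.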
While PhoneBook shares many utterance verification criteria with other speech data collection efforts, one criterion which for this project stands out in importance and, consequently, detail, is verification of correct pronunciation. Central to PhoneBook fulfilling its purpose as a phonetically-rich training database is our ability to capture legitimate American English PHONETIC and PHONOLOGICAL VARIATIONS that different speakers produce as realizations of the SAME PHONEMIC target. Thus, we are looking to capture regional/dialect variabilities such as how often talkers reduce a schwa-nasal sequence to a syllabic nasal, how often and to what degree talkers reduce an intervocalic alveolar stop into a flap, what "vowel color" talkers produce for syllable nuclei such as /c/ and /ar/, etc., while avoiding conflating these variations with phonemic variations such as whether the second syllable in "tomato" has an /e/ or an /a/.
Therefore, as described in section 1B, every effort was made to exclude words with multiple commonly-accepted phonemic sequences. Transcribers were educated as to the difference between phonemic and phonetic variation, and were provided with our anticipated phonemic transcription of each word at verification time. They were told to reject a word on the grounds of mispronunciation only in the event that what they heard did not represent a reasonable phonetic realization of the phoneme sequence they saw. By this criterion, when an unanticipated alternate phonemic form was discovered at verification time, a reasonably-pronounced utterance was rejected as "mispronounced" because it did not represent the phoneme sequence for which its word was included in PhoneBook.
A. Talker-rejection criteria
Transcribers rejected a talker upon detecting any of the following:
B. Read-isolated-word-utterance rejection criteria
Transcribers rejected a core PhoneBook utterance upon detecting any of:
C. Spontaneous-utterance rejection criteria
Transcribers rejected a spontaneous utterance only when the recording did not contain speech.
D. Spontaneous-utterance transcription procedure
Transcribers were told to type exactly what was said, all in lower case, spelling out numbers without hyphens, and using no punctuation except for apostrophes when the talker said a contraction or a possessive.
When a portion of a word was unclear, it was transcribed within square brackets if determinable ("twen[ty]") or labeled as "[?]" if not.
When background noise(s) overlapped speech, "(ext-b)" was marked before the first affected word and "(ext-e)" was marked after the last affected word for each contiguous passage of noise-affected speech regardless of how many noise sources were involved or the degree to which they overlapped each other.
Markers for other speech and non-speech events are as follows:
(uni) unintelligible speech (other than fragments of
partially-recognizable words)
(?word) speech was not definitely intelligible but was probably "word"
(for) foreign speech
(uh) oral hesitation filler
(mm) nasal hesitation filler
(um) oral-then-nasal hesitation filler
(obs) obscenity
(lgh) laughing
(thr) throat clearing, coughing or sniffle
(sigh) sighing or loud breathing
(esp) speech produced by the talker but not directed to the telephone
(hang-up) hang-up clicks
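When working with the transcriptions, it is often convenient to separate the spoken words from the event markers. The Python sketch below does this for an orthography string in the form shown in the spontaneous-speech header example, where tokens are joined by underscores; that joining convention is inferred from the example, and the function name is hypothetical.

```python
def split_transcription(orthography):
    """Separate words from parenthesized event markers."""
    words = []
    markers = []
    for token in orthography.split("_"):
        if not token:
            continue
        if token.startswith("(") and token.endswith(")"):
            markers.append(token[1:-1])   # keep the marker name without parentheses
        else:
            words.append(token)           # includes bracketed forms such as twen[ty]
    return words, markers


words, markers = split_transcription("(sigh)_an_amount_of_money_(sigh)_(lgh)_(ext)")
```

Markers are returned in order of occurrence, so paired markers such as "ext-b" and "ext-e" can still be matched up afterwards.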