Speech Accent Archive: Issues and Methods
Steven H. Weinberger
George Mason University
weinberg@gmu.edu

Workshop on Web-Based Language Documentation and Description
12-15 December, 2000
Philadelphia, PA

Abstract
Every human language speaker has an accent. Accents, and particularly foreign accents, have much to offer linguistic theory, for they serve as windows through which to view native grammars. Most discussions of non-native speech rely upon flat, static, paper descriptions. This is a report on an on-line web repository of digitized non-native English speech (http://classweb.gmu.edu/accent). This Speech Accent Archive contains fully accessible recorded audio samples from more than 131 non-native speakers representing over 51 languages. Each sample constitutes a type of annotated signal. The samples include the digitized audio, a set of demographic characteristics about each speaker, a phonetically transcribed representation of the signal, and a set of phonological generalizations about the speech. The web site also includes a protocol for remote researchers to electronically contribute data samples to the archive. Some of the issues that will be discussed include the collection, storage, and delivery of the audio signals and their annotations. We focus upon the speech collection device, the delivery of the audio signal, and the representation of the phonetic transcription. We detail the problems involved in each of these activities and report upon our own best practices.

then they said unto him, "Say now Shibboleth," and he said, "Sibboleth," for he could not frame to pronounce it right; then they laid hold of him, and slew him at the fords of the Jordan. And there fell at that time of Ephraim forty and two thousand. (Judges 12:6)

0. Introduction
Every human who speaks a language has an accent, and every human who listens to others talk perceives an accent. This is true for both regional accents within the same language group, and for foreign accents. Human listeners tend to construct judgements about other speakers, and while the consequences of these biases are rarely as severe as that which purportedly obtained between the Gileadites and the Ephraimites, the judgements are often biased, and reveal serious discrimination in our society (Lippi-Green, 1997; Preston, 1989; Rubin, 1992). Foreign accented speech is particularly susceptible to such judgements, and even from non-naive perspectives there remain some serious misunderstandings about the nature and value of foreign accent. For example, speaking with an accent has variously been viewed as a pathological condition (Chreist, 1964; Regional Rehabilitation Hospital, 2000). From a mainstream theoretical linguistic position, it has been viewed as deficient data, somehow lacking the qualities of native language data.

Second language acquisition phonological studies, by definition, use accented speech as a major data source. The conclusions of many of these studies suggest that foreign accented speech not only contains valuable linguistic clues to a speaker's internalized native phonology, but also shows universal characteristics (Ioup and Weinberger, 1987; Leather and James, 1996).

In this paper, we report on the construction of an archive that compiles and delivers annotated accented speech signals. The archive is structured to provide uniform, searchable, and annotated data to anyone doing linguistic research in accented speech. Section 1 describes the organization of the archive, section 2 deals with the speech collection methodology, section 3 discusses the digitized audio samples, and section 4 deals with the problems of phonetic annotation.

1. The archive
The archive is located at http://classweb.gmu.edu/accent. As of 24 November, the archive contains 131 samples from 51 language backgrounds. The languages include: Afrikaans, Agny, Amharic, Arabic, Armenian, Bambara, Bengali, Bosnian, Cantonese, Czech, English, Farsi, Finnish, French, German, Greek, Gujarati, Gusii, Hebrew, Hungarian, Igbo, Italian, Japanese, Khalkha Mongol, Kiswahili, Korean, Kurdish, Lao, Latvian, Malayalam, Mandarin, Mauritian, Norwegian, Polish, Portuguese, Punjabi, Russian, Serbo-Croatian, Slovak, Somali, Spanish, Swedish, Synthesized, Taiwanese, Thai, Tibetan, Turkish, Urdu, Uzbek, Vietnamese, and Wolof. Some of the language categories, like Uzbek, have just one speaker sample, and others, like English and Spanish, have more than 10 speaker samples from different native language regions. Each speaker is recorded according to a required protocol. Each speaker reads the following paragraph:

Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Subjects are asked a set of 7 demographic questions:
1. where were you born?
2. what is your native language?
3. what other languages besides English and your native language do you know?
4. how old are you?
5. how old were you when you first began to study English?
6. how did you learn English? (academically or naturalistically)
7. how long have you lived in an English-speaking country? which country?
Limiting the demographic variables to these 7 distinct items allows us to triangulate on 3 major speaker parameters: language background (nos. 1, 2, and 3), age (nos. 4 and 5), and residency (nos. 6 and 7). These are precisely the parameters that contemporary second language acquisition theory finds to be most instrumental in determining speech proficiency.
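
As an illustration only, the seven answers and the three parameters they feed could be held in a record like the following Python sketch. The class name, the field names, and the sample values are hypothetical; they are not taken from the archive's actual storage format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerRecord:
    """Hypothetical container for the 7 demographic answers (names are assumptions)."""
    birthplace: str                    # question 1
    native_language: str               # question 2
    other_languages: List[str]         # question 3
    age: int                           # question 4
    english_onset_age: Optional[int]   # question 5
    learning_method: str               # question 6: "academic" or "naturalistic"
    residence_years: float             # question 7 (length of residence)
    residence_country: Optional[str]   # question 7 (which country)

    def parameters(self) -> dict:
        """Group the answers into the three parameters named in the text."""
        return {
            "language background": (
                self.birthplace,
                self.native_language,
                tuple(self.other_languages),
            ),
            "age": (self.age, self.english_onset_age),
            "residency": (
                self.learning_method,
                self.residence_years,
                self.residence_country,
            ),
        }


# Invented example values, not a real archive entry.
example = SpeakerRecord(
    "Cairo, Egypt", "Arabic", ["French"], 30, 12, "academic", 2.0, "USA"
)
groups = example.parameters()
```

Keeping the grouping in one method mirrors the mapping given above (nos. 1-3, nos. 4-5, nos. 6-7), so a search over the samples could filter on a parameter rather than on individual answers.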

Each sample has its own page. The page includes a summary report on the 7 demographic variables, a Quicktime soundtrack of the speech sample, and a phonetic transcription of the speech. Some of the samples include a link to a speaker-specific phonological generalization page. For an example of some generalizations for an Arabic speaker of English, go to http://classweb.gmu.edu/accent/generalizations/arabic1gen.html. The generalizations represent a set of speech behaviors, like r-trilling, that the speaker employs. As the archive grows, we expect that users can learn about specific accented speech behaviors by simply comparing the generalization pages.

To help the user contextualize the linguistic data, the home page of the archive (http://classweb.gmu.edu/accent) provides external links to a linguistic atlas, a political atlas, and the International Phonetic Association.

2. Speech Collection
We began with the choice between eliciting uniform data and eliciting natural data. Because one of the goals of the archive is to allow a comparison of different accents, we chose to construct an English paragraph to be read by every speaker (http://classweb.gmu.edu/paragraph.html). The elicitation device has a number of requirements. It must elicit all of the English speech sounds. All of the English consonants are represented, as are the vowels.

The archive was originally designed to collect and deliver foreign accent. Therefore, the speech elicitation device was constructed to invoke particular second language phonological behaviors. For instance, the words in the paragraph contain 20 different consonant clusters in word-initial and word-final positions, known to be difficult for learners of English (such as the initial /st/, /sp/, /sk/ clusters). Nevertheless, because all English segments are included in the paragraph, the inclusion of various native English accents did not pose an elicitation problem.

There are two other requirements of the elicitation paragraph: it must be short, so that the audio delivery over limited bandwidths can be accomplished within some reasonable time, and it must contain common words, so that reading interference is kept to a minimum. The paragraph contains 69 words. All are common words in English (except perhaps "slabs"). Most readers complete the paragraph reading within 50 seconds.
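
The word count can be checked mechanically. The short Python sketch below counts the words in the elicitation paragraph and lists the words whose spelling begins with "st", "sp", "sc", or "sk". The spelling test is only a rough orthographic stand-in for the initial /st/, /sp/, /sk/ clusters mentioned above, not a phonological analysis, and the variable names are our own.

```python
import re

PARAGRAPH = (
    "Please call Stella. Ask her to bring these things with her from the "
    "store: Six spoons of fresh snow peas, five thick slabs of blue cheese, "
    "and maybe a snack for her brother Bob. We also need a small plastic "
    "snake and a big toy frog for the kids. She can scoop these things into "
    "three red bags, and we will go meet her Wednesday at the train station."
)

# A word is taken to be any run of letters; punctuation is ignored.
words = re.findall(r"[A-Za-z]+", PARAGRAPH)
word_count = len(words)

# Orthographic approximation of word-initial /st/, /sp/, /sk/ (assumption:
# spelling is used as a proxy for pronunciation).
PREFIXES = ("st", "sp", "sc", "sk")
s_stop_words = [w for w in words if w.lower().startswith(PREFIXES)]

print(word_count)     # 69
print(s_stop_words)   # ['Stella', 'store', 'spoons', 'scoop', 'station']
```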

Uniformity remains an important objective of the archive. But the growth and size of the archive database cannot be hindered by the data gathering techniques. To maintain uniformity and growth potential, we developed a strict protocol for data collection that is used by graduate student researchers. The data gathering is linked to 2 graduate level courses at George Mason University: a phonetics course and a fieldwork in applied linguistics course. Students are trained in audio recording techniques, and in phonetic transcription. We utilize portable digital minidisk recorders for the audio capture. There is also a web-based data submission page that allows researchers from anywhere to send in data samples. The submission page (http://classweb.gmu.edu/accent/nethermail/submit.html) recapitulates the precise protocol that our local students must follow. We require that remote researchers contact us prior to their data collection, but concerns about the legitimacy of remotely submitted samples remain.

3. The audio samples
The intent of the archive is to deliver high quality sound over limited bandwidth with a maximum degree of user control. We find that delivering the audio as a Quicktime sound track meets these requirements (http://www.apple.com/quicktime/). The sound track can be placed inline so that the user does not have to leave the page to play the audio. The sound can be stopped, started, slowed down or speeded up by the user. The playback control panel allows immediate access to any portion of the sound track. This allows listeners to rapidly play crucial sections of the soundtrack over and over again. This is particularly useful for checking a phonetic transcription. Quicktime also accommodates a wide variety of codecs. It is cross-platform and it is free.

The bulk of our samples are recorded on Sony MD R-70 minidisk digital recorders with Sony microphones. The portability and recording quality of these devices determined our choice. The recordings are then transferred to an iMac for digital editing and compression. The recording is sampled at 44.1 kHz, 16-bit mono. The software package used is SndSampler, a shareware program with minimal but suitable sound editing capabilities. The sample is normalized at 92% and saved as an AIFF file. This file is then converted into a Quicktime movie soundtrack. We use Quicktime Pro to do the final compression. We shrink the file by a factor of 8 by compressing at 22.05 kHz, 16-bit mono with IMA 4:1 compression. We tested half a dozen compression schemes and this codec appears to be the most efficient while maintaining high quality playback. With this compression, our current sample sizes range from 668k (spa12) (57.19 sec.) to 234k (english3) (17.29 sec.). The 131 compressed sound samples take up 47 megabytes of space. (For a list of codecs and explanations, go to http://www.terran.com/CodecCentral/Codecs/index.html)
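
The factor of 8 follows from two steps: halving the sampling rate halves the data rate, and the 4:1 codec divides it by four again. The Python sketch below works through that arithmetic. It treats 4:1 as an exact nominal ratio and ignores any file or codec overhead, which is an assumption; the nominal estimates it gives come out somewhat below the file sizes reported above, presumably because of such overhead. The function and variable names are ours.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int,
                         channels: int = 1) -> int:
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * bits_per_sample // 8 * channels


# Original recording: 44.1 kHz, 16-bit mono.
original_rate = pcm_bytes_per_second(44_100, 16)

# Delivery format: 22.05 kHz, 16-bit mono, then IMA 4:1.
downsampled_rate = pcm_bytes_per_second(22_050, 16)
NOMINAL_IMA_RATIO = 4  # assumption: exactly 4:1, no packet overhead
compressed_rate = downsampled_rate / NOMINAL_IMA_RATIO

compression_factor = original_rate / compressed_rate


def nominal_size_bytes(duration_sec: float) -> float:
    """Nominal compressed size, ignoring container and codec overhead."""
    return duration_sec * compressed_rate


print(original_rate)       # 88200 bytes per second
print(compressed_rate)     # 11025.0 bytes per second
print(compression_factor)  # 8.0
```

Calling nominal_size_bytes(57.19) gives a lower bound for the longest sample discussed above; the same function can be used to budget download time for a given bandwidth.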

4. Phonetic transcriptions
The phonetic representations included with each speech sample are truly the most labor-intensive and problem-prone component in the archive. Each speech sample must be transcribed by 2-4 phonetically trained transcribers, with the final representation reached by deliberated consensus. Graduate assistants and the principal investigator typically do the transcriptions. We are following the 1996 version of the International Phonetic Alphabet (IPA, 1999; http://www.arts.gla.ac.uk/IPA/ipa.html). Our phonetic transcriptions are generally narrow ones; we make minimal assumptions about the phonemic structure of the native languages. Nevertheless, it is assumed that phonetic transcription does not always proceed in a theoretical vacuum (Laver, 1994, p. 3). There are instances when our phonetic judgements are affected by our knowledge of contrastive analysis.

The transcriptions concentrate on segmentals, and do not deal with stress or tone. Even though most speakers produce continuous speech, we arbitrarily leave spaces between each word for readability. We also add extra spaces to indicate pauses. An example of one of the transcriptions is given here for English1: http://classweb.gmu.edu/accent/ipagifs/english.GIF

The font used here is called IPAphon (http://www.chass.utoronto.ca:8080/~rogers/fonts.html), created by Henry Rogers. It comes in Macintosh and PC versions.

The most persistent problems are encountered when we attempt to share documents with the IPA font. Students who use different PC versions of Microsoft Word cannot exchange and read the font with Macintosh users of Word. Even when both platforms have the latest versions of various word processors, there is difficulty. Converting to HTML does not always help the situation. The font does not seem to translate. There appears to be no easy solution for representing an IPA transcription on the web with a suitable and easily attainable font. And as far as we can tell, there is as yet no adequate Unicode IPA font that all users could use on all platforms (http://www.unicode.org/).

We instead bypass the problem and simply convert each IPA text document into a GIF image. We do this easily with WordPerfect 3.5 for the Macintosh, and GraphicConverter for the Macintosh. These IPA GIF images are complete transcriptions. They can be read by any browser. But when the transcription needs to be modified, the text version must be edited and a new GIF image must be constructed. Until we find a better solution for this font translation problem, our graduate assistants are being given Macintosh laptops with WordPerfect and the IPAphon font installed.
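
Because every edit to a text transcription requires a fresh GIF, it is easy for the two to drift apart. The following Python sketch shows one way such drift could be detected, by listing text files whose GIF is missing or older than the text. It is not part of our current workflow: the directory layout, the ".txt" suffix, and the function name are hypothetical, and the conversion itself would still be done by hand in WordPerfect and GraphicConverter.

```python
from pathlib import Path
from typing import List


def stale_transcriptions(text_dir, gif_dir, text_suffix: str = ".txt") -> List[str]:
    """Return names of text files whose GIF is missing or out of date.

    Assumes (hypothetically) that 'arabic1.txt' in text_dir corresponds
    to 'arabic1.gif' in gif_dir.
    """
    stale = []
    for text_file in sorted(Path(text_dir).glob(f"*{text_suffix}")):
        gif_file = Path(gif_dir) / (text_file.stem + ".gif")
        # A GIF is stale if it does not exist or was last modified
        # before the text version it was generated from.
        if (not gif_file.exists()
                or gif_file.stat().st_mtime < text_file.stat().st_mtime):
            stale.append(text_file.name)
    return stale
```

Running the check before the site is updated would give a list of the images that still need to be regenerated.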

5. Conclusions
During our construction of this archive we have been confronted with various tensions. There is the tension between uniformity and database size, the signal tension between quality and bandwidth size, and the transcription tension between phonetic narrowness and theoretical relevance. Each of these tensions has been dealt with by choosing some balanced point on each continuum.

There are still unresolved issues. For example, we have yet to determine the point at which the archive will be complete and representative. How many Spanish samples are required to represent the Spanish language? Which English dialects should be included? And which native English variety should be considered to be the archetypal variety? This is a crucial decision, since all of the generalization pages are based upon the answer.

Notwithstanding these problems, the Speech Accent Archive remains a free and growing source for many types of users including:
a. ESL teachers who instruct non-native speakers of English
b. actors who need to learn an accent
c. engineers who train speech recognition machines
d. linguists who do research on foreign accent
e. anyone who finds foreign accent to be interesting

References
Chreist, F. (1964). Foreign Accent. Englewood Cliffs: Prentice-Hall.

Ioup, G., and Weinberger, S. (Eds.). (1987). Interlanguage Phonology. Cambridge, MA: Newbury House.

International Phonetic Association. (1999). Handbook of the International Phonetic Association. Cambridge: Cambridge University Press.

Laver, J. (1994). Principles of Phonetics. Cambridge: Cambridge University Press.

Leather, J., and James, A. (1996). Second Language Speech. In Ritchie, W., and Bhatia, T. (Eds.), Handbook of Second Language Acquisition. San Diego: Academic Press.

Lippi-Green, R. (1997). English with an Accent. London: Routledge.

Preston, D. (1989). Perceptual Dialectology. Dordrecht: Foris.

Regional Rehabilitation Hospital. (2000). Speech Accent Modification Program. (Flyer).

Rubin, D. (1992). Nonlanguage Factors Affecting Undergraduates' Judgements of Non-native English Speaking Teaching Assistants. Research in Higher Education 33, 511-531.