Workshop on Web-Based Language Documentation and Description
12-15 December, 2000
Philadelphia, PA
Abstract
Every human language speaker has an accent. Accents, and particularly foreign
accents, have much to offer linguistic theory, for they serve as windows through
which to view native grammars. Most discussions of non-native speech rely upon
flat, static, paper descriptions. This is a report on an on-line web repository
of digitized non-native English speech (http://classweb.gmu.edu/accent).
This Speech Accent Archive contains fully accessible recorded audio samples
from more than 131 non-native speakers representing over 51 languages. Each
sample constitutes a type of annotated signal. The samples include the digitized
audio, a set of demographic characteristics about each speaker, a phonetically
transcribed representation of the signal, and a set of phonological generalizations
about the speech. The web site also includes a protocol for remote researchers
to electronically contribute data samples to the archive. Some of the issues
that will be discussed include the collection, storage, and delivery of the
audio signals and their annotations. We focus upon the speech collection device,
the delivery of the audio signal, and the representation of the phonetic transcription.
We detail the problems involved in each of these activities and report upon
our own best practices.
0. Introduction
Every human who speaks a language has an accent, and every human who listens
to others talk perceives an accent. This is true for both regional accents within the
same language group, and for foreign accents. Human listeners tend to construct
judgements about other speakers, and while the consequences of these biases are
rarely as severe as that which purportedly obtained between the Gileadites and the
Ephraimites, the judgements are often biased, and reveal serious discrimination in
our society (Lippi-Green, 1997; Preston, 1989; Rubin, 1992). Foreign accented speech is
particularly susceptible to such judgements, and even from non-naive perspectives
there remain some serious misunderstandings about the nature and value of
foreign accent. For example, speaking with an accent has variously been viewed as a
pathological condition (Chreist, 1964; Regional Rehabilitation Hospital, 2000). From
a mainstream theoretical linguistic position, it has been viewed as deficient data,
somehow lacking the qualities of native language data.
Second language acquisition phonological studies, by definition, use accented speech as a major data source. The conclusions of many of these studies suggest that foreign accented speech not only contains valuable linguistic clues to a speaker's internalized native phonology, but also shows universal characteristics (Ioup and Weinberger, 1987; Leather and James, 1996).
In this paper, we report on the construction of an archive that compiles and delivers
annotated accented speech signals. The archive is structured to provide uniform,
searchable, and annotated data to anyone doing linguistic research in accented
speech. Section 1 describes the organization of the archive, section 2 deals with
the speech collection methodology, section 3 discusses the digitized audio samples,
and section 4 deals with the problems of phonetic annotation.
1. The archive
The archive is located at http://classweb.gmu.edu/accent.
As of 24 November 2000, the archive contains 131 samples from 51 language backgrounds.
The languages include: Afrikaans, Agny, Amharic, Arabic, Armenian, Bambara,
Bengali, Bosnian, Cantonese, Czech, English, Farsi, Finnish, French, German,
Greek, Gujarati, Gusii, Hebrew, Hungarian, Igbo, Italian, Japanese, Khalkha
Mongol, Kiswahili, Korean, Kurdish, Lao, Latvian, Malayalam, Mandarin, Mauritian,
Norwegian, Polish, Portuguese, Punjabi, Russian, Serbo-Croatian, Slovak, Somali,
Spanish, Swedish, Synthesized, Taiwanese, Thai, Tibetan, Turkish, Urdu, Uzbek,
Vietnamese, and Wolof. Some of the language categories, like Uzbek, have just
one speaker sample, and others, like English and Spanish, have more than 10
speaker samples from different native language regions. Each speaker is recorded
according to a required protocol. Each speaker reads the following paragraph:
Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
Each sample has its own page. The page includes a summary report on the 7 demographic variables, a Quicktime soundtrack of the speech sample, and a phonetic transcription of the speech. Some of the samples include a link to a speaker-specific phonological generalization page. For an example of some generalizations for an Arabic speaker of English, go to http://classweb.gmu.edu/accent/generalizations/arabic1gen.html. The generalizations represent a set of speech behaviors, like r-trilling, that the speaker employs. As the archive grows, we expect that users can learn about specific accented speech behaviors by simply comparing the generalization pages.
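The structure of a sample page can be pictured as a simple record. The following Python sketch is only an illustration, not the archive's actual storage format: the class name, field names, and sample identifiers are assumptions, the seven demographic variables are left as an open dictionary because they are not enumerated here, and every generalization label other than r-trilling is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechSample:
    """One archive entry (hypothetical layout, not the archive's own schema)."""
    sample_id: str                                      # placeholder identifier
    native_language: str
    demographics: dict = field(default_factory=dict)    # the 7 demographic variables
    audio_file: str = ""                                # compressed soundtrack
    transcription: str = ""                             # phonetic transcription
    generalizations: set = field(default_factory=set)   # observed speech behaviors


def shared_generalizations(a: SpeechSample, b: SpeechSample) -> set:
    """Return the speech behaviors that two samples have in common."""
    return a.generalizations & b.generalizations


# Invented data: only "r-trilling" is a behavior named in the text.
first = SpeechSample("example-1", "Language A",
                     generalizations={"r-trilling", "hypothetical behavior A"})
second = SpeechSample("example-2", "Language B",
                      generalizations={"r-trilling", "hypothetical behavior B"})

print(shared_generalizations(first, second))  # {'r-trilling'}
```

Storing the generalizations as a set makes the comparison of two speakers a single intersection, which is the kind of side-by-side reading of generalization pages described above.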
To help the user contextualize the linguistic data, the home page of the archive (http://classweb.gmu.edu/accent) provides external links to a linguistic atlas, a political atlas, and the International Phonetic Association.
2. Speech Collection
We began with the choice between eliciting uniform data and eliciting
natural data. Because one of the goals of the archive is to allow a comparison
of different accents, we chose to construct an English paragraph to be read
by every speaker (http://classweb.gmu.edu/paragraph.html).
The elicitation device has a number of requirements. It must elicit all of the
English speech sounds. All of the English consonants are represented, as are
the vowels.
The archive was originally designed to collect and deliver foreign accent. Therefore, the speech elicitation device was constructed to invoke particular second language phonological behaviors. For instance, the words in the paragraph contain 20 different consonant clusters in word-initial and word-final positions, known to be difficult for learners of English (such as the initial /st/, /sp/, /sk/ clusters). Nevertheless, because all English segments are included in the paragraph, the inclusion of various native English accents did not pose an elicitation problem.
There are two other requirements of the elicitation paragraph: it must be short, so that the audio delivery over limited bandwidths can be accomplished within some reasonable time, and it must contain common words, so that reading interference is kept to a minimum. The paragraph contains 69 words. All are common words in English (except perhaps "slabs"). Most readers complete the paragraph reading within 50 seconds.
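Some of the stated properties of the elicitation paragraph can be checked mechanically. The Python sketch below counts the words and lists the words spelled with an initial "s" followed by a consonant letter. This is an orthographic approximation only (for example, the /sk/ cluster in "scoop" is spelled "sc"), so it does not reproduce the phonological count of 20 clusters; the function names are illustrative.

```python
import re

PARAGRAPH = (
    "Please call Stella. Ask her to bring these things with her from the "
    "store: Six spoons of fresh snow peas, five thick slabs of blue cheese, "
    "and maybe a snack for her brother Bob. We also need a small plastic "
    "snake and a big toy frog for the kids. She can scoop these things into "
    "three red bags, and we will go meet her Wednesday at the train station."
)


def tokenize(text):
    """Split the paragraph into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def s_cluster_words(words):
    """Words spelled with initial 's' plus a consonant letter.

    The letter 'h' is excluded because 'sh' spells a single sound.
    """
    return sorted({w for w in words if re.match(r"s[^aeiouh]", w)})


words = tokenize(PARAGRAPH)
print(len(words))              # 69
print(s_cluster_words(words))
```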
Uniformity remains an important objective of the archive, but the growth of the archive database cannot be hindered by the data gathering techniques. To maintain both uniformity and growth potential, we developed a strict protocol for data collection that is used by graduate student researchers. The data gathering is linked to two graduate-level courses at George Mason University: a phonetics course and a fieldwork in applied linguistics course. Students are trained in audio recording techniques and in phonetic transcription. We use portable digital minidisk recorders for the audio capture. There is also a web-based data submission page that allows researchers from anywhere to send in data samples. The submission page (http://classweb.gmu.edu/accent/nethermail/submit.html) recapitulates the precise protocol that our local students must follow. We require that remote researchers contact us prior to their data collection, but the legitimacy of remotely submitted data remains a potential concern.
3. The audio samples
The intent of the archive is to deliver high quality sound over limited
bandwidth with a maximum degree of user control. We find that delivering the
audio as a Quicktime sound track meets these requirements (http://www.apple.com/quicktime/).
The sound track can be placed inline so that the user does not have to leave
the page to play the audio. The sound can be stopped, started, slowed down or
speeded up by the user. The playback control panel allows immediate access to
any portion of the sound track. This allows listeners to rapidly play crucial
sections of the soundtrack over and over again. This is particularly useful
for checking a phonetic transcription. Quicktime also accommodates a wide variety
of codecs. It is cross-platform and it is free.
The bulk of our samples are recorded on Sony MD R-70 minidisk digital recorders
with Sony microphones. The portability and recording quality of these devices
determined our choice. The recordings are then transferred to an iMac for digital
editing and compression. The recording is sampled at 44.1 kHz, 16-bit mono. The
software package used is SndSampler, a shareware program with minimal but suitable
sound editing capabilities. The sample is normalized at 92% and saved as an
AIFF file. This file is then converted into a Quicktime movie soundtrack. We
use Quicktime Pro to do the final compression. We shrink the file by a factor
of 8 by compressing at 22.05 kHz, 16-bit mono with IMA 4:1 compression. We tested
half a dozen compression schemes and this codec appears to be the most efficient
while maintaining high quality playback. With this compression, our current
sample sizes range from 234k (english3, 17.29 sec.) to 668k (spa12, 57.19
sec.). The 131 compressed sound samples take up 47 megabytes of space. (For
a list of codecs and explanations, go to http://www.terran.com/CodecCentral/Codecs/index.html.)
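The factor of 8 follows from the two steps described above: halving the sample rate halves the data rate, and the nominal 4:1 ratio of the codec divides it by four again. The Python sketch below works through that arithmetic. It is a nominal estimate under the assumption of an exact 4:1 ratio, so it ignores codec block headers and file container overhead, and actual file sizes will be somewhat larger; the function names are illustrative.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels / 8


original_rate = pcm_bytes_per_second(44_100, 16)     # recording as captured
downsampled_rate = pcm_bytes_per_second(22_050, 16)  # after halving the sample rate
compressed_rate = downsampled_rate / 4               # nominal IMA 4:1 ratio

print(original_rate / compressed_rate)  # 8.0


def nominal_size_bytes(duration_sec):
    """Nominal compressed size, ignoring block headers and file overhead."""
    return duration_sec * compressed_rate


# Nominal payload for the longest sample (57.19 sec.), in bytes.
print(round(nominal_size_bytes(57.19)))
```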
4. Phonetic transcriptions
The phonetic representations included with each speech sample are truly
the most labor-intensive and problem-prone component in the archive. Each speech
sample must be transcribed by 2-4 phonetically trained transcribers, with the
final representation reached by deliberated consensus. Graduate assistants and
the principal investigator typically do the transcriptions. We are following
the 1996 version of the International Phonetic Alphabet (IPA, 1999; http://www.arts.gla.ac.uk/IPA/ipa.html).
Our phonetic transcriptions are generally narrow ones--we make minimal assumptions
about the phonemic structure of the native languages. Nevertheless, we assume
that phonetic transcription does not always proceed in a theoretical vacuum
(Laver, 1994, p. 3). There are instances when our phonetic judgements are affected
by our knowledge of contrastive analysis.
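The consensus among transcribers is reached by deliberation, not by any automatic procedure, but the preparation for that discussion can be sketched in code. The Python example below flags the words on which independent drafts differ, so that the transcribers know where to focus. It assumes that the drafts are available as plain Unicode strings with one space-separated form per word, which is an assumption of this sketch and not a description of our workflow; the function name and the three sample transcriptions are invented.

```python
import unicodedata


def flag_disagreements(transcriptions):
    """Map word index -> set of competing variants for words that differ."""
    # Normalize so that visually identical symbols compare as equal.
    tokenized = [unicodedata.normalize("NFC", t).split() for t in transcriptions]
    if len({len(words) for words in tokenized}) != 1:
        raise ValueError("word counts differ; align the transcriptions by hand first")
    flagged = {}
    for index, variants in enumerate(zip(*tokenized)):
        if len(set(variants)) > 1:
            flagged[index] = set(variants)
    return flagged


# Invented drafts of "Please call Stella" by three hypothetical transcribers.
drafts = [
    "pʰliz kʰɑl stɛlə",
    "pliz kʰɑl stɛlə",
    "pʰliz kʰɔl stɛlə",
]
print(sorted(flag_disagreements(drafts)))  # [0, 1]
```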
The transcriptions concentrate on segmentals, and do not deal
with stress or tone. Even though most speakers produce continuous speech, we
arbitrarily leave spaces between each word for readability. We also add extra
spaces to indicate pauses. An example of one of the transcriptions is given
here for English1:
The font used here is called IPAphon (http://www.chass.utoronto.ca:8080/~rogers/fonts.html), created by Henry Rogers. It comes in Macintosh and PC versions.
The most persistent problems arise when we attempt to share documents that use the IPA font. Students who use different PC versions of Microsoft Word cannot exchange documents in the font with Macintosh users of Word. Even when both platforms have the latest versions of various word processors, there is difficulty. Converting to HTML does not always help, because the font does not seem to translate. There appears to be no easy solution for representing an IPA transcription on the web with a suitable and easily attainable font. And as far as we can tell, there is as yet no adequate Unicode IPA font that all users could use on all platforms (http://www.unicode.org/).
We instead bypass the problem and simply convert each IPA text document into a GIF image. We do this easily with WordPerfect 3.5 and GraphicConverter, both for the Macintosh. These IPA GIF images are complete transcriptions and can be read by any browser. But when a transcription needs to be modified, the text version must be edited and a new GIF image must be constructed. Until we find a better solution for this font translation problem, our graduate assistants are being given Macintosh laptops with WordPerfect and the IPAphon font installed.
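Because each GIF image has to be rebuilt whenever its text source changes, it is useful to know which images are out of date. The Python sketch below is one hypothetical way to check this by comparing file modification times. The directory layout, the ".txt" suffix for source documents, and the function name are all assumptions made for the illustration; they do not describe the archive's actual files.

```python
from pathlib import Path


def stale_transcriptions(text_dir, gif_dir, text_suffix=".txt"):
    """Return names of transcriptions whose GIF is missing or older than its source.

    Assumes (hypothetically) that a source named NAME<text_suffix> in text_dir
    corresponds to an image named NAME.gif in gif_dir.
    """
    stale = []
    for source in sorted(Path(text_dir).glob(f"*{text_suffix}")):
        image = Path(gif_dir) / (source.stem + ".gif")
        if not image.exists() or image.stat().st_mtime < source.stat().st_mtime:
            stale.append(source.stem)
    return stale
```

A call such as stale_transcriptions("transcriptions", "images"), with both directory names hypothetical, would list the samples whose images need to be regenerated by hand.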
5. Conclusions
During our construction of this archive we have been confronted with various
tensions. There is the tension between uniformity and database size, the signal
tension between quality and bandwidth size, and the transcription tension between
phonetic narrowness and theoretical relevance. Each of these tensions has been
dealt with by choosing some balanced point on each continuum.
There are still unresolved issues. For example, we have yet to determine the point at which the archive will be complete and representative. How many Spanish samples are required to represent the Spanish language? Which English dialects should be included? And which native English variety should be considered to be the archetypal variety? This is a crucial decision, since all of the generalization pages are based upon the answer.
Notwithstanding these problems, the Speech Accent Archive remains a free and
growing source for many types of users including:
a. ESL teachers who instruct non-native speakers of English
b. actors who need to learn an accent
c. engineers who train speech recognition machines
d. linguists who do research on foreign accent
e. anyone who finds foreign accent to be interesting
References
Chreist, F. (1964). Foreign Accent. Englewood Cliffs: Prentice-Hall.
Ioup, G., and Weinberger, S. (Eds.). (1987). Interlanguage Phonology. Cambridge,
MA: Newbury House.
International Phonetic Association. (1999). Handbook of the International
Phonetic Association. Cambridge: Cambridge University Press.
Laver, J. (1994). Principles of Phonetics. Cambridge: Cambridge University
Press.
Leather, J., and James, A. (1996). Second Language Speech. In Ritchie, W., and
Bhatia, T. (Eds.), Handbook of Second Language Acquisition. San Diego:
Academic Press.
Lippi-Green, R. (1997). English with an Accent. London: Routledge.
Preston, D. (1989). Perceptual Dialectology. Dordrecht: Foris.
Regional Rehabilitation Hospital. (2000). Speech Accent Modification Program.
(Flyer).
Rubin, D. (1992). Nonlanguage Factors Affecting Undergraduates' Judgements of
Non-native English Speaking Teaching Assistants. Research in Higher Education
33, 511-531.