NEW! SLX Corpus of Classic Sociolinguistic Interviews now available

NEW! Presentation slides available from NWAVE2003 Workshop on Robust sociolinguistic methodology: Tools, data and best practices


1. Introduction

The Data and Annotations for Sociolinguistics (DASL) project investigates best practices in the use of digital speech corpora to address problems in sociolinguistic theory.

The quantitative study of linguistic variation is necessarily based upon empirical observation and statistical description of linguistic behavior. Collecting and annotating databases plays a crucial role in quantitative sociolinguistics. The current state of computing technology encourages the collection, annotation, analysis and even summarization and presentation of linguistic behavior wholly within the digital domain. Digital data is easily shared, and that in turn encourages a whole range of positive practices. However, the use of speech corpora in sociolinguistics also raises questions, both theoretical and methodological. The goal of the DASL Project is to begin to address these issues via a case study involving the analysis of a well-documented sociolinguistic variable as it appears (or does not) in several large, well-documented speech corpora.

1.1 Value of Shared Linguistic Data

The ability to easily share digital data encourages collaboration via:

  • the comparison of results across studies
  • the use of stable data to benchmark new or competing models and methodologies
  • the reannotation and reuse of existing data for new purposes
  • the measurement of interannotator consistency
  • the reduction of impediments facing new participants in the research community
Sharing data does not, however, diminish the value of ongoing data collection. Both the researcher and the research community benefit from new contributions. The researcher gains new skills and a unique appreciation of the subject pool, while the research community gains not only a new data set but also new perspectives and new methodological approaches. The DASL Project hopes to encourage data sharing and the re-annotation and reuse of published data as an important complement to first-hand fieldwork.

1.2 Overview of the Current Study

The Sociolinguistic Annotation project will investigate the well-documented process of t/d deletion in four large digital speech corpora: TIMIT, Switchboard-1, CallHome American English and Hub-4 English Broadcast News. These will be described below. A team of annotators will code the corpora for t/d deletion so that interannotator agreement can be measured. The interface used to conduct the annotation allows linguists to interact with the corpora, both text and speech, via the worldwide web so that this project can generalize to include multiple sites.
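The text above does not say which agreement statistic the project uses. As one illustrative possibility, the sketch below computes raw agreement and Cohen's kappa for two annotators who coded the same tokens. The labels "deleted" and "retained", the function name and the sample codes are assumptions made for the example, not part of the DASL coding scheme.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Return (observed agreement, Cohen's kappa) for two annotators."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two equal-length, non-empty code lists")
    n = len(codes_a)
    # Proportion of tokens on which the two annotators chose the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical codes for six tokens (illustration only).
annotator_1 = ["deleted", "retained", "retained", "deleted", "retained", "retained"]
annotator_2 = ["deleted", "retained", "deleted", "deleted", "retained", "retained"]
observed, kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement two annotators would reach by chance, which matters when one code (for example, retention) is much more frequent than the other.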

In addition to the empirical study of t/d deletion and the methodological questions concerning the use of speech corpora in sociolinguistics, this project will address several other questions:

  • How do the corpora used in this study relate to the data most commonly used in quantitative sociolinguistics, namely recordings of sociolinguistic interviews?
  • Do the insights gained from the large scale study of a geographically diffuse subject pool differ qualitatively from speech community studies?
  • What is the rate of interannotator consistency for the task of coding t/d deletion?
  • Can studies of similar variables be organized on a large scale with teams of non-specialist annotators?
One of the most interesting issues surrounding the use of the proposed corpora for sociolinguistics is that of style. Speaking style plays an important role in the stratification of many sociolinguistic variables. A great number of quantitative studies of variation rely on data collected during sociolinguistic interviews that may combine conversation, storytelling and question answering with more formal interactions such as readings and word games. The corpora used herein involve somewhat different interactions. How do these interactions fit among the constellation of styles already studied? Do these interactions have correlates in everyday life?

The TIMIT corpus contains over 600 speakers, each reading a set of 10 phonetically rich sentences selected from a larger pool. How do this method and, more importantly, the resulting data compare to the reading selections and word list elicitations common in sociolinguistic interviews? The Hub-4 corpus contains many hours of broadcast news. Network anchorpeople produce most of these utterances, though there are also man-on-the-street interviews. What can we learn about the role of variation in this speech that is so familiar to American TV viewers? In the Switchboard corpora, speakers participate in multiple short telephone conversations with each other. Does the resulting data pattern like the early part of a sociolinguistic interview, when the interviewer is looking for topics of common interest? The CallHome data, perhaps the most interesting, contains 30-minute conversations among family members and close friends. Although the participants know they are being recorded, it is clear that they often forget or ignore that fact. How does this data compare to the styles already well studied in quantitative sociolinguistics?

DASL will provide analyses of each of the corpora. Because the data is available in digital form with transcripts, it can be annotated and analyzed for multiple variables with relative efficiency.

2. The Variable

The DASL Project will begin with the analysis of -t/d deletion in English. -t/d deletion is a well-understood, stable variable common in multiple varieties of English. This variable shows similar patterns of stratification within the many diverse speech communities in which it has been studied. The variable is easily coded, making it an ideal first choice in a study where inter-annotator agreement is a focus.

3. The Data

The data for this project come from four corpora, each created for a purpose other than sociolinguistics but capable of being reannotated to serve our purposes. The data has already been transcribed and segmented so that individual speaker turns can be retrieved separately. Within long speaker turns, individual pause groups are segmented.
 
Corpus                           ISBN           Minutes  Data Type
TIMIT                            1-58563-019-5  6300     Phonetically Rich Sentences
Switchboard-1                    1-58563-121-3  12000    Short Conversations with Constrained Topics among Strangers
CallHome American English        1-58563-111-6  1200     Long Conversations with Free Topics among Intimates
American English Broadcast News  1-58563-109-4  6240     Broadcast News

3.1 Sampling

Before annotation begins we first search the corpus for words of potential interest. Because the corpora are segmented at the speaker turn or pause group level, locating the speech corresponding to a -t/d token is simple. During this first effort we searched the orthographic transcripts of the speech files. As a result the queries used are perhaps more complicated than necessary. The annotation tool accepts a query in the form of a regular expression. We used a regular expression to find any consonant followed by a "t" or "d" at the end of a word where the following word does not begin with a "t" or "d". Regular expressions and English orthography do not combine perfectly. Left alone, this query would return many words that we would prefer to avoid. While we could train the annotators to ignore cases that look erroneously like candidates for -t/d deletion, the time required to reject these is significant. To ameliorate this problem we use a series of filters which remove the "false hits" from consideration. In the case of TIMIT, initial queries reduced a corpus of 54,387 words down to a review list of 2,059 words, of which 1,578 were actual -t/d tokens.
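The query and filtering steps described above can be sketched as follows. This is a minimal illustration, not the project's actual query: the exact regular expression, the treatment of "y", the requirement that a following word be present, and the contents of the exclusion list are all assumptions. The exclusion list uses "could", "should" and "would", whose spelled "l" is silent, as examples of orthographic false hits.

```python
import re

# Orthographic consonant letter + word-final "t" or "d", followed (lookahead)
# by whitespace and a word that does not begin with "t" or "d".
# The text is lowercased before matching, so only a-z needs to be handled.
CANDIDATE = re.compile(
    r"\b[a-z']*[bcdfghjklmnpqrstvwxz][td]\b(?=\s+(?![td])[a-z])"
)

# Hypothetical filter: spellings that look like consonant + d but have no
# final cluster in pronunciation.
FALSE_HITS = {"could", "should", "would"}


def find_candidates(transcript):
    """Return every orthographic match, including false hits."""
    return CANDIDATE.findall(transcript.lower())


def apply_filters(candidates):
    """Drop known false hits so annotators do not have to reject them."""
    return [word for word in candidates if word not in FALSE_HITS]


text = "She kept a list and told him it would rain at the west end"
candidates = find_candidates(text)
review_list = apply_filters(candidates)

# Share of the TIMIT review list that turned out to be real -t/d tokens,
# using the counts reported in the text (1,578 of 2,059).
token_share = 1578 / 2059
```

In this sample, "and" is skipped because the next word begins with "t", and "end" is skipped because no word follows it; whether utterance-final words should be included is a design choice the text does not settle.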

For subsequent efforts we would like to use a pronouncing lexicon as an intermediary to the search. In other words, the pronouncing lexicon would provide the list of English words susceptible to -t/d deletion. The interface would then search for those words in the transcripts which are in turn time-aligned to the audio. 
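A lexicon-mediated search of this kind might look like the sketch below. The miniature lexicon, its ARPAbet-style phone symbols, the segment format (start time, end time, text) and the time values are all hypothetical; the text does not name a particular pronouncing lexicon or alignment format.

```python
# Hypothetical mini-lexicon; a real run would load a full pronouncing lexicon.
LEXICON = {
    "kept": ["K", "EH", "P", "T"],
    "would": ["W", "UH", "D"],
    "cold": ["K", "OW", "L", "D"],
    "missed": ["M", "IH", "S", "T"],
    "bat": ["B", "AE", "T"],
    "rain": ["R", "EY", "N"],
}

# Vowel symbols in the assumed ARPAbet-style notation.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def susceptible_words(lexicon):
    """Words whose pronunciation ends in a consonant followed by T or D."""
    return {
        word
        for word, phones in lexicon.items()
        if len(phones) >= 2
        and phones[-1] in {"T", "D"}
        and phones[-2] not in VOWELS
    }


def find_tokens(segments, targets):
    """Return (start, end, word) for each target word in each segment."""
    hits = []
    for start, end, text in segments:
        for word in text.lower().split():
            if word in targets:
                hits.append((start, end, word))
    return hits


# Hypothetical time-aligned transcript segments (times in seconds).
segments = [
    (0.00, 2.15, "she kept walking"),
    (2.15, 4.80, "it would rain on a cold morning"),
    (4.80, 6.10, "we missed it"),
]
targets = susceptible_words(LEXICON)
hits = find_tokens(segments, targets)
```

Because the selection is made on pronunciations rather than spellings, "would" never enters the review list and "missed" (spelled with "-ed" but ending in a consonant plus T) is found without a special case.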

4. The Annotation

Once the corpora have been concordanced, filtered and prepared for annotation, they will be exhaustively annotated. 

4.1 Annotation Specification

The variable under study is -t/d deletion. For each token, the annotator makes judgements with respect to four factor groups: status of the dependent variable; morphological category; preceding segment and following segment. For details on the annotation specification (coding scheme) please click here.

4.2 Tools

Using a customized sociolinguistic annotation tool, users can query each corpus to select just those tokens of potential interest, greatly reducing the time needed to code data. An interactive web-based display allows annotators to view each token, listen to the utterance and view the corresponding waveform, access demographic data and code linguistic factors. The annotator can simply click on the word to hear it spoken. Following each token, the interface displays the factors to be coded. Each factor is shown as a radio button, and coding a token entails clicking on the button corresponding to the relevant factor within each factor group. A comment field also appears after each token for the annotator to record notes. Results are easily exported to a spreadsheet or statistical analysis package. Click here for a screen shot of the main DASL interface or click here to see the applet that displays and plays waveforms.
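To make the coding and export steps concrete, the sketch below models one coded token with the four factor groups named in section 4.1 plus the comment field, and writes a list of tokens as CSV text that a spreadsheet or statistics package could read. The class name, field names and factor values are hypothetical; the actual coding scheme and export format of the DASL tool are not given here.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class TokenCoding:
    """One coded token: one choice per factor group, plus a free comment."""
    corpus: str
    word: str
    dependent_variable: str       # e.g. whether the final t/d was deleted
    morphological_category: str
    preceding_segment: str
    following_segment: str
    comment: str = ""


def to_csv(codings):
    """Serialize coded tokens as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(TokenCoding)],
        lineterminator="\n",
    )
    writer.writeheader()
    for coding in codings:
        writer.writerow(asdict(coding))
    return buffer.getvalue()


# Hypothetical factor values, for illustration only.
codings = [
    TokenCoding("TIMIT", "kept", "deleted", "semiweak past", "stop", "vowel"),
    TokenCoding("TIMIT", "west", "retained", "monomorpheme", "fricative",
                "vowel", "clear release"),
]
csv_text = to_csv(codings)
```

Storing one column per factor group keeps the exported table in the shape most statistical packages expect: one row per token and one categorical variable per column.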
 

5. Progress and Results

The basic infrastructure for the DASL Pilot is in place and annotation is underway. The TIMIT corpus has been completely annotated for -t/d deletion; annotation of Switchboard has just begun. For details on DASL progress, please click here.

Preliminary results of the annotation of the TIMIT database are now available. Please click here to see them.
 

6. Feedback

We welcome your comments on any aspect of this project. Please click here to write the project leaders. 



Contact Christopher.Cieri@ldc.upenn.edu

Contact Stephanie.Strassel@ldc.upenn.edu

Last modified: Tuesday, 27-Mar-01 16:26:30
© 2000 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.