NEW! SLX Corpus of Classic Sociolinguistic Interviews now available
The Data and Annotations for Sociolinguistics (DASL) Project investigates best practices in the use of digital speech corpora to address problems in sociolinguistic theory. The quantitative study of linguistic variation is necessarily based upon empirical observation and statistical description of linguistic behavior, so collecting and annotating databases plays a crucial role in quantitative sociolinguistics. The current state of computing technology encourages the collection, annotation, analysis and even summarization and presentation of linguistic behavior wholly within the digital domain. Digital data is easily shared, and that in turn encourages a whole range of positive practices. However, the use of speech corpora in sociolinguistics also raises both theoretical and methodological questions. The goal of the DASL Project is to begin to address these issues via a case study involving the analysis of a well-documented sociolinguistic variable as it appears (or does not) in several large, well-documented speech corpora.

1.1 Value of Shared Linguistic Data

The ability to easily share digital data encourages collaboration via:
1.2 Overview of the Current Study

The Sociolinguistic Annotation project will investigate the well-documented process of t/d deletion in four large digital speech corpora: TIMIT, Switchboard-1, CallHome American English and Hub-4 English Broadcast News. These will be described below. A team of annotators will code the corpora for t/d deletion so that inter-annotator agreement can be measured. The annotation interface allows linguists to interact with the corpora, both text and speech, via the World Wide Web, so that the project can be extended to multiple sites. In addition to the empirical study of t/d deletion and the methodological questions concerning the use of speech corpora in sociolinguistics, this project will address several other questions:
2. The Variable

The DASL Project will begin with the analysis of -t/d deletion in English. -t/d deletion is a well-understood, stable variable common to multiple varieties of English. It shows similar patterns of stratification in the many diverse speech communities in which it has been studied. The variable is easily coded, making it an ideal first choice for a study in which inter-annotator agreement is a focus.

3. The Data

The data for this project come from four corpora, each created for a purpose other than sociolinguistics but capable of being reannotated to serve our purposes. The data have already been transcribed and segmented so that individual speaker turns can be retrieved separately. Within long speaker turns, individual pause groups are segmented.
3.1 Sampling

Before annotation begins, we first search the corpus for words of potential interest. Because the corpora are segmented at the speaker turn or pause group level, locating the speech corresponding to a -t/d token is simple. During this first effort we searched the orthographic transcripts of the speech files. As a result, the queries used are perhaps more complicated than necessary. The annotation tool accepts a query in the form of a regular expression. We used a regular expression to find any consonant followed by a "t" or "d" at the end of a word where the following word does not begin with a "t" or "d".

Regular expressions and English orthography do not combine perfectly. Left alone, this query would return many words that we would still prefer to avoid. While we could train the annotators to ignore cases that only look like candidates for -t/d deletion, the time required to reject them is significant. To ameliorate this problem we use a series of filters that remove the "false hits" from consideration. In the case of TIMIT, the initial queries reduced a corpus of 54,387 words to a review list of 2,059 words, of which 1,578 were actual -t/d tokens.

For subsequent efforts we would like to use a pronouncing lexicon as an intermediary to the search. In other words, the pronouncing lexicon would provide the list of English words susceptible to -t/d deletion. The interface would then search for those words in the transcripts, which are in turn time-aligned to the audio.

4. The Annotation

Once the corpora have been concordanced, filtered and prepared for annotation, they will be exhaustively annotated.

4.1 Annotation Specification

The variable under study is -t/d deletion. For each token, the annotator makes judgements with respect to four factor groups: status of the dependent variable, morphological category, preceding segment and following segment. For details on the annotation specification (coding scheme) please click here.
4.2 Tools

Using a customized sociolinguistic annotation tool, users can query each corpus to select just those tokens of potential interest, greatly reducing the time needed to code data. An interactive web-based display allows annotators to view each token, listen to the utterance, view the corresponding waveform, access demographic data and code linguistic factors. The annotator can simply click on a word to hear it spoken. Following each token, the interface displays the factors to be coded. Each factor is shown as a radio button, and coding a token entails clicking on the button corresponding to the relevant factor within each factor group. A comment field also appears after each token for the annotator to record notes. Results are easily exported to a spreadsheet or statistical analysis package. Click here for a screen shot of the main DASL interface or click here to see the applet that displays and plays waveforms.
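
As a rough picture of what one coded token and its spreadsheet export might look like, here is a minimal Python sketch. It assumes one record per token with a field for each of the four factor groups plus the comment field. The class name, field names, and the factor values in the usage example are hypothetical, chosen for illustration; they are not taken from the DASL coding scheme or from project results.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CodedToken:
    """One annotated -t/d token (hypothetical layout, not the DASL schema)."""
    corpus: str
    word: str
    dependent_variable: str       # status of the dependent variable
    morphological_category: str
    preceding_segment: str
    following_segment: str
    comment: str = ""             # free-text annotator note


def to_csv(tokens: list[CodedToken]) -> str:
    """Serialize coded tokens as CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(CodedToken)],
        lineterminator="\n",
    )
    writer.writeheader()
    for token in tokens:
        writer.writerow(asdict(token))
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented example values, for illustration only.
    example = [
        CodedToken("TIMIT", "west", "retained", "monomorpheme", "s", "pause"),
        CodedToken("TIMIT", "left", "deleted", "irregular past", "f", "vowel",
                   "fast speech"),
    ]
    print(to_csv(example))
```

A flat, one-row-per-token layout like this opens directly in a spreadsheet and is also the shape most statistical packages expect for analysing variation.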
5. Progress and Results

The basic infrastructure for the DASL Pilot is in place and annotation is underway. The TIMIT corpus has been completely annotated for -t/d deletion; annotation of Switchboard has just begun. For details on DASL progress, please click here. Preliminary results of the annotation of the TIMIT database are now available. Please click here to see them.
6. Feedback

We welcome your comments on any aspect of this project. Please click here to write the project leaders.
Contact Christopher.Cieri@ldc.upenn.edu
Contact Stephanie.Strassel@ldc.upenn.edu
Last modified: Tuesday, 27-Mar-01 16:26:30