Two Approaches to Linguistic Field Work on the Web:
The TELL and Ingush Projects

Ronald Sprouse, UC Berkeley

Department of Linguistics, 1203 Dwinelle Hall
UC Berkeley, Berkeley, CA 94720-2650, USA
ronald@uclink.berkeley.edu


I will describe two very different on-line projects at UC Berkeley, the TELL and Ingush projects:

The goal of the Turkish Electronic Living Lexicon project (TELL; funded by NSF grant #95-14355 to PI Sharon Inkelas) is to create a publicly accessible database of Turkish lexical items that reflects actual speaker knowledge, rather than the conservative, normative and phonologically incomplete dictionary representations on which most of the existing literature is based. The database promises to allow existing claims about Turkish to be examined rigorously and expunged where they are faulty, and to provide facilities for uncovering significant new generalizations. It currently includes slightly more than 30,000 phonologically unique headwords drawn from a variety of sources: dictionaries, atlases and texts.

We have developed a web interface for the database that permits searching on the basis of a variety of criteria: phonological form (including segmental and/or prosodic properties such as stress and syllable structure), orthographic form, morphological form, and etymological origin. Multiple searches may be performed simultaneously in order to extract items that undergo particular morphophonological alternations. In addition, a special class of metacharacters using familiar notation has been constructed to make the database more accessible to linguists. For example, 'C' represents any consonant and 'V' any vowel. These metacharacters are much easier to use than their corresponding regular expressions.
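As a rough illustration of the metacharacter idea, the sketch below expands 'C' and 'V' into regular-expression character classes before searching. The function names, the segment inventories (taken from standard Turkish orthography) and the sample headwords are assumptions made for the example; TELL's actual transcription alphabet and implementation may well differ.

```python
import re

# Assumed inventories based on standard Turkish orthography;
# the classes TELL actually uses are not specified here.
VOWELS = "aeıioöuü"
CONSONANTS = "bcçdfgğhjklmnprsştvyz"

# Hypothetical metacharacter table: 'C' = any consonant, 'V' = any vowel.
METACHARACTERS = {
    "C": f"[{CONSONANTS}]",
    "V": f"[{VOWELS}]",
}

def compile_pattern(pattern):
    """Expand metacharacters; treat every other character literally."""
    parts = [METACHARACTERS.get(ch, re.escape(ch)) for ch in pattern]
    return re.compile("".join(parts))

def search(pattern, headwords):
    """Return the headwords that match the whole pattern."""
    rx = compile_pattern(pattern)
    return [w for w in headwords if rx.fullmatch(w)]

# Illustrative headwords only, not drawn from the TELL database.
words = ["kitap", "at", "göz", "kedi"]
print(search("CVC", words))   # ['göz']
print(search("kVCV", words))  # ['kedi']
```

A linguist types 'CVC' where the equivalent regular expression would spell out both character classes in full, which is the convenience the metacharacters are meant to provide.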

Currently, the TELL database includes:

In the next phase of the project, the database will be expanded along several dimensions. Etymologies and morphological parses will be filled in as much as possible. Two additional speakers will be recorded and transcriptions of all speakers will be made available on-line. In addition, digitized recordings of the speakers will be a short hyperlink away from the search results, allowing researchers to analyze word forms matching a specific phonological pattern and to verify TELL transcriptions for themselves.

The Ingush project (funded by NSF grant #96-16448 to PI Johanna Nichols) has a more traditional goal - to produce a descriptive grammar and dictionary of Ingush, along with a corpus of carefully annotated texts.

As part of our project I have developed a web-enabled program, the Berkeley Interlinear Text collector (BITC), a system for collecting interlinear texts that is designed especially for group collaboration. BITC is installed on a network server, and users access it through a web browser. This is ideal for group work: once the server is set up, practically anyone can get involved with the project without having to install special software. Everyone also benefits from the texts collected by others, because all work contributes to a shared dictionary. Because BITC runs over the Internet, geographically dispersed researchers can collaborate with each other as well. A typical BITC entry screen looks like the screenshot below:

[Figure: Sample BITC screenshot]

When a user enters a new record, words are automatically entered into separate word fields (labeled 'ING'), and the shared dictionary is consulted for already-existing glosses for each of the words. For example, the word 'xa' has been glossed previously as both 'watch' and 'time' in the example screen above. These glosses are displayed as a list from which the user may choose to gloss the word, or the user may enter a new gloss by entering that information in the translation field (labeled 'ENG') instead. New glosses are automatically entered into the shared dictionary.
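The glossing workflow just described might be sketched as follows. This is a minimal illustration, not BITC's actual code: the class and function names are hypothetical, whitespace tokenization is an assumption, and 'foo' is a placeholder token rather than an Ingush word. Only the glosses for 'xa' come from the description above.

```python
from collections import defaultdict

class SharedDictionary:
    """Hypothetical shared dictionary mapping a word to its known glosses."""

    def __init__(self):
        self._glosses = defaultdict(list)

    def lookup(self, word):
        """Return the glosses already recorded for a word (possibly none)."""
        return list(self._glosses.get(word, []))

    def add(self, word, gloss):
        """Record a gloss, ignoring exact duplicates."""
        if gloss not in self._glosses[word]:
            self._glosses[word].append(gloss)

def new_record(text, dictionary):
    """Split a line into word fields and attach existing glosses to each."""
    return [(word, dictionary.lookup(word)) for word in text.split()]

shared = SharedDictionary()
shared.add("xa", "watch")
shared.add("xa", "time")

# 'foo' is a placeholder token with no gloss yet.
record = new_record("xa foo", shared)
print(record)  # [('xa', ['watch', 'time']), ('foo', [])]

# A gloss typed into the translation field becomes available to everyone.
shared.add("foo", "placeholder gloss")
print(shared.lookup("foo"))  # ['placeholder gloss']
```

Keeping a list of glosses per word, rather than a single value, is what lets the interface offer a choice such as 'watch' versus 'time' for the same form.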

Texts are stored in a pseudo-XML format and a flexible search mechanism is provided that helps the user find entries in the dictionary by abstracting away from gender marking and ablaut patterns. The user may also hyperlink search results so that searching in the dictionary leads the user to the contexts in which a given word is used. Facilities are provided for exporting and processing the text data for inclusion in publishable material, including dictionaries and collections of texts.
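One way to link a dictionary entry to the contexts in which it occurs is a concordance built from the stored texts. The sketch below assumes a well-formed XML layout with hypothetical element and attribute names (`text`, `line`, `w`, `gloss`); BITC's pseudo-XML format is not specified here and may look quite different. The sketch also does not attempt to abstract away from gender marking or ablaut, and 'foo' is again a placeholder token.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical storage layout, invented for this example.
SAMPLE = """
<texts>
  <text id="t1">
    <line n="1"><w gloss="time">xa</w><w gloss="placeholder">foo</w></line>
    <line n="2"><w gloss="watch">xa</w></line>
  </text>
</texts>
"""

def build_concordance(xml_source):
    """Map each word form to the (text id, line number, gloss) of every use."""
    index = defaultdict(list)
    root = ET.fromstring(xml_source.strip())
    for text in root.iter("text"):
        for line in text.iter("line"):
            for w in line.iter("w"):
                index[w.text].append((text.get("id"), line.get("n"), w.get("gloss")))
    return index

concordance = build_concordance(SAMPLE)
for text_id, line_no, gloss in concordance["xa"]:
    print(f"xa '{gloss}' in text {text_id}, line {line_no}")
```

Each tuple in the index carries enough information to build a hyperlink from a search result back to the line where the word appears.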

In addition to BITC, the project has a lexical database stored in FileMaker Pro that is regularly exported to a publicly accessible search interface on the web. Unfortunately, the BITC dictionary and the lexical database are not yet coordinated, so the public search interface offers no way to access the contexts in which a given word is used.


Linguistic Exploration Workshop