# Exploring and disseminating field data using HyperLex

## Steven Bird, University of Pennsylvania

Linguistic Data Consortium, University of Pennsylvania
3615 Market St, Suite 200, Philadelphia, PA 19104-2608, USA
sb@ldc.upenn.edu | www.ldc.upenn.edu/sb

Researchers who investigate little-studied languages often face one of two problems: too little data, or too much. A researcher relying on linguistic data published in the literature typically lacks sufficient data to do a re-analysis. Data items that would be crucial for the re-analysis (including counter-examples) will often be absent, simply because they were not germane to the published analysis. Conversely, a researcher who has immediate access to a community of speakers can quickly collect large amounts of new data. But once analysis work is attempted, he or she has a difficult time identifying the relevant cases and re-systematizing the data according to each new analysis. What is needed is a seamless approach to the creation, management, dissemination and citation of online linguistic datasets, facilitating the collaborative construction of linguistic knowledge by native speakers, field linguists and university researchers. Here, I describe and illustrate an approach which was developed both in the field and away from it over the last 5 years.

The approach revolves around the HyperLex system [1] [www.ldc.upenn.edu/hyperlex]. HyperLex is a Perl/CGI script for exploring lexicons and paradigms (involving text and audio data) stored in SIL Standard Format (see Figure 2) and for creating many different visualizations of complex datasets. The system is very useful for discovering regularities and exceptions, and it has provided the foundation for published studies of tone [4] and syllable structure [3]. The online lexicon was the basis for a published dictionary [2], and also supported research on orthography [5, 6]. Other researchers have used the system (i) to learn more about Grassfields languages, or (ii) to explore their own language data.

As an example of the first case, in his study of vowel frication in the languages of Cameroon, Bruce Connell (Oxford University) has discovered an allophonic contrast which had gone unnoticed in 25 years of scholarship. He made this interesting discovery without access to a native speaker, and without needing to set foot outside his office. The following examples illustrate the contrast.

VOWEL ASPIRATION front

ndi 'to sleep'; ndI 'to rot'; ndu 'to cry'; ndi 'lord'; ndI 'raffia string'; ndu 'stream'

To look up these forms using the online database, launch this query.

The second case, researchers exploring their own language data, is exemplified by experimental applications of HyperLex to Nahuatl, Tamil and Yoruba. At present each such application requires programming work on the part of the author, but a fully customizable version is planned.

### Exploration

The HyperLex system offers a variety of ways to explore datasets, and these were heavily used for a study of syllable structure [3]. The various search modes can be characterized in order of increasing complexity as follows:

| Type | Description | Example |
|---|---|---|
| Zeroth order | Retrieve a set of entries according to a search expression. | 1. List all H-initial tone melodies: `tone:H.*` |
| | | 2. List words containing aspiration: `root:.*h.*` |
| First order | Select entries by one criterion and display them according to another (cf. data transformation using SQL). | 3. Classify nouns according to tone and noun class: `part:n class:(.*) tone:(LR\|LDH\|LLF?) display:word,speech` |
| | | 4. Display a bigram chart for root-initial stop-vowel sequences: `root:([pbtdkgcj])([ieaouEOU@]).*` |
| Second order | Select entries depending on the existence of similar entries elsewhere in the database. | 5. Display minimal pairs for vowels o and u: `prefix:(.*) root:(.*)([ou]) x-axis:12 y-axis:3 minpairs:y-axis display:word,tone,speech` |
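The three search modes can be illustrated with a minimal Python sketch. This is not the HyperLex implementation (which is a Perl/CGI script); the toy lexicon, the function names and the use of anchored matching are all assumptions made for illustration, the last one suggested by the `.*` padding in the example queries.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon; the field names follow the example queries,
# but the entries themselves are invented for illustration.
LEXICON = [
    {"word": "lepo", "prefix": "le", "root": "po", "tone": "LH", "part": "n"},
    {"word": "lepu", "prefix": "le", "root": "pu", "tone": "LH", "part": "n"},
    {"word": "ato", "prefix": "a", "root": "to", "tone": "HL", "part": "n"},
    {"word": "ndi", "prefix": "n", "root": "di", "tone": "H", "part": "v"},
]


def select(entries, **patterns):
    """Zeroth order: keep entries whose fields fully match every pattern.

    Returns (entry, groups) pairs, where groups lists the captured groups
    of all patterns in the order the patterns were given.
    """
    results = []
    for entry in entries:
        groups = []
        for field, pattern in patterns.items():
            match = re.fullmatch(pattern, entry.get(field, ""))
            if match is None:
                break
            groups.extend(match.groups())
        else:  # no pattern failed
            results.append((entry, groups))
    return results


def tabulate(entries, display, **patterns):
    """First order: select by some fields, classify by the captured groups."""
    table = defaultdict(list)
    for entry, groups in select(entries, **patterns):
        table[tuple(groups)].append(entry[display])
    return dict(table)


def minimal_pairs(entries, display, **patterns):
    """Second order: keep contexts attested with more than one contrast value.

    The last captured group is the contrasting element; all earlier groups
    form the shared context.
    """
    contexts = defaultdict(dict)
    for entry, groups in select(entries, **patterns):
        contexts[tuple(groups[:-1])][groups[-1]] = entry[display]
    return {ctx: cells for ctx, cells in contexts.items() if len(cells) > 1}


h_initial = [e["word"] for e, _ in select(LEXICON, tone="H.*")]
by_vowel = tabulate(LEXICON, "word", root=".*([ou])")
pairs = minimal_pairs(LEXICON, "word", prefix="(.*)", root="(.*)([ou])")
```

In this sketch a second-order query differs from a first-order one only in the final filter: a cell survives if another entry shares its context but differs in the contrasting group.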

Despite its flexibility, the HyperLex model has several critical weaknesses:

• The query form is complex, with regular expressions in the fields, axis fields which refer to elements of regular expressions, and a minimal-pairs field which refers to axes.
• Users need to know a lot about the structure of the database and the representation used in each field in order to be able to formulate good queries.
• Lexical entries must be stored according to a fixed schema, which means that optional fields, repeated fields, and nested structure are not properly supported.
• The interface is good for querying, but not very suitable for browsing and it cannot be used for updates. Query results are not in a form which can be queried.
• For special characters it was necessary to use gifs. While this was workable for display, gifs could not be used in query expressions.
• By default, search is on character strings rather than words, so a query for entries whose definition contains the word 'crow' will also return entries for 'crown'. At least for certain fields, word-based search should be the default.
• Regular expressions are only applied once to a field, so that multiple matches cannot be considered. Thus the CV bigram table only considers one CV sequence from each word root. (This was not a major problem above given that roots are generally monosyllabic.)
• The logic of the query form is a conjunction of regular expressions, with at most one regular expression per field. Negation would be useful for eliminating irrelevant entries. Disjunction may also be useful.
• The data is hidden behind a CGI interface, limiting access methods. An external program (e.g. a search engine or another research tool) cannot be used to index or access the data.
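Two of the weaknesses listed above, string-based rather than word-based search and single application of a regular expression, can be made concrete with a short Python sketch. The definitions and the root `tako` are invented for illustration, and the code shows the general behaviour of regular expression matching rather than anything specific to HyperLex.

```python
import re

# Invented definitions for illustration.
definition_a = "a large black crow"
definition_b = "the crown of the head"
definitions = [definition_a, definition_b]

# A plain substring search matches both definitions;
# adding word boundaries restricts the match to the whole word.
substring_hits = [d for d in definitions if re.search("crow", d)]
word_hits = [d for d in definitions if re.search(r"\bcrow\b", d)]

# Applying a pattern once finds a single stop-vowel sequence per root,
# whereas iterating over all matches finds every one.
cv = re.compile(r"([pbtdkgcj])([ieaouEOU@])")
root = "tako"  # hypothetical disyllabic root
first_only = cv.search(root).groups()
all_matches = [m.groups() for m in cv.finditer(root)]
```

For monosyllabic roots the two approaches give the same bigram counts, which is why the single-match limitation mattered little in practice.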

These issues will be addressed in subsequent versions of the software, to be developed as part of a new NSF project entitled Multidimensional Exploration of Linguistic Databases.

### From the field to the web to the publication, and back again

A common part of any systematic study of a linguistic system is the elicitation of paradigms. (A paradigm is broadly construed to include any kind of rational tabulation of words or phrases to illustrate contrasts and systematic variation.) Early versions of a paradigm may take form in a linguist's field notes. Figure 1 shows part of a paradigm that was created as part of my fieldwork in Cameroon:

Figure 1: Part of a Verb Paradigm for Bamileke Dschang

Each cell in this table contains an utterance, with a tone transcription above and a preliminary tone analysis underneath. This is a small fragment of a much larger, 8-dimensional paradigm: 3 speakers × 9 tenses × 5 moods × 4 subject noun tones × 2 subject noun classes × 2 verb tones × 2 object prefixes × 4 object tones. This larger paradigm does not need to be fully populated (for example, there is no interaction between subject and object tone, so they do not need to be co-varied). Systematic exploration of this complex tone paradigm was only possible once the data was stored online. Prior to that, most of the time was spent recopying various slices of the paradigm alongside the analysis work. The storage format for an individual entry is shown in Figure 2. This line-based ASCII format with alphanumeric backslash codes is known as SIL Standard Format, and is used by the Shoebox program amongst others (see www.sil.org/computing/shoebox.html).
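As a quick check on the scale of the full paradigm described above, the following Python sketch multiplies out the eight dimensions. The dimension sizes are taken from the text; the dictionary keys are just labels.

```python
from math import prod

# Sizes of the eight dimensions of the full paradigm, as listed in the text.
dimensions = {
    "speakers": 3,
    "tenses": 9,
    "moods": 5,
    "subject noun tones": 4,
    "subject noun classes": 2,
    "verb tones": 2,
    "object prefixes": 2,
    "object tones": 4,
}

# Number of cells if every combination were elicited.
full_size = prod(dimensions.values())
print(full_size)
```

A fully populated paradigm would thus have 17,280 cells, which gives a sense of why recopying slices by hand consumed so much time.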

Figure 2: The Online Version of a Paradigm Entry

```
\re 0091                              # reference number
\va                                   # validations still required
\sp pn                                # speaker id
\tn f1                                # tense (future 1)
\md i                                 # mood (interrogative)
\au OH1                               # sound file basename
\ts L                                 # lexical tone on subject noun
\cl 1                                 # noun class of subject noun
\tv H                                 # lexical tone on verb
\op y                                 # object has a prefix? (y/n)
\to L                                 # lexical tone on object noun
\tr efO kapte menzwi?                 # ASCII transcription
\pi 3  3 2   2   2  2    4 5          # pitch transcription
\se e fO a  kap te men zwi i          # segment tier
\as |  | -   |   -  |    | |          # associations
\t  L  L -   H   -  L    L L          # tone tier
```
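A record in this format is straightforward to read programmatically. The Python sketch below is one possible reader, not the code used by the paradigm system; it assumes that each line starts with a backslash code and that the trailing `# ...` text in Figure 2 is annotation for the reader rather than part of the field value.

```python
import re

# A few lines of the Figure 2 entry, including its "# ..." annotations.
RECORD = r"""
\re 0091                              # reference number
\va                                   # validations still required
\sp pn                                # speaker id
\tn f1                                # tense (future 1)
\tr efO kapte menzwi?                 # ASCII transcription
"""

# Assumption: a marker is a backslash followed by alphanumerics.
LINE = re.compile(r"\\(\w+)(.*)")


def parse_record(text):
    """Return a dict mapping each backslash code to its field value."""
    fields = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        code, value = match.groups()
        # Assumption: whitespace followed by "#" starts an annotation.
        value = re.sub(r"\s+#.*$", "", value).strip()
        fields[code] = value  # a repeated code would overwrite the earlier one
    return fields


entry = parse_record(RECORD)
print(entry["tr"])
```

Because the values are stripped, this sketch does not preserve the column alignment between the pitch, segment, association and tone tiers; a reader that needs the alignment would have to keep the original character offsets.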

When viewed with a web browser via the paradigm system interface, this entry is formatted as shown in Figure 3. The ASCII transcription is converted into a series of gifs, and the sound file basename is converted into a pair of hyperlinks (to the audio and laryngograph recordings) and an image tag for the pitch trace. Vertical marks are inserted into the segment tier, corresponding to Hyman's prosodic domain boundaries [7]. (This display can be regenerated with this query.)

Figure 3: Browser View of a Paradigm Entry

A full interactive interface to this paradigm system is available at www.ldc.upenn.edu/cgi-bin/sb/paradigm/paradigm.cgi. The system is currently tailored to the database schema of Figure 2, but a customizable version will be developed and openly distributed in 2000.

Note the use of IPA symbols in Figure 3. This is actually the recognized orthography of the language, and the symbols were created using a CGI script: www.ldc.upenn.edu/cgi-bin/sb/ipagif/ipagif.cgi.

An important feature of the system is the way that it decouples (i) search terms; (ii) display axes; and (iii) displayed fields. One can specify the items of interest using one set of fields, categorize the items for display according to another set of fields (the axes), and then display yet another set of fields (or a summary statistic) in each cell of the resulting table. This ability to generate useful tabulations does not remove the need to work with pencil and paper. On the contrary, by paying the overhead of entering the data online, the analyst saves much time later on, being able to print out a wide variety of tabulations in the search for elusive patterns.
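The separation of search terms, display axes and displayed fields can be sketched as a small cross-tabulation routine in Python. The records reuse the field codes of Figure 2, but their values (including the codes `p1` and `d`) are invented, and the function is an illustration of the idea rather than the system's own code.

```python
from collections import defaultdict

# Hypothetical records using the field codes of Figure 2; values are invented.
RECORDS = [
    {"re": "0001", "tn": "f1", "md": "i", "tv": "H", "tr": "form-a"},
    {"re": "0002", "tn": "f1", "md": "d", "tv": "H", "tr": "form-b"},
    {"re": "0003", "tn": "p1", "md": "i", "tv": "H", "tr": "form-c"},
    {"re": "0004", "tn": "p1", "md": "i", "tv": "L", "tr": "form-d"},
]


def crosstab(records, where, rows, cols, show):
    """Filter by `where`, classify by the two axes, display the `show` fields."""
    table = defaultdict(list)
    for rec in records:
        if all(rec.get(field) == value for field, value in where.items()):
            key = (tuple(rec[f] for f in rows), tuple(rec[f] for f in cols))
            table[key].append(tuple(rec[f] for f in show))
    return dict(table)


# Search on verb tone, tabulate tense against mood, show reference and form.
table = crosstab(RECORDS, where={"tv": "H"}, rows=["tn"], cols=["md"],
                 show=["re", "tr"])

# A summary statistic per cell instead of the items themselves.
counts = {cell: len(items) for cell, items in table.items()}
```

Each of the three roles takes its own list of fields, so the same selection can be re-tabulated along different axes without changing the search.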

Once an analysis is completed, fragments of such paradigm data may be reported in published articles. The web interface generates both HTML and LaTeX output, so once one is happy with the layout of a table it can be saved into a LaTeX document. Figure 4 gives such an example, taken from [4]:

Figure 4: The Published Version of a Paradigm

This charts the path of data items from field notes through to publication. However, the other direction is no less important. A reader of the published version may want to know where a particular item or summary statistic came from (known in the database literature as its provenance or lineage). Here is one way in which the 'data trail' can be followed.

Suppose that, in addition to the published article, a web version was made available (possibly of just the data displays). An example of such a web page for [4] is at www.ldc.upenn.edu/sb/fieldwork/. As the reader works through the printed article they can listen to the data items and check the reported transcriptions. Figure 5 shows the web version of Figure 4:

Figure 5: The Web Version of the Published Dataset

In the interactive version of this display, each data item (IPA and tone transcription) is clickable, and one can hear digitized audio and laryngograph recordings. Below the tabulation (the enlarged version) are hyperlinks for embedded queries; these reproduce the tabulation from the underlying database, and document the relationship between the data tabulation and the database (a relationship which is not always recoverable from the database and tabulation alone). By relaunching the query (with modifications if so desired), the reader can access 2-3 orders of magnitude more data than could have been published. The reader can see how well the reported findings generalize, and test out alternative analyses.
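One way to picture an embedded query is as a URL whose query string carries the field constraints, so that the link both reproduces a tabulation and records how it was derived. The Python sketch below illustrates that idea only: the endpoint `example.org` and the parameter names are hypothetical, and the actual CGI interface may encode queries differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical endpoint and parameter names; the real interface may differ.
BASE = "https://example.org/cgi-bin/paradigm.cgi"


def query_link(base, **fields):
    """Encode field constraints as a URL that can be embedded in a document."""
    return base + "?" + urlencode(sorted(fields.items()))


def modify_link(link, **changes):
    """Return a copy of an embedded query with some constraints replaced."""
    fields = dict(parse_qsl(urlsplit(link).query, keep_blank_values=True))
    fields.update(changes)
    return query_link(link.split("?", 1)[0], **fields)


# A link citing one slice of the paradigm, and a broadened version of it.
cited = query_link(BASE, tn="f1", md="i", display="tr,pi")
broader = modify_link(cited, md=".*")
print(cited)
print(broader)
```

Relaunching a query with modifications then amounts to editing one constraint and following the new link.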

In this way, online documents can cite individual data items or whole tables, simply by the use of embedded query expressions. The overall process reported here has a number of components: data collection, uploading, database construction, interface construction, and building an analysis which incorporates links to the dataset. In ongoing work I am investigating ways to facilitate the construction of reusable, citable linguistic databases as an integral part of fieldwork activities.