Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.
Abstract. Creating language data has always meant a large effort, both in collecting the raw data and in analyzing it. Morphologically analyzed interlinear text is one example: without the assistance of the computer, consistently and accurately analyzing and glossing texts is virtually impossible. This paper proposes a methodology not only for the creation of consistently analyzed interlinear texts, but also for building up a linguistically motivated description, in the form of a lexicon and morphological grammar. The approach outlined here may be used by semi-trained linguists on small text corpora, and is therefore applicable to lesser-known languages.
Languages for which no adequate computer processing is being developed, risk gradually losing their place in the global Information Society, or even disappearing, together with the cultures they embody, to the detriment of one of humanity’s great assets: its cultural diversity.
- Zampolli and Varile, Foreword to the Survey of the state of the art in human language technology. (1997: xvi)
The information processing equipment, for its part, will convert hypotheses into testable models and then test the models against data (which the human operator may designate roughly and identify as relevant when the computer presents them for his approval). The equipment will answer questions. It will simulate the mechanisms and models, carry out procedures, and display the results to the operator... In general, it will carry out the routinizable, clerical operations that fill the intervals between decisions. Finally, it will do as much diagnosis, pattern matching, and relevance recognizing as it profitably can, but it will accept a clearly secondary status in those areas.
- J.C.R. Licklider, "Man-Computer Symbiosis." (1960)
The creation of language data can be a tedious and even onerous task, with the possibility of errors creeping in at any point in the process. This is true not only at the stage of initial transcription of data, but also at the stage where the data is marked up with linguistic tags and glosses.
A common kind of markup is interlinear text, in which the original text is displayed on one line, and the tags and glosses are displayed in one or more additional lines, more or less vertically aligned with the text of the original line. This paper will concentrate on one genre of interlinear text, namely morphologically analyzed text. A morphologically analyzed interlinear text will have, in addition to the original text, one or more lines, each displaying one or more of the following sorts of data: morpheme boundaries or bracketing, often represented by breaking the original word into allomorphs; underlying forms of morphemes; glosses for those morphemes in some language of description; and a more or less literal translation of each word in the original text into the language of description.
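To make the tiered layout concrete, the sketch below models an interlinear example as a small data structure and renders the vertically aligned lines. The Swahili word and its analysis are a standard textbook illustration; the tier names and the function itself are our own invention, not part of the program the paper describes.

```python
# A minimal sketch of morphologically analyzed interlinear text as a data
# structure: one dict per word, with a tier for the original text, the
# morpheme breakdown, and the morpheme glosses.

def render_interlinear(words):
    """Render a list of analyzed words as vertically aligned tiers."""
    tiers = []
    for tier in ("text", "morphemes", "glosses"):
        tiers.append([w[tier] for w in words])
    # Each column is padded to the width of its widest cell.
    widths = [max(len(t[i]) for t in tiers) for i in range(len(words))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(tier, widths))
        for tier in tiers
    )

example = [
    {"text": "ninasoma", "morphemes": "ni-na-som-a",
     "glosses": "1SG-PRES-read-FV"},
    {"text": "kitabu", "morphemes": "kitabu", "glosses": "book"},
]
print(render_interlinear(example))
print("'I am reading a book'")  # free-translation line
```

A real tool would of course add further tiers (underlying forms, part-of-speech tags) and smarter alignment, but the three-tier structure above is the core of the genre the paper discusses.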
There are numerous opportunities for error or inconsistency in interlinear glossing. Two of the major problems are inconsistent decisions as to morpheme boundaries, which lead to inconsistent spelling of allomorphs, and inconsistent glossing of morphemes. To some extent, these issues have been addressed by existing computer programs. These interlinear text processing programs typically incorporate a lexicon of morphemes, a parser, and some mechanism to reduce the ambiguity which usually arises in languages with more than trivial morphologies. The ambiguity reduction mechanism can be viewed as a sort of grammar; whether that grammar is something a linguist would be proud of (or even understand) is a different question. Ad hoc solutions are sometimes the easiest to build into the interlinear text processor; ad hoc analyses may also be the easiest for the field linguist to deal with initially, since in the early stages, it is often unclear what a linguistically accurate analysis would be - and indeed, the purpose of doing interlinear text processing is often to gain insight into the linguistic structure of the language. Too often this means that the morphological analysis never moves past the ad hoc stage; and if and when the analysis becomes more linguistically motivated, there is no way to test it or debug it on the computer, since the interlinear text processor is unable to make use of an analysis in that form.
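The three components just mentioned - a morpheme lexicon, a parser, and the ambiguity that the third component must tame - can be illustrated with a toy sketch. This is not the paper's actual program; the lexicon entries are invented for demonstration, and the deliberately naive parser returns every segmentation, showing where ambiguity arises.

```python
# A toy morpheme lexicon and exhaustive parser. An allomorph may carry
# several glosses (here "na"), which is one source of parse ambiguity.

LEXICON = {
    "ni": ["1SG"],
    "na": ["PRES", "with"],
    "som": ["read"],
    "a": ["FV"],
}

def parse(word):
    """Return every segmentation of `word` into lexicon morphemes,
    as lists of (form, gloss) pairs -- typically more than one."""
    if word == "":
        return [[]]
    analyses = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in LEXICON:
            for gloss in LEXICON[prefix]:
                for rest in parse(word[i:]):
                    analyses.append([(prefix, gloss)] + rest)
    return analyses

for analysis in parse("ninasoma"):
    print("-".join(f for f, _ in analysis), "=",
          "-".join(g for _, g in analysis))
```

Even this four-entry lexicon yields two analyses of one word; with a realistic lexicon and a nontrivial morphology the ambiguity multiplies, which is exactly why the constraint machinery discussed below matters.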
To be sure, there are parsers which can be used for interlinear glossing at the morphological level, which accept reasonably linguistically motivated grammars. But too often their use requires the linguist to have a well-understood grammar in his head before starting, while at the same time forcing him to reconstruct that grammar in the particular notation which the parser understands. Thus, these parsers are not a solution which can be used during the beginning stages of language analysis.
This paper proposes a method for seamlessly interweaving the production of morphologically glossed interlinear text with the production of a linguistically motivated morphological grammar. The same computer program suffices for all stages of analysis, from simple hand markup with minor assistance from the parser, to the stage at which the parser produces nearly all the markup, and the human needs to disambiguate only those words whose correct analysis must be decided on the basis of context. Moreover, the output - assuming the linguist perseveres through the later stages of analysis - is not only interlinear text, but a lexicon and a linguistically motivated morphological grammar, which has been tested and debugged against real data.
The machinery required to implement this methodology consists of the following parts:
What is perhaps unique about this machinery is not the modules themselves (most interlinear parsing programs provide at least modules 1–3, and many provide module 4 as well), but rather the range of grammatical properties and grammatical constraints which can be used to constrain parsing, and the way in which complexity can be hidden from the user who is not ready to deal with it.
In order to support the end goal of a (reasonably) linguistically sophisticated morphological analysis, the grammar and lexicon together implement much of the power of lexical morphology and phonology. For instance, there is a notion of strata of rules and affixes; affixes can be described either as items or as processes; parts of speech can be assigned paradigms, and there is provision for distinguishing among inflectional classes; and allomorphy can be described in terms of underlying forms together with phonological rules which modify those forms. But at the same time, it is entirely possible to constrain the morphology using less sophisticated notions, such as allomorphs which are listed in the lexicon (rather than derived by phonological rule), and whose appearance is governed by phonological constraints. At an even less sophisticated stage of analysis, it is possible to say simply that two allomorphs cannot appear in the same word (or that they cannot appear next to each other in the same word, perhaps in a certain order). The ability to constrain the grammar throughout the range of sophistication, and even to include in a single analysis some constraints which are linguistically apt and others which are ad hoc, is at the heart of this methodology.
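The least sophisticated end of that range - bare statements that two allomorphs cannot co-occur, or cannot be adjacent - can be sketched as simple filters over candidate parses. The allomorph names and constraints below are invented placeholders, assumed only for illustration.

```python
# A sketch of ad hoc co-occurrence constraints used to filter parses
# before a full phonological analysis exists.

def not_in_same_word(a, b):
    """Constraint: allomorphs a and b never co-occur in one word."""
    def check(morphemes):
        return not (a in morphemes and b in morphemes)
    return check

def not_adjacent(a, b):
    """Constraint: allomorph a is never immediately followed by b."""
    def check(morphemes):
        return all(not (x == a and y == b)
                   for x, y in zip(morphemes, morphemes[1:]))
    return check

def filter_analyses(analyses, constraints):
    """Keep only the analyses that satisfy every constraint."""
    return [m for m in analyses
            if all(check(m) for check in constraints)]

candidates = [["ni", "na", "som", "a"],
              ["ni", "ka", "som", "a"],
              ["ni", "na", "ka", "som", "a"]]
constraints = [not_in_same_word("na", "ka")]
print(filter_analyses(candidates, constraints))
```

In the methodology described here, such throwaway constraints would coexist with, and gradually be replaced by, linguistically apt ones (phonological rules, inflection classes), all within the same analysis.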
Hiding complex capabilities from the user who is not ready for them is another key component of this approach. This is more than simply not requiring the user to fill in properties which are irrelevant for a particular language (such as inflection classes in a language for which all words of a given part of speech belong to the same inflection class). A good example of what we have in mind is the morphosyntactic feature system. A typed feature system can be intimidating, so in our approach it is hidden behind the grammatical glossing system. This will be illustrated in the presentation, but for now an example may help: the gloss '1' (for 'First Person') can be seen as an abbreviation for a feature like [person 1], or even a set of features such as [+Speaker –Hearer]. (Given an appropriate context, the gloss can be understood as an abbreviation for such a feature embedded inside a feature for subject, object, or possessor agreement.) Unlike a feature system, the use of grammatical glosses should be familiar to anyone glossing interlinear text. The proposal is thus to attach predefined feature values to glosses, thereby hiding the feature system until such time as the user wishes to deal with it directly.
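The idea of glosses as abbreviations for feature values can be sketched as a lookup table that expands a familiar gloss line into a feature specification. Following the paper's '1' example, the particular feature names below are illustrative assumptions, not the system's actual inventory.

```python
# A sketch of attaching predefined feature values to glosses, so that a
# hyphenated gloss line doubles as a (hidden) feature specification.

GLOSS_FEATURES = {
    "1":  {"person": 1},          # or e.g. {"Speaker": "+", "Hearer": "-"}
    "2":  {"person": 2},
    "SG": {"number": "singular"},
    "PL": {"number": "plural"},
}

def features_for(gloss_string):
    """Expand a gloss string like '1-SG' into a merged feature set;
    glosses with no feature mapping (e.g. lexical glosses) are skipped."""
    features = {}
    for gloss in gloss_string.split("-"):
        features.update(GLOSS_FEATURES.get(gloss, {}))
    return features

print(features_for("1-SG"))
```

The user keeps writing ordinary glosses; the feature system operates behind the scenes and surfaces only when the user chooses to inspect or edit the table directly.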
During the talk, the capability of describing the grammar at varying levels of sophistication, and the capability of hiding unwanted detail, will be illustrated by mockups. The methodology will then be contrasted with other approaches to acquiring computer-usable morphological grammars.
The method described here should be viewed as a mutually assisted discovery process in which the linguist and the program work together. As such, it seems well-suited to minority languages, particularly where there are no highly trained native speaker linguists, and where corpora may initially be either small or non-existent.
At the same time, this technique can be seen as a training program for the field linguist, who is actively involved at each step. As a result of this human-machine interaction, human developers should be better able to make use of and extend the grammar than if it had been produced directly by the computer, without human intervention.