Paper presented at the workshop on
Web-Based Language
Documentation and Description
12-15 December 2000, Philadelphia, USA.
Abstract. The paper discusses possible ways to make the linguistic information provided by interlinear morpheme translation (IMT) more accessible to users who are not familiar with the language documented in a glossed corpus. The core of the proposed user interface is a concordance of grammatical morphemes, accessible in both form-to-gloss and gloss-to-form modes. This basic means of grammatical querying can be extended in two ways, with minimal additional descriptive effort. First, I suggest a simple semi-automatic method to provide essential contextual information for each morpheme, which would allow the user to launch relatively complex grammatical queries. Secondly, the grammatical glosses employed in the IMT can be integrated into a net of "grammatical classifiers", so that the language-specific morphological meanings are classified in terms of a more universal (hence, generally understandable) system of linguistic concepts. It is shown that the same formal device can be used to introduce language phenomena that are not directly manifested in the IMT but can be elicited by complex grammatical queries (syntactic structures, meanings signaled by combinations of morphemes, etc.).
Interlinear morpheme translation (IMT) is an indispensable and rather powerful tool for transparent presentation of linguistic data. In many languages, it renders other types of interlinear tagging (parts of speech, syntactic categories, etc.) superfluous, since all these parameters are uniquely determined by morphemic structures. Put differently, for a variety of grammatical phenomena, an appropriate IMT provides enough information to find all instances of a phenomenon in the corpus. However, a successful grammatical query can only be launched if the user (a) is familiar with the grammatical metalanguage adopted in the corpus and (b) has at least some knowledge of how the grammatical phenomenon being investigated can be manifested in the given language (i.e. a successful query must be formulated in language-specific terms). Yet (a) an adequate IMT of a previously undocumented language is likely to require some non-standard grammatical labels, and (b) the users of a web-based corpus will often have little (or no) previous knowledge of the language (especially if this language is poorly documented, as is the case for the vast majority of the languages of the world). As a result, most of the expressive power of IMT is lost for the majority of potential users.
Recent years have seen some attempts to address this problem (referred to below as the accessibility problem) in terms of a standard computer-based descriptive framework. My experience, both as a descriptivist and as a participant in one of these attempts (Lehmann 1998), has convinced me that this approach is practically unfeasible and theoretically wrong, at least at the present time. Most importantly, the adequacy of IMT should not be compromised for the sake of accessibility; after all, any corpus of language data is likely to outlive any specific descriptive framework. It goes without saying that a web-based repository of linguistic corpora will benefit from a standard glossary of grammatical terms (including conventional labels to be used in IMT), but such a glossary should be viewed as an aid to descriptive linguists, not as an obligation to use these and only these labels. From the practical point of view, the requirements imposed by a standard descriptive framework prove to be virtually prohibitive for potential contributors of language data: in effect, it is required that any glossed corpus be linked to a comprehensive and strictly structured grammatical description of the language.
These considerations suggest that the accessibility problem has to be solved by means of an efficient "user interface" to IMT, rather than by standardization of the underlying linguistic description. The core of the interface proposed in the present paper is the morphemic index (MI), i.e. a list of glossed morphs linked to their occurrences in the corpus, accessible in both form-to-gloss and gloss-to-form modes. I will suggest two ways to enhance the efficiency of the MI as a means of accessing unfamiliar language data: one is based on the concept of "relevant context" (Section 2), the other on the concept of "grammatical classifier" (Section 3). All examples in the body of the paper are from the Yukaghir languages of Northern Siberia.
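To make the idea of a morphemic index concrete, the following is a minimal sketch in Python, not a description of any existing implementation. It assumes that a glossed corpus is stored as a list of word forms, each a list of (morph, gloss) pairs. The morphs are invented placeholders rather than real language data, and the name `build_morphemic_index` is hypothetical.

```python
from collections import defaultdict

# Toy glossed corpus: each word form is a list of (morph, gloss) pairs.
# The morphs are invented placeholders, not real language data.
CORPUS = [
    [("kat", "walk"), ("-u", "PL")],
    [("mel", "stone"), ("-u", "PL")],
    [("kat", "walk"), ("-si", "FUT")],
]

def build_morphemic_index(corpus):
    """Return two concordances: form-to-gloss and gloss-to-form.

    Each maps a key to {counterpart: [numbers of the word forms
    in which the pair occurs]}.
    """
    form_to_gloss = defaultdict(lambda: defaultdict(list))
    gloss_to_form = defaultdict(lambda: defaultdict(list))
    for word_no, word in enumerate(corpus):
        for morph, gloss in word:
            form_to_gloss[morph][gloss].append(word_no)
            gloss_to_form[gloss][morph].append(word_no)
    return form_to_gloss, gloss_to_form

form_to_gloss, gloss_to_form = build_morphemic_index(CORPUS)
print(dict(form_to_gloss["-u"]))    # form-to-gloss lookup
print(dict(gloss_to_form["FUT"]))   # gloss-to-form lookup
```

In a real corpus the references would presumably point to sentences or text lines rather than to bare word numbers; the word number is used here only to keep the sketch short.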
In its "plain" form, an MI is just a concordance of morphemes, i.e. it provides direct links from a morpheme to all its instances in the corpus. My first proposal is to include an intermediate level of "relevant contexts". In this case, each morpheme is linked to a list of its possible contexts, and text references are provided for each context separately. This structure of the MI is intended to enhance the efficiency of the interface in two ways:
First, the list of relevant contexts itself often constitutes an important piece of linguistic evidence (cf., for instance, the list of all compatible verb stems for an aspect or Aktionsart marker). As a side effect, this information is often sufficient to clarify the intended meaning of a non-standard (or ambiguous) grammatical label, or at least to show the user whether the morpheme manifests a phenomenon (s)he is interested in. For example, "Frequentative" can be used in at least two different senses: for an Aktionsart marker, or for a morpheme deriving adverbs like "once", "twice", etc. from numerals and other quantifiers; the list of all compatible stems automatically resolves this ambiguity. In a similar way, "Plural" may be used to gloss a nominal plural marker or a cross-reference marker on the verb, yet in most cases such meanings are distinguished by the immediate morphological context.
Secondly, this list allows the user to select various interesting subsets of text instances of the morpheme in a (more or less) informed way. This partly compensates for the obvious drawback of IMT with regard to the accessibility problem, namely, its context-independence (EUROTYP Guidelines, 3.1:1). The point is that if a polysemous morpheme is always rendered by the same gloss, there is no straightforward way to select only those text instances that are associated with one specific sub-meaning. The list of relevant contexts provides a classification of text instances into linguistically significant subsets and thereby indirectly resolves polysemy.
The latter statement requires some qualification. It is of course not always the case that the actual meaning of a morpheme is determined by transparent and easily identifiable parameters of the context. In some cases, the only way to make it possible to select a certain subset of instances automatically is to resolve polysemy in the IMT, that is, to use distinct glosses for different sub-meanings. Interestingly, the EUROTYP Guidelines contain a notable exception to the principle of context-independence: different glosses of the same morpheme are allowed if they are included in the list of conventional grammatical labels (as provided in the same source), that is, in effect, if the community of potential users has already recognized that they may wish to look at how the given meaning is expressed in various languages and has established a "conventional label" for the given linguistic phenomenon. Based on the general considerations outlined in the previous section, I believe that such exceptions should be determined by language-specific considerations, rather than by any pre-defined list of conventional labels. Consider the following example:
Example 1. Yukaghir has a morpheme that can serve either as an Action Nominalizer or as the verbal marker of the Subject-Focus construction. Although the actual function of the suffix can be identified in any given context, it is hardly possible to formalize the relevant contextual features. Even for a language expert using a powerful query language, it will be helpful if these uses are distinguished by means of different glosses (although there seems to be no standard gloss for the "Subject-Focus" function).
The notion of "relevant context" as an element of the user interface is intended for more transparent and straightforward cases of context-dependencies, as in the following example:
Example 2. Yukaghir has a suffix that can build either an adnominal verb form ("active participle") or a finite form of the intransitive conjugation. The actual function of the morpheme in question is determined by an easily identifiable morphological context: in the former case, it is the final morph of a word form; in the latter, it must be followed by a cross-reference marker. If these contexts are distinguished in the MI, the text instances of each sub-meaning can be easily accessed.
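The distinction drawn in Example 2 can be sketched as a simple classification of text instances by morphological context. This is an illustrative sketch only: the gloss label "INTR" for the suffix, the set of cross-reference glosses, the function name `classify_by_context`, and the placeholder morphs are all assumptions made for the example, not data from the corpus.

```python
from collections import defaultdict

# Assumed gloss labels: "INTR" stands for the polysemous suffix, and
# CROSS_REFERENCE is a hypothetical set of cross-reference glosses.
CROSS_REFERENCE = {"1SG", "2SG", "3SG", "1PL", "2PL", "3PL"}

# Invented placeholder morphs, not real language data.
CORPUS = [
    [("kat", "walk"), ("-j", "INTR")],                 # suffix is word-final
    [("kat", "walk"), ("-j", "INTR"), ("-k", "1PL")],  # cross-reference follows
    [("mel", "sit"), ("-j", "INTR")],                  # suffix is word-final
]

def classify_by_context(corpus, gloss):
    """Group the word forms containing `gloss` by its morphological context."""
    contexts = defaultdict(list)
    for word_no, word in enumerate(corpus):
        glosses = [g for _, g in word]
        for pos, g in enumerate(glosses):
            if g != gloss:
                continue
            if pos == len(glosses) - 1:
                contexts["word-final"].append(word_no)
            elif glosses[pos + 1] in CROSS_REFERENCE:
                contexts["before cross-reference"].append(word_no)
            else:
                contexts["other"].append(word_no)
    return dict(contexts)

contexts = classify_by_context(CORPUS, "INTR")
print(contexts)
```

Under these assumptions, the "word-final" subset would collect the adnominal uses and the "before cross-reference" subset the finite uses, so that each sub-meaning can be reached without changing the gloss itself.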
Context-dependent polysemy can be viewed as an instance of a more general class of linguistic phenomena that require an essentially similar type of information to be accessible for IMT-based grammatical queries: this class comprises all cases where a certain meaning is signaled by a combination of morphemes (rather than by a single morpheme).
In the general case, one morpheme can be associated with several classifications of contexts that can be thought of as "relevant" from different points of view. However, my experiments with various types of data have convinced me of the overwhelming significance of the traditional distinction between derivation and inflection, and hence between two major types of "relevant contexts", lexical ("stem") and grammatical ("inflection"). Roughly speaking, the lexical contexts are relevant for derivational morphemes, while the grammatical contexts are relevant for inflectional morphemes. However, in order to avoid the theoretical problems entailed by this classification, morphemes for which lexical contexts are relevant will be referred to as D-affixes, and those for which grammatical contexts are relevant as I-affixes. Note that, according to this definition, a single affix can belong to both classes (if both types of contexts are considered by a language expert to be relevant, i.e. worth distinguishing in the MI), or even to neither of them. The basic version of my proposal is just to provide the list of lexical contexts for each D-affix and the list of grammatical contexts for each I-affix.
In order to formulate this proposal in a more precise fashion, it will be convenient to introduce the notion of inflection chain, or I-chain for short. The I-chain of a word form is the (possibly discontinuous) string of glossed morphemes that determines the grammatical class of the word form (in other words, the string comprising all inflectional morphemes). All morphemes that do not belong to the I-chain of the given word form constitute its stem. The MI must show the user the list of all compatible stems for each D-affix and the list of all compatible I-chains for each I-affix; from this list, the user can select the contexts (s)he is interested in, and hence the corresponding subset of text instances of the morpheme. The empirical observation behind this proposal is that most (if not all) relevant parameters of the context reside within the I-chain for inflectional affixes and within the stem for derivational affixes.
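The split of a word form into stem and I-chain, and the resulting lists of contexts, can be sketched as follows. The sketch assumes the simplest possible rule: a morpheme belongs to the I-chain if and only if its gloss is on a pre-defined list of inflectional meanings. The gloss lists, the placeholder morphs, and the names `split_word` and `context_index` are hypothetical. Treating the stem context of a D-affix as the stem without the affix itself is likewise a simplifying assumption.

```python
from collections import defaultdict

# Assumed lists supplied by a language expert (hypothetical labels).
INFLECTIONAL = {"FUT", "3SG", "PL", "LOC"}   # glosses of I-affixes
DERIVATIONAL = {"ITER"}                      # glosses of D-affixes

# Invented placeholder morphs, not real language data.
CORPUS = [
    [("kat", "walk"), ("-ni", "ITER"), ("-si", "FUT")],
    [("mel", "sit"), ("-ni", "ITER"), ("-si", "FUT"), ("-k", "3SG")],
    [("kat", "walk"), ("-k", "3SG")],
]

def split_word(word):
    """Split a word form into (stem, I-chain), both as tuples of glosses."""
    i_chain = tuple(g for _, g in word if g in INFLECTIONAL)
    stem = tuple(g for _, g in word if g not in INFLECTIONAL)
    return stem, i_chain

def context_index(corpus):
    """Map each affix gloss to {context: [word numbers]}.

    I-affixes are indexed by the I-chain of the word form,
    D-affixes by the rest of the stem they occur in.
    """
    index = defaultdict(lambda: defaultdict(list))
    for word_no, word in enumerate(corpus):
        stem, i_chain = split_word(word)
        for gloss in set(i_chain):
            index[gloss][i_chain].append(word_no)
        for gloss in set(stem) & DERIVATIONAL:
            rest = tuple(g for g in stem if g != gloss)
            index[gloss][rest].append(word_no)
    return index

index = context_index(CORPUS)
print(dict(index["FUT"]))    # I-chains compatible with FUT
print(dict(index["ITER"]))   # stems compatible with ITER
```

The user would then pick one of the listed contexts and follow its references, which corresponds to selecting a subset of the text instances of the morpheme.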
Note that it will be quite easy for a language expert to define the set of all possible I-chains, and hence to single out the I-chain of each word form automatically. In the simplest case, this set is determined just by the list of inflectional (grammatical) meanings expressed in the given language. More specifically, the following simple rule applies in the majority of cases: if the gloss of an affix belongs to the pre-defined list of inflectional meanings, then this affix belongs to the I-chain of a word form. It is possible, however, that a morpheme can function both as an I-affix and as a D-affix, depending on its grammatical context. This situation can be illustrated by the following example:
Example 3. In Yukaghir, the Ablative marker can be attached either to a closed class of spatial adverbs and postpositions or to the Locative form of nominals. In the former case, the information on compatible stems is linguistically significant; hence, it is a D-affix. In combination with the Locative marker, the same affix just builds the Ablative case form, compatible with any nominal stem; in this grammatical context, it is an I-affix.
It is clear that such cases are also easy to handle automatically. The corresponding entry of the MI falls into two sub-entries: one shows all I-chains that include the given affix as an I-affix, the other shows all stems that are compatible with this affix in other grammatical contexts.
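A context-dependent case of this kind can be handled by adding one special rule to the general list-based rule. The sketch below is a rough illustration under stated assumptions: the Ablative is treated as an I-affix only when it immediately follows the Locative marker (one possible formalization of "in combination with the Locative marker"), and the gloss labels, placeholder morphs, and function names are hypothetical.

```python
from collections import defaultdict

INFLECTIONAL = {"LOC", "PL"}   # assumed list of inflectional glosses

# Invented placeholder morphs, not real language data.
CORPUS = [
    [("pa", "house"), ("-ge", "LOC"), ("-t", "ABL")],  # nominal: LOC + ABL
    [("mi", "here"), ("-t", "ABL")],                   # spatial adverb + ABL
    [("to", "there"), ("-t", "ABL")],                  # spatial adverb + ABL
]

def is_inflectional(glosses, pos):
    """General rule plus one special case for the Ablative marker."""
    if glosses[pos] == "ABL":
        # Assumption: ABL is inflectional only directly after LOC.
        return pos > 0 and glosses[pos - 1] == "LOC"
    return glosses[pos] in INFLECTIONAL

def ablative_entry(corpus):
    """Build the two sub-entries of the MI entry for ABL."""
    entry = {"I-affix": defaultdict(list), "D-affix": defaultdict(list)}
    for word_no, word in enumerate(corpus):
        glosses = [g for _, g in word]
        flags = [is_inflectional(glosses, i) for i in range(len(glosses))]
        for pos, gloss in enumerate(glosses):
            if gloss != "ABL":
                continue
            if flags[pos]:
                # Context: the I-chain of the word form.
                chain = tuple(g for g, f in zip(glosses, flags) if f)
                entry["I-affix"][chain].append(word_no)
            else:
                # Context: the stem, without the affix itself.
                stem = tuple(g for i, g in enumerate(glosses)
                             if not flags[i] and i != pos)
                entry["D-affix"][stem].append(word_no)
    return entry

entry = ablative_entry(CORPUS)
print(dict(entry["I-affix"]))
print(dict(entry["D-affix"]))
```

The same pattern would extend to other affixes by collecting such special rules in a small table supplied by the language expert.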
The outlined solution has two major advantages. First, it involves minimal additional descriptive effort: given the list of inflectional meanings of the language (and, possibly, a list of special context-dependent cases), all the rest can be done automatically. Secondly, it will cover the vast majority of relevant contextual parameters. The latter feature can also be viewed as the major drawback: the resulting classification of contexts will be more detailed than needed for each specific task. Returning to Example 2, the possible I-chains will include a variety of affixes (temporal, modal, etc.) that are not required to maintain the distinction between the non-finite (adnominal) and finite uses of the morpheme. Notably, however, they cannot be viewed as something entirely irrelevant; on the contrary, the distribution of temporal and modal affixes within the list of possible contexts immediately reveals the relevant distinction, even in the absence of any additional information. This situation seems to be typical rather than exceptional, that is, the list of possible I-chains generally gives more information than might be expected. Nonetheless, it will be useful to enhance this basic solution with additional information on some contextual features and related semantic distinctions.
The IMT employs only a subset of the grammatical labels (or grammatical terms) that are significant for the given language. The term "grammatical classifiers" is intended to cover those grammatical terms that, roughly speaking, would appear in a grammar of the language, but not in its IMT. My second proposal is to integrate a net of grammatical classifiers into the user interface, i.e. to provide a list of grammatical classifiers which would be linked to the entries of the MI.
One important group of grammatical classifiers comprises names of classes (e.g. parts of speech) and names of categories (e.g. case or aspect). Such terms share two related features that determine their significance for the accessibility problem. On the one hand, they constitute natural "points of departure" for the user of a corpus. On the other hand, such terms are generally more universal (hence, "standard") than those needed to describe specific morphological meanings. Whereas the meaning of a specific morpheme must often be rendered by a non-standard grammatical gloss, a higher-level classification of such meanings can and should be based on a standard set of grammatical classifiers. For example, a language with a rich case system is likely to have at least some cases for which there are no appropriate standard terms, but the category as a whole can still be referred to as "case". It is of course also possible that a language will have a category for which there exists no appropriate standard term; such a category can then be linked to a more general standard term (e.g. modality). It seems that virtually all grammatical glosses can be linked (directly or indirectly) to generally understandable names of categories and/or word classes (cf. Lehmann 1996); these links can be automatically propagated to establish converse relations (that is, from the name of a category to all members of the category, or from the name of a class to all affiliated morphological meanings). The resulting net of grammatical classifiers can significantly increase the accessibility of a glossed corpus: first, such links integrate each non-standard label into a generally understandable system of coordinates and thus give the user strong hints of its intended sense; secondly, the user can access language data starting from natural and familiar "points of departure", e.g. by browsing the list of all cases or of all morphemes associated with demonstrative pro-forms.
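The automatic propagation of converse relations can be sketched as a traversal of a small graph of links. The links below are purely illustrative and make no claim about any particular language. The classifier names and the functions `ancestors` and `converse` are hypothetical.

```python
from collections import defaultdict

# Links from a gloss or classifier to the more general classifiers it
# belongs to. All links are illustrative assumptions.
LINKS = {
    "ABL": ["case"],
    "LOC": ["case"],
    "FUT": ["tense"],
    "INFR": ["evidentiality"],
    "evidentiality": ["modality"],   # a category linked to a more general term
}

def ancestors(term, links):
    """Return all classifiers reachable from `term`, directly or indirectly."""
    seen = []
    stack = list(links.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(links.get(current, []))
    return seen

def converse(links):
    """Propagate the links backwards: classifier -> sorted list of members."""
    members = defaultdict(set)
    for term in links:
        for classifier in ancestors(term, links):
            members[classifier].add(term)
    return {classifier: sorted(terms) for classifier, terms in members.items()}

net = converse(LINKS)
print(net["case"])       # all glosses classified as cases
print(net["modality"])   # direct and indirect members of "modality"
```

A user who starts from a familiar term such as "case" would thus arrive at the language-specific glosses, and from there at the corresponding MI entries.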
The net of grammatical classifiers can be extended to cover other grammatical phenomena which are not directly manifested in the IMT. There are at least three types of such phenomena: syntactic structures, meanings signaled by combinations of morphemes (rather than by single morphemes), and sub-meanings of polysemous morphemes. The major purpose of this extension is to provide a necessary minimum of information on how various grammatical phenomena are manifested in the given language. Consider the following example:
Example 4. In Yukaghir, the combination of the Inferential and Future markers within one verb form signals the hypothetical mood. The idea is to include the term "hypothetical" in the net of grammatical classifiers and to link it to the Inferential and the Future entries in the MI. Accordingly, the user will be automatically informed that these morphemes can be employed to express hypothetical meaning.
Similarly, a term like "relative clause" can be linked to certain pronouns and/or verbal affixes, thus giving a user interested in relativization a hint of what to look for in the corpus, i.e., ultimately, how to formulate a sensible grammatical query.
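A classifier of this extended kind can be sketched as a term linked to one or more combinations of glosses, which in turn suggests a query over the corpus. The link for "hypothetical" follows Example 4. The glosses "PTCP" and "REL" linked to "relative clause", the placeholder morphs, and the name `find_instances` are assumptions made for the sketch.

```python
# Each classifier is linked to alternative combinations of glosses.
CLASSIFIERS = {
    "hypothetical": [{"INFR", "FUT"}],        # both markers in one word form
    "relative clause": [{"PTCP"}, {"REL"}],   # hypothetical alternatives
}

# Invented placeholder morphs, not real language data.
CORPUS = [
    [("kat", "walk"), ("-l", "INFR"), ("-t", "FUT")],
    [("kat", "walk"), ("-t", "FUT")],
    [("mel", "sit"), ("-je", "PTCP")],
]

def find_instances(corpus, term, classifiers):
    """Return the numbers of word forms matching any combination for `term`."""
    hits = []
    for word_no, word in enumerate(corpus):
        glosses = {g for _, g in word}
        # A combination matches if all of its glosses occur in the word form.
        if any(combo <= glosses for combo in classifiers.get(term, [])):
            hits.append(word_no)
    return hits

print(find_instances(CORPUS, "hypothetical", CLASSIFIERS))
print(find_instances(CORPUS, "relative clause", CLASSIFIERS))
```

The match here only hints at where to look: whether a given hit actually expresses the meaning in question remains a matter for the user's own analysis.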