Reference Grammars for the Computational Age:
From Gleason Files to Sci-Fi Grammar

David J. Weber, Summer Institute of Linguistics

209 Lorraine Ave.; Syracuse, NY 13210, USA
david_weber@sil.org


Historical note

In 1974 at the outset of my study of Quechua, I had the good fortune to learn from Harold Gleason how to make and use an exhaustive morpheme concordance on paper slips. For several years thereafter I organized my Huallaga Quechua data, both texts and elicited sentences, around this file, building a solid corpus and learning from it.

In 1983 I defended my dissertation, a reference grammar of Huallaga (Huanuco) Quechua. During my defense I complained about having to describe a grammar by means of a book, and sketched what I thought a grammar should be in the computational age: an information management system at the heart of which would be a corpus somewhat like the one I had built on paper slips. These ideas were later published in Notes on Linguistics 33:28-38 (Summer Institute of Linguistics, 1986). What appears below is a slightly edited and shortened version of that paper.

In 1989 Gary Simons circulated a proposal to develop a Computational Environment for Linguistic and Literary Research (CELLAR). I had hoped to implement the Huallaga grammar in this environment but it never matured into a practical tool for that purpose. Having recently talked with Gary, I am now optimistic that CELLAR's offspring, SIL's “Santa Fe” initiative, will provide the environment for writing grammars as outlined below.

Introduction

A reference grammar is a collection of information about a language, organized in such a way that users can access that information. Reference grammars are an important class of documents for the following reasons: First, they serve a language community

  1. by helping the speakers appreciate the complexities of their language
  2. as a vehicle for those who wish to join or enrich the community by learning the language, and perhaps
  3. by providing a unifying force for the language by defining a norm (which may even be prescriptively imposed).
Second, reference grammars serve the linguistic community in that they
  1. collect and make available what is known about a language
  2. are thus the basis for many theoretical insights (for example, consider how often Otto Jespersen's work on English is cited as the source of some insight), and
  3. are particularly important to typologists searching for cross-linguistic generalizations.
Because reference grammars are enduring repositories of precious knowledge, they are of interest centuries (or even millennia) after they are written.

The quality of a reference grammar is in proportion to how accurately it characterizes the language. Its usefulness is inversely proportional to how much work it takes a user to find information from it. This paper sketches a way to improve both the quality and usefulness of reference grammars.

Limitations of present reference grammars

At present, reference grammars are books. Consequently, they share the following properties:

  1. They are static, having been fixed at the time the author turns the final manuscript over to the publisher.
  2. Their organization is inevitably linear. A typical reference grammar has a dozen or more chapters, one following another. These are organized into sections (one after another), subsections, and so forth.
  3. To help the user find information, they generally have a table of contents and sometimes an index. They may have cross-references within the text. (For example, Li and Thompson's Mandarin Chinese has all three.)
  4. Virtually the only way a grammar refers to the corpus on which it is based is to include examples drawn from the corpus.
  5. Lexical information about the language is generally assumed to be handled in another document, a dictionary. It may be referred to or cross-referenced from the grammar.
  6. To overcome limitations 4 and 5, many reference grammars include a vocabulary and/or a collection of texts. (For example, Willem Adelaar's Tarma Quechua: Grammar, vocabulary, texts includes both.)
For each of these properties, there are consequent disadvantages:
  1. Because (written) grammars are static, they cannot be easily modified: they can be neither corrected nor expanded. This is a serious limitation because no grammar is ever perfect or complete. Some of the most-used grammars have been developed over many years, with input from various scholars (see, for example, R. Funk's Preface to his English translation of Blass and Debrunner's A Greek grammar of the New Testament). It is also a serious limitation because a language is not static. As language changes, it should be possible to amend the grammar to reflect the changes. It should be possible to expand the corpus on which the grammar is based and perhaps encompass closely related languages.
  2. The linear organization of grammars in no way reflects the structure of language itself. Language is an organic whole, a complex of subsystems so tightly interwoven that change in one part generally has consequences in many other parts. Forcing a grammar into an outline is, in itself, a misrepresentation of its structure (one that I suspect has led to considerable frustration for most grammar writers).
  3. Mechanisms like a table of contents, an index, and cross-references are at best crude aids for finding information, inevitably causing the user a great deal of page flipping. I am perhaps not unique in frequently wishing, upon searching a grammar for some information, that I had a few more fingers to keep my place in various references.
  4. Citing examples as the way a grammar relates to a corpus presents many problems:
  5. There is nothing natural or convenient about handling lexical information separately from the grammar. The lexicon is an integral part of the language. (Some recent theoretical approaches regard the lexicon as a unique or preeminent structure around which the grammar is organized.) Further, the problems of including sufficient examples with adequate context (as discussed above) are just as severe for the dictionary as for the grammar.

  6. It is valuable to include a vocabulary and a collection of texts in a reference grammar, but since cross-referencing between the grammar and these is difficult for both the author and the user, they are never tightly integrated with the grammar.

These, then, are some of the limitations of current reference grammars. These limitations are the consequence of a grammar being a book.

The reference grammar of the future

The purpose of this paper is to suggest that the above-mentioned limitations could now be overcome with computation: a reference grammar should be an online, interactive, information management system, built around a corpus.

A reference grammar should be implemented in an information management system that minimally provides hypertextual organization, rich cross-referencing, and user-directed navigation (browsing). A rudimentary example might be the info system developed for emacs and later adopted as the primary documentation delivery system for the Free Software Foundation's software.

The lexicon, like the grammar, would be an information structure, integrated through cross-referencing with the rest of the grammar. Direct access to a morpheme's entry should be possible on the basis of the characters out of which that morpheme is formed, as afforded by the alphabetical organization of a traditional dictionary. Because the lexicon is organized as an information structure (database), it should be possible to access a morpheme by its semantic domain, part of speech, or by whatever information is included in lexical entries. Conversely, from a lexical entry it should be possible to access examples in the corpus.
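
The access paths just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names LexicalEntry, Lexicon, lookup, find, and forms_with_prefix are hypothetical, as are the semantic domain labels; the four entries are taken from the Quechua example discussed in the section on the corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    form: str             # citation form of the morpheme
    gloss: str
    part_of_speech: str
    semantic_domain: str  # hypothetical label, for illustration only


class Lexicon:
    """Toy lexicon: direct access by form, plus lookup by any other field."""

    def __init__(self, entries):
        self.by_form = {}
        self.by_field = defaultdict(list)  # (field name, value) -> entries
        for entry in entries:
            self.by_form[entry.form] = entry
            for name in ("gloss", "part_of_speech", "semantic_domain"):
                self.by_field[(name, getattr(entry, name))].append(entry)

    def lookup(self, form):
        """Direct access by the characters of the morpheme."""
        return self.by_form.get(form)

    def find(self, name, value):
        """Access by part of speech, semantic domain, or gloss."""
        return list(self.by_field.get((name, value), []))

    def forms_with_prefix(self, prefix):
        """Alphabetical browsing, as in a traditional dictionary."""
        return sorted(f for f in self.by_form if f.startswith(prefix))


lexicon = Lexicon([
    LexicalEntry("qeru", "wood", "noun", "materials"),
    LexicalEntry("rura-", "do", "verb", "actions"),
    LexicalEntry("-pita", "ABL", "suffix", "case"),
    LexicalEntry("-sha", "PART", "suffix", "verbal morphology"),
])
```

With such a structure, lexicon.find("part_of_speech", "suffix") returns the entries for -pita and -sha, while lexicon.lookup("qeru") goes straight to a single entry. A real system would also link each entry to its instances in the corpus.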

The corpus

I will now outline how the corpus might be implemented, first describing the skeletal structure and then what should be attached to it. Although much of this is presented in terms of “pointers,” neither the grammar's author(s) nor its readers need to know about these. They should know the system in terms of linguistic constructs (morphemes, functions, and so forth) and in terms of what they can do: create and edit descriptive text, enter text into the corpus, refer to examples, and so forth.

The basic corpus “ring” is composed of four “planes” (collections of similar information with some plane-specific organization), each plane having a different sort of information, this information linked to other planes by pointers:

  1. Text plane: a sequence of pointers to allomorphs. These will be referred to as instances: each pointer corresponds to an instance of the use of a particular morpheme in text. From each pointer, there may be a pointer to a function (described in 4 below).
  2. Allomorph plane: a set of character strings, each of which represents the form in which a morpheme occurs in text. With each allomorph is a pointer to the corresponding morpheme.
  3. Morpheme plane: a set of (names of) morphemes, with associated information about the morpheme (such as underlying form, meaning(s), morphophonemic properties, and so forth). With each morpheme there is one or more pointers to functions.
  4. Function plane: a set of (names of) functions, with perhaps information characterizing it. With each function there is one or more pointers to instances of this function in text.

Consider the following example:
 
2. Qeru-pita  rura-sha.
   wood-ABL   made-PART
   'It is made of wood.'

This might be stored as follows:

  1. The text plane would contain pointers to the character strings qeru, -pita, rura-, and -sha in the allomorph plane. (There may also be a pointer from the instance of -pita to 'material' in the function plane; see 4 below.)
  2. The allomorph plane would contain a pointer from the character string qeru to the morpheme named WOOD, from -pita to ABLATIVE, from rura- to DO, and from -sha to PARTICIPLE. With PARTICIPLE there may be information such as the historical form (-shqa), that it acts as though closing the preceding syllable, and so forth.
  3. The morpheme plane contains (names of) the morphemes WOOD, ABLATIVE, DO, and PARTICIPLE, and with each some information about the morpheme. With a morpheme (name) like ABLATIVE there would be a pointer to the function 'material'.
  4. The function plane contains MATERIAL, with an explanation that one function of the 'ablative' suffix (ABLATIVE) is to mark the material out of which something is made. With MATERIAL are pointers to the instances of -pita in the text plane used in this way, so there is a pointer to -pita in the example Qeru-pita rura-sha.

To summarize, the four planes are related by pointers: from the text plane to the allomorph plane, from the allomorph plane to the morpheme plane, from the morpheme plane to the function plane, and from the function plane to the text plane. (This does not preclude the corresponding back references but the primary utility of the corpus for the grammar writer is provided by the above-mentioned pointers.)
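
The ring of pointers summarized above might be modeled as follows. This is a minimal sketch, not a proposal for the actual implementation: the class names (Function, Morpheme, Allomorph, Instance) and the helper functions gloss and examples_of are assumptions introduced for illustration, and ordinary Python object references stand in for the "pointers."

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Function:                      # function plane
    name: str
    description: str = ""
    instances: List["Instance"] = field(default_factory=list)  # -> text plane


@dataclass(eq=False)
class Morpheme:                      # morpheme plane
    name: str
    functions: List[Function] = field(default_factory=list)    # -> function plane


@dataclass(eq=False)
class Allomorph:                     # allomorph plane
    form: str
    morpheme: Morpheme               # -> morpheme plane


@dataclass(eq=False)
class Instance:                      # one element of the text plane
    allomorph: Allomorph             # -> allomorph plane
    function: Optional[Function] = None


material = Function("MATERIAL", "marks the material out of which something is made")
wood, do, participle = Morpheme("WOOD"), Morpheme("DO"), Morpheme("PARTICIPLE")
ablative = Morpheme("ABLATIVE", [material])

qeru, pita = Allomorph("qeru", wood), Allomorph("-pita", ablative)
rura, sha = Allomorph("rura-", do), Allomorph("-sha", participle)

text_plane = [Instance(qeru), Instance(pita, material), Instance(rura), Instance(sha)]
material.instances.append(text_plane[1])   # close the ring: function -> text


def gloss(instances):
    """Pair each surface form with the name of its morpheme."""
    return [(i.allomorph.form, i.allomorph.morpheme.name) for i in instances]


def examples_of(morpheme):
    """Follow morpheme -> function -> text to list attested uses by function."""
    return {f.name: [i.allomorph.form for i in f.instances]
            for f in morpheme.functions}
```

Starting from the instance of -pita and following allomorph, morpheme, function, and instance in turn leads back to the same instance, which is the sense in which the planes form a ring.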

This skeleton would be enriched with various types of information. The most pervasive would be descriptive text associated with the primary elements of each plane:

  1. Description in the text plane would be primarily about instances; for example, something might be said about a particular instance of -pita. Further, it should be possible to attach labeled bracketings to portions of the text plane (relative clauses, compound nouns, and so forth) so that these can be referred to from the grammar. And it should be possible to add descriptive text to the labeled bracketing itself.
  2. Description in the allomorph plane might, for example, discuss phonetic detail (for example, a comment with yka might say that /k/ is often palatalized) or properties particular to that allomorph (for example, with ki that it must follow /i/), and so forth.
  3. Description in the morpheme plane might, for example, discuss the morphophonemic properties of the morpheme making reference to its allomorphs. Also, there should be a label for each morpheme, to be used in morpheme-by-morpheme glossing.
  4. Description in the function plane could give characterizations of the function. For example, the MATERIAL function of the ABLATIVE morpheme -pita should explain that one use of -pita is to indicate the material out of which something is made or formed. This sort of characterization allows one to avoid the often valid criticism that such functions are simply labels. (As Lakoff once observed, 'Pedro', 'Irving', and 'Sam' would do as well.)
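
Labeled bracketings over the text plane, with descriptive text attached, could be represented roughly as below. This is a sketch under assumptions: the Bracket class, the two helper functions, and the labels "ablative phrase" and "clause" are all hypothetical and serve only to show the idea of referring to a stretch of text by label.

```python
from dataclasses import dataclass


@dataclass
class Bracket:
    label: str       # hypothetical label, e.g. a construction type
    start: int       # index of the first instance covered (inclusive)
    end: int         # index just past the last instance covered (exclusive)
    note: str = ""   # descriptive text attached to the bracketing itself


# The text plane is reduced here to a list of allomorph strings.
text_plane = ["qeru", "-pita", "rura-", "-sha"]

brackets = [
    Bracket("ablative phrase", 0, 2, "the ablative marks material here"),
    Bracket("clause", 0, 4),
]


def spans_labeled(label):
    """Return every stretch of the text plane that carries the given label."""
    return [text_plane[b.start:b.end] for b in brackets if b.label == label]


def brackets_covering(index):
    """Return the labels of all bracketings that include a given instance."""
    return [b.label for b in brackets if b.start <= index < b.end]
```

Because a bracketing refers to positions in the text plane rather than copying the text, a correction to the text is seen by every part of the grammar that cites the bracketing.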

Various advantages would follow from storing a corpus this way:

  1. Whenever errors are corrected, or a region of text is further analyzed, this information automatically spreads to any other reference to this region of text.
  2. More useful concordances could be generated: each morpheme could be subdivided by function and examples could be generated with interlinear, morpheme-by-morpheme glossing. Complete concordances could be generated (and stored or printed).
  3. More significantly, as users wish to see examples of a certain morpheme, or examples of one of its functions, they can immediately see all cases or, if they choose, just a few. These would be formatted on the screen with morpheme-by-morpheme glossing. And if, for a particular example, they wish to see more context, they could access that example and scan back and forth across the region of the text plane that contains it. As they do so, the text is formatted on the screen with the glossing.
  4. It would be possible to write a single grammar for closely related languages, providing a corpus for each language. The text plane for the related language would be entirely different so the user could ask to see, for example, Pachitea Quechua examples rather than Huallaga Quechua ones. The allomorph plane would be largely shared, differing only where there are dialect differences and actual differences. If the two dialects are very closely related, that is, there is a one-to-one correspondence between morphemes and between functions, then they would completely share the morpheme and function planes.
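
As an illustration of the second and third advantages, here is a minimal sketch of a concordance subdivided by function, with interlinear glossing and adjustable context. The flat tuple representation, the text identifier "ex2", and the function names interlinear and concordance are assumptions made for brevity; the only data is the example sentence given earlier.

```python
from collections import defaultdict

# Each token: (allomorph, gloss label, function or None).
corpus = {
    "ex2": [("qeru", "wood", None), ("-pita", "ABL", "MATERIAL"),
            ("rura-", "made", None), ("-sha", "PART", None)],
}


def interlinear(tokens):
    """Return two column-aligned lines: the forms and their glosses."""
    widths = [max(len(form), len(gloss)) for form, gloss, _ in tokens]
    forms = " ".join(f.ljust(w) for (f, _, _), w in zip(tokens, widths))
    glosses = " ".join(g.ljust(w) for (_, g, _), w in zip(tokens, widths))
    return forms.rstrip(), glosses.rstrip()


def concordance(corpus, context=1):
    """Group instances by (gloss, function), keeping `context` tokens each side."""
    entries = defaultdict(list)
    for text_id, tokens in corpus.items():
        for i, (_, gloss, function) in enumerate(tokens):
            window = tokens[max(0, i - context): i + context + 1]
            entries[(gloss, function)].append((text_id, interlinear(window)))
    return entries
```

Raising the context argument corresponds to a user asking to see more of the surrounding text for a particular example.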

Access and privilege

A computational reference grammar should have a system of access and privileges. Novices should not be allowed to modify things. It should be possible to protect information that an author is not yet willing for others to see. Some participants in the grammar writing task might need and merit full access, while others might only relate to the portion for which they are responsible. Some co-workers might be given certain capabilities, for example, to add texts to the corpus, but not allowed to modify the grammar, this privilege being limited to the primary authors or editors.

While a system of access and privileges may seem somewhat silly when applied to something like a reference grammar, it is an important feature if two or more scholars collaborate on a grammar, and if its use is to be made available to other scholars and the public at large. Scholars need credit for the work they do. They do not want others looking over their shoulders when they have work in progress. They must protect unpublished work from competing colleagues. And they must protect their precious data and insights from accidental loss or intentional damage by others. Scholars will not want to collaborate in the writing of a grammar in the computational framework sketched above unless they can be sure of getting a fair shake.
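
A system of this kind need not be elaborate. The following sketch, with hypothetical role names, privilege names, and user names, shows one way the distinctions drawn above (readers, contributors who may add texts, and authors who may modify the grammar and keep drafts private) might be expressed.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Privilege(Flag):
    NONE = 0
    VIEW = auto()
    ADD_TEXT = auto()
    MODIFY_GRAMMAR = auto()


# Hypothetical roles; a real system would let the authors define these.
ROLES = {
    "reader": Privilege.VIEW,
    "contributor": Privilege.VIEW | Privilege.ADD_TEXT,
    "author": Privilege.VIEW | Privilege.ADD_TEXT | Privilege.MODIFY_GRAMMAR,
}


@dataclass
class Section:
    title: str
    owner: str
    published: bool = False   # drafts are visible only to their owner


def allowed(role, privilege):
    """True if the role carries the privilege; unknown roles get nothing."""
    return privilege in ROLES.get(role, Privilege.NONE)


def can_view(user, role, section):
    """Work in progress stays hidden until its owner publishes it."""
    if not allowed(role, Privilege.VIEW):
        return False
    return section.published or section.owner == user
```

Recording an owner for each section also provides a natural place to give scholars credit for the portions they wrote.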

Users should be able to explore a language through the grammar (information management system) and see examples from the corpus. Signposts indicating what information is available and where, together with an easy and direct way to reach it, will make that possible.

If properly organized, the grammar would serve a wide range of users, allowing them to determine the degree of detail and the number of examples they wish to see. Thus, the primary function for users is online, interactive exploration.

Users may also be given the capability of generating printed portions of the grammar, lexicon, or corpus. That is, they should be able to generate self-tailored documents of the sort that are traditionally written about a language. Whether generating a grammar description, a dictionary or a concordance, they should be able to determine the course through the information structures that the document will take, the degree of detail to be included at each level, and what examples (and how much context) are to be included.

Depending on the situation, the document generation capability might be given to only certain users, or it might be restricted entirely to the grammar writers.

Some users and the grammar writers should be given the capability of adding text to the corpus. This might involve interactively working through a text, breaking words into allomorphs, indicating the morphemes to which these correspond, adding glosses and lexical entries, indicating functions, tying certain examples to topics in the grammar, and so forth.

Corpus entry could be computer-assisted. For example, the entry of Quechua text might be aided by a morphological parser, which could display one or more possible analyses of a word on the screen and ask the user to choose, edit, or verify it.

Some of the corpus entry would necessarily be automated. For example, once an alternative is given, the word could automatically be entered into the text plane, with pointers automatically generated to the relevant allomorphs and morphemes of those planes. The user could then be asked, morpheme by morpheme, to identify functions and, upon responding, the corresponding pointers (to the morpheme and text planes) would automatically be generated.
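
The entry loop described in the last two paragraphs might look something like the sketch below. Note the assumptions: the naive exhaustive segmenter in analyses is only a stand-in for a real morphological parser, the function names and the dictionary-based text plane are hypothetical, and the unhyphenated input words are derived from the earlier example.

```python
# Known allomorphs: surface string -> (display form, morpheme name).
ALLOMORPHS = {
    "qeru": ("qeru", "WOOD"),
    "pita": ("-pita", "ABLATIVE"),
    "rura": ("rura-", "DO"),
    "sha": ("-sha", "PARTICIPLE"),
}


def analyses(word):
    """Return every way of segmenting `word` into known allomorph strings."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in ALLOMORPHS:
            for rest in analyses(word[i:]):
                results.append([head] + rest)
    return results


def enter_word(word, text_plane, choose=lambda options: options[0]):
    """Offer the candidate analyses to `choose`, then generate the pointers.

    `choose` stands in for the interactive step in which the user picks,
    edits, or verifies an analysis. Functions are left unset (None) so
    that the user can be asked about them morpheme by morpheme afterwards.
    """
    options = analyses(word)
    if not options:
        return False   # no analysis available; the word needs manual entry
    for piece in choose(options):
        form, morpheme = ALLOMORPHS[piece]
        text_plane.append({"allomorph": form, "morpheme": morpheme,
                           "function": None})
    return True
```

Passing a different choose function, for instance one that prompts at the terminal, would supply the interactive part without changing the automated part.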

It should be possible to enter a text without giving a full analysis of it. That is, large portions of some text (in the text plane) might not even be broken into morphemes. These would be refined as the grammar develops and insights are gained. To reiterate, a full analysis of a text should not be a prerequisite to including a text in the text plane; the system should be designed so as to allow the grammar and corpus to mature with the linguist's knowledge of the language.
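
One simple way to allow for partial analysis is to let each word in the text plane carry an optional segmentation that can be filled in later. The sketch below assumes hypothetical names (Word, refine, coverage) and uses unhyphenated forms of the earlier example as the raw text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    surface: str                          # the word as it appears in the text
    morphs: Optional[List[str]] = None    # None means "not yet analyzed"

    @property
    def analyzed(self):
        return self.morphs is not None


def refine(word, morphs):
    """Attach a segmentation, checking that it spells out the surface form."""
    rebuilt = "".join(m.strip("-") for m in morphs)
    if rebuilt.lower() != word.surface.lower():
        raise ValueError(f"{morphs} does not spell {word.surface!r}")
    word.morphs = list(morphs)


def coverage(text):
    """Fraction of words in the text that have been broken into morphemes."""
    return sum(w.analyzed for w in text) / len(text) if text else 0.0


# A text entered with no analysis at all.
text = [Word("Qerupita"), Word("rurasha")]
```

Calling refine(text[0], ["qeru", "-pita"]) analyzes the first word and leaves the second untouched, so the text is usable in the corpus at every stage of the work.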

The author(s) must, of course, have the capability of modifying the grammar itself, that is, adding or modifying topics (descriptive text) as well as relating these to the corpus, lexicon and, possibly, ethnography.

Conclusion

Implementing reference grammars along the lines proposed here would afford many advantages:

  1. The grammar would be easy to modify.
  2. More than ever before the corpus could be integrated with the grammar, lexicon and, ideally, an ethnography.
  3. Users would be able to tailor their search for knowledge about a language to fit their particular needs.
  4. It would be possible to generate a number of traditional documents about a language, including essays, dictionaries, concordances, and grammars.
  5. It would be possible to write a single grammar for two or more closely related languages, incorporating a separate corpus (at least a separate text plane) to document the claims for each “dialect.”
  6. It would be possible to write the grammar in two or more languages (for example, in English and Spanish). These would share the corpus and the structure tying together the whole structure (grammar, lexicon, and corpus). But wherever there is descriptive text in one language, there could be a corresponding translation.
  7. Multiple people could cooperate in writing a grammar more easily than is currently possible.

Let me close with an analogy. Let's imagine ourselves in times B.C., holding a 100-foot scroll with a reference grammar written on it. Someone observes that a scroll is a rather clumsy way to have this information, to which we respond, “It's not that difficult. We manage, don't we? And it sure beats several hundred clay tablets!” And how would we react to the suggestion that the scroll be cut into pages, numbering these and creating a table of contents and an index (whatever those are!), all for the unsubstantiated claim that this would make the information more accessible?

I believe that we are now at a similar moment. If suitable software were developed, an information structure that integrates a reference grammar, a corpus, a lexicon, and perhaps an ethnography could be developed, with remarkable benefits for both the authors and their readers.


Linguistic Exploration Workshop