In 1974 at the outset of my study of Quechua, I had the good fortune to learn from Harold Gleason how to make and use an exhaustive morpheme concordance on paper slips. For several years thereafter I organized my Huallaga Quechua data, both texts and elicited sentences, around this file, building a solid corpus and learning from it.
In 1983 I defended my dissertation, a reference grammar of Huallaga (Huanuco) Quechua. During my defense I complained about having to describe a grammar by means of a book, and sketched what I thought a grammar should be in the computational age: an information management system at the heart of which would be a corpus somewhat like the one I had built on paper slips. These ideas were later published in Notes on Linguistics 33:28-38 (Summer Institute of Linguistics, 1986). What appears below is a slightly edited and shortened version of that paper.
In 1989 Gary Simons circulated a proposal to develop a Computational Environment for Linguistic and Literary Research (CELLAR). I had hoped to implement the Huallaga grammar in this environment but it never matured into a practical tool for that purpose. Having recently talked with Gary, I am now optimistic that CELLAR's offspring, SIL's “Santa Fe” initiative, will provide the environment for writing grammars as outlined below.
A reference grammar is a collection of information about a language, organized in such a way that users can access that information. Reference grammars are an important class of documents for the following reasons: First, they serve a language community
The quality of a reference grammar is in proportion to how accurately it characterizes the language. Its usefulness is inversely proportional to how much work it takes a user to find information in it. This paper sketches a way to improve both the quality and usefulness of reference grammars.
At present, reference grammars are books. Consequently, they share the following properties:
232.  Kuti-mu-r-raq        miku-shaq.
      return-afar-ADV-yet  eat-1FUT
      `I will eat when I return.'
For languages with considerable morphophonemics, it may be desirable to introduce yet another tier between the phonetic form and the morphological form. This is extremely helpful to the user, but takes more space and requires more work on the part of the grammar writer.
Grammar writers are perpetually in a quandary about how many examples to include. Accommodating the casual reader, they may include few, thus offending the reader who wishes to see more. Accommodating the more serious scholar, they may include many examples, making the grammar less readable and more expensive.
Examples are generally cited out of context. It is not feasible to include a reasonable portion of the text from which each example is drawn. Yet this context is often crucial to understanding the example. This need is not limited to studies that consider units larger than a sentence. Traditional grammars (and linguistic articles in general) have labored and continue to labor under the tyranny of the page: the impracticality of providing adequate context with examples.
A particular example may be relevant at various points within the grammar. In such cases, the text of the grammar can give the example once and from then on simply refer to it. The disadvantage is that the user must do a lot of page flipping to find the examples. Alternatively, the example can be repeated wherever it is relevant. But this leads to more expense, both in the size of the book and the amount of effort required to create the example.
In a traditional reference grammar, a correction made to one example does not automatically correct all the cases where this example may be used. If one publishes a collection of texts, generates a concordance from those texts, uses parts of them for examples, et cetera, insights and corrections made in later uses are almost impossible to propagate to all the other uses.
There is nothing natural or convenient about handling lexical information separately from the grammar. The lexicon is an integral part of the language. (Some recent theoretical approaches regard the lexicon as a unique or preeminent structure around which the grammar is organized.) Further, the problems of including sufficient examples with adequate context (as discussed above) are just as severe for the dictionary as for the grammar.
It is valuable to include a vocabulary and collection of texts in a reference grammar, but since cross-referencing between the grammar and these is difficult for both the author and the user, they are never tightly integrated with the grammar.
These, then, are some of the limitations of current reference grammars. These limitations are the consequence of a grammar being a book.
The purpose of this paper is to suggest that the above-mentioned limitations could now be overcome with computation: a reference grammar should be an online, interactive, information management system, built around a corpus.
A reference grammar should be implemented in an information management system that minimally provides hypertextual organization, rich cross-referencing, and user-directed navigation (browsing). A rudimentary example might be the Info system developed for Emacs and later adopted as the primary documentation delivery system for the Free Software Foundation's software.
The lexicon, like the grammar, would be an information structure, integrated through cross-referencing with the rest of the grammar. Direct access to a morpheme's entry should be possible on the basis of the characters out of which that morpheme is formed, as afforded by the alphabetical organization of a traditional dictionary. Because the lexicon is organized as an information structure (database), it should be possible to access a morpheme by its semantic domain, part of speech, or by whatever information is included in lexical entries. Conversely, from a lexical entry it should be possible to access examples in the corpus.
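The kinds of access just described can be illustrated with a minimal sketch. It is not an implementation of any existing system: the `Lexicon` class, its field names, and the part-of-speech and semantic-domain values are assumptions made for illustration; only the forms, glosses, and example numbers come from this paper.

```python
from collections import defaultdict

# Hypothetical entries. Forms, glosses, and example numbers are taken from
# the examples in this paper; "pos" and "domain" values are illustrative
# assumptions, not an analysis of Quechua.
ENTRIES = [
    {"form": "qeru", "gloss": "wood", "pos": "noun", "domain": "materials", "examples": [2]},
    {"form": "rura", "gloss": "made", "pos": "verb", "domain": "actions", "examples": [2]},
    {"form": "miku", "gloss": "eat", "pos": "verb", "domain": "actions", "examples": [232]},
    {"form": "kuti", "gloss": "return", "pos": "verb", "domain": "motion", "examples": [232]},
]


class Lexicon:
    """A lexicon that can be entered through any field of its entries."""

    def __init__(self, entries):
        self.entries = list(entries)
        self._indexes = {}  # field name -> {value -> [entries]}

    def lookup(self, field_name, value):
        """Return all entries whose field equals value, indexing on demand."""
        if field_name not in self._indexes:
            index = defaultdict(list)
            for entry in self.entries:
                index[entry.get(field_name)].append(entry)
            self._indexes[field_name] = index
        return self._indexes[field_name].get(value, [])

    def with_prefix(self, prefix):
        """Alphabetical access, as in a traditional dictionary."""
        return sorted(e["form"] for e in self.entries if e["form"].startswith(prefix))


lexicon = Lexicon(ENTRIES)
verbs = [entry["form"] for entry in lexicon.lookup("pos", "verb")]
examples_for_qeru = lexicon.lookup("form", "qeru")[0]["examples"]
```

Here `examples_for_qeru` holds example numbers, standing in for the pointers from a lexical entry into the corpus.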
I will now outline how the corpus might be implemented, first describing the skeletal structure and then what should be attached to it. Although much of this is presented in terms of “pointers,” neither the grammar's author(s) nor its readers need to know about these. They should know the system in terms of linguistic constructs (morphemes, functions, and so forth) and in terms of what they can do: create and edit descriptive text, enter text into the corpus, refer to examples, and so forth.
The basic corpus “ring” is composed of four “planes” (collections of similar information with some plane-specific organization), each plane having a different sort of information, this information linked to other planes by pointers:
Consider the following example:
2.  Qeru-pita  rura-sha.
    wood-ABL   made-PART
    `It is made of wood.'
This might be stored as follows:
To summarize, the four planes are related by pointers: from the text plane to the allomorph plane, from the allomorph plane to the morpheme plane, from the morpheme plane to the function plane, and from the function plane to the text plane. (This does not preclude the corresponding back references but the primary utility of the corpus for the grammar writer is provided by the above-mentioned pointers.)
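The ring of pointers summarized above might be sketched as follows for example 2. This is only one possible reading: the class names and fields are assumptions, object references stand in for "pointers," and the function labels are placeholders copied from the glosses rather than a claim about how functions would actually be named.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    label: str
    # Pointers back into the text plane: (word index, allomorph index).
    occurrences: list = field(default_factory=list)


@dataclass
class Morpheme:
    gloss: str
    functions: list = field(default_factory=list)  # pointers into the function plane


@dataclass
class Allomorph:
    form: str
    morpheme: Morpheme  # pointer into the morpheme plane


@dataclass
class Word:
    allomorphs: list  # pointers into the allomorph plane, in order


def build_example():
    """Store example 2, `Qeru-pita rura-sha', across the four planes."""
    # Function plane (placeholder labels, assumed for illustration).
    f_abl, f_part = Function("ABL"), Function("PART")
    # Morpheme plane.
    m_wood, m_made = Morpheme("wood"), Morpheme("made")
    m_abl, m_part = Morpheme("ABL", [f_abl]), Morpheme("PART", [f_part])
    # Allomorph plane.
    a_qeru, a_pita = Allomorph("qeru", m_wood), Allomorph("pita", m_abl)
    a_rura, a_sha = Allomorph("rura", m_made), Allomorph("sha", m_part)
    # Text plane.
    text = [Word([a_qeru, a_pita]), Word([a_rura, a_sha])]
    # Close the ring: each function points back to where it occurs in the text.
    for w_index, word in enumerate(text):
        for a_index, allomorph in enumerate(word.allomorphs):
            for function in allomorph.morpheme.functions:
                function.occurrences.append((w_index, a_index))
    return text, [f_abl, f_part]


def form_line(text):
    return " ".join("-".join(a.form for a in w.allomorphs) for w in text)


def gloss_line(text):
    return " ".join("-".join(a.morpheme.gloss for a in w.allomorphs) for w in text)


text, functions = build_example()
```

Following the pointers from the text plane regenerates the interlinear lines, and following them from a function leads back to every place in the text where that function is attested.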
This skeleton would be enriched with various types of information. The most pervasive would be descriptive text associated with the primary elements of each plane:
Various advantages would follow from storing a corpus this way:
A computational reference grammar should have a system of access and privileges. Novices should not be allowed to modify things. It should be possible to protect information that an author is not yet willing for others to see. Some participants in the grammar writing task might need and merit full access, while others might only relate to the portion for which they are responsible. Some co-workers might be given certain capabilities, for example, to add texts to the corpus, but not allowed to modify the grammar, this privilege being limited to the primary authors or editors.
While a system of access and privileges may seem somewhat silly when applied to something like a reference grammar, it is an important feature if two or more scholars collaborate on a grammar, and if its use is to be made available to other scholars and the public at large. Scholars need credit for the work they do. They do not want others looking over their shoulders when they have work in progress. They must protect unpublished work from competing colleagues. And they must protect their precious data and insights from accidental loss or intentional damage by others. Scholars will not want to collaborate in the writing of a grammar in the computational framework sketched above unless they can be sure of getting a fair shake.
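A system of access and privileges of this kind could be quite small. The sketch below is an assumption-laden illustration: the role names, the `Privilege` flags, and the `draft`/`owner` fields are hypothetical, chosen only to mirror the distinctions drawn above (readers, co-workers who may add texts, and authors who may modify the grammar).

```python
from enum import Flag, auto


class Privilege(Flag):
    READ = auto()
    ADD_TEXT = auto()       # add texts to the corpus
    EDIT_GRAMMAR = auto()   # modify descriptive text


# Hypothetical roles; a real system would define its own.
ROLES = {
    "reader": Privilege.READ,
    "co-worker": Privilege.READ | Privilege.ADD_TEXT,
    "author": Privilege.READ | Privilege.ADD_TEXT | Privilege.EDIT_GRAMMAR,
}


def allowed(role, privilege):
    """True if the role carries the privilege; unknown roles get nothing."""
    return privilege in ROLES.get(role, Privilege(0))


def can_view(section, user, role):
    """Work in progress is visible only to its owner."""
    if section["draft"] and user != section["owner"]:
        return False
    return allowed(role, Privilege.READ)


# Hypothetical section record and user names, for illustration only.
draft_section = {"owner": "author_a", "draft": True}
```

With these definitions a co-worker can add texts but not edit the grammar, and a draft section stays hidden from everyone except the author who owns it.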
Users should be able to explore a language through the grammar (information management system) and see examples from the corpus. Signposts indicating what information is available and where, together with an easy and direct way to get to that information, will make this possible.
If properly organized, the grammar would serve a wide range of users, allowing them to determine the degree of detail and the number of examples to be seen. Thus, the primary function for users is online, interactive exploration.
Users may also be given the capability of generating printed portions of the grammar, lexicon, or corpus. That is, they should be able to generate self-tailored documents of the sort that are traditionally written about a language. Whether generating a grammar description, a dictionary or a concordance, they should be able to determine the course through the information structures that the document will take, the degree of detail to be included at each level, and what examples (and how much context) are to be included.
Depending on the situation, the document generation capability might be given to only certain users, or it might be restricted entirely to the grammar writers.
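Generating a self-tailored document amounts to walking the information structure with user-chosen limits. The following is a minimal sketch under stated assumptions: the `GRAMMAR` topic tree, its titles, and the `render` function are hypothetical, and only the example sentence comes from this paper.

```python
# Hypothetical topic tree; the titles are illustrative, not the
# organization of any actual grammar.
GRAMMAR = {
    "title": "Case suffixes",
    "subtopics": [
        {
            "title": "Ablative -pita",
            "examples": ["Qeru-pita rura-sha. `It is made of wood.'"],
        },
    ],
}


def render(topic, max_depth, max_examples, depth=0):
    """Return the lines of a document, limited in detail and in examples."""
    indent = "  " * depth
    lines = [indent + topic["title"]]
    # The user decides how many examples to include at each topic.
    for example in topic.get("examples", [])[:max_examples]:
        lines.append(indent + "  e.g. " + example)
    # The user also decides how deep into the structure to descend.
    if depth < max_depth:
        for subtopic in topic.get("subtopics", []):
            lines.extend(render(subtopic, max_depth, max_examples, depth + 1))
    return lines


overview = render(GRAMMAR, max_depth=0, max_examples=0)
detailed = render(GRAMMAR, max_depth=1, max_examples=1)
```

The same structure yields a one-line overview or a detailed section with examples, depending only on the limits the user supplies.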
Some users and the grammar writers should be given the capability of adding text to the corpus. This might involve interactively working through a text, breaking words into allomorphs, indicating the morphemes to which these correspond, adding glosses and lexical entries, indicating functions, tying certain examples to topics in the grammar, and so forth.
Corpus entry could be computer-assisted. For example, the entry of Quechua text might be aided by the morphological parser, which could display one or more possible analyses of a word on the screen and ask the user to choose, edit, or verify it.
Some of the corpus entry would necessarily be automated. For example, once an alternative is given, the word could automatically be entered into the text plane, with pointers automatically generated to the relevant allomorphs and morphemes of those planes. The user could then be asked, morpheme by morpheme, to identify functions and, upon responding, the corresponding pointers (to the morpheme and text planes) would automatically be generated.
It should be possible to enter a text without giving a full analysis of it. That is, large portions of some text (in the text plane) might not even be broken into morphemes. These would be refined as the grammar develops and insights are gained. To reiterate, a full analysis of a text should not be a prerequisite to including a text in the text plane; the system should be designed so as to allow the grammar and corpus to mature with the linguist's knowledge of the language.
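The entry workflow described above, including words that are stored without analysis, might look roughly like this. The sketch does not reproduce any real morphological parser: `analyses` is a toy segmenter over a hypothetical list of roots and suffixes (the forms themselves are taken from the examples in this paper), and `choose` stands in for the user's interactive choice.

```python
# Toy inventory, assumed for illustration; forms come from the paper's examples.
ROOTS = ["qeru", "rura", "miku", "kuti"]
SUFFIXES = ["pita", "sha", "shaq", "mu", "r", "raq"]


def analyses(word, roots=ROOTS, suffixes=SUFFIXES):
    """Return every segmentation of word as a known root plus known suffixes."""
    results = []

    def extend(rest, parts):
        if not rest:
            results.append(parts)
            return
        for suffix in suffixes:
            if rest.startswith(suffix):
                extend(rest[len(suffix):], parts + [suffix])

    for root in roots:
        if word.startswith(root):
            extend(word[len(root):], [root])
    return results


def enter_word(word, text_plane, choose):
    """Add a word to the text plane, analyzed if possible."""
    candidates = analyses(word)
    if not candidates:
        # No analysis yet: store the word anyway, to be refined later.
        text_plane.append({"form": word, "morphs": None})
        return None
    chosen = choose(candidates)  # the user picks (or would edit) an analysis
    text_plane.append({"form": word, "morphs": chosen})
    return chosen


text_plane = []
first = enter_word("kutimurraq", text_plane, choose=lambda options: options[0])
second = enter_word("xyz", text_plane, choose=lambda options: options[0])
```

In a fuller system the chosen analysis would also generate the pointers into the allomorph and morpheme planes; here the unanalyzed word simply waits in the text plane with `morphs` set to `None`.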
The author(s) must, of course, have the capability of modifying the grammar itself, that is, adding or modifying topics (descriptive text) as well as relating these to the corpus, lexicon and, possibly, ethnography.
Implementing reference grammars along the lines proposed here would afford many advantages:
Let me close with an analogy. Let's imagine ourselves in times B.C., holding a 100-foot scroll with a reference grammar written on it. Someone observes that a scroll is a rather clumsy way to have this information, to which we respond, “It's not that difficult. We manage, don't we? And it sure beats several hundred clay tablets!” And how would we react to the suggestion that the scroll be cut into pages, that these be numbered, and that a table of contents and an index (whatever those are!) be created, all for the unsubstantiated claim that this would make the information more accessible?
I believe that we are now at a similar moment. If suitable software were developed, an information structure that integrates a reference grammar, a corpus, a lexicon, and perhaps an ethnography could be developed, with remarkable benefits for both the authors and their readers.