SEARCHING FOR REFLEXES IN LINGUISTIC ARCHIVES: THE ENDANGERED LANGUAGE FUND INTERNET ALGONQUIAN LANGUAGES ARCHIVE (ELFIALA)

D. H. Whalen1,2, K. David Harrison1,3 and Dennis Holt1,4
1Endangered Language Fund, Inc. 2Haskins Laboratories 3Institute for Research in Cognitive Science 4Quinnipiac University
elf@haskins.yale.edu

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.


Abstract. Although there is an increasing amount of language material on the web, most of the concerns in the design of such databases have been language-internal. For the Endangered Language Fund Internet Algonquian Languages Archive (ELFIALA), we wanted to have comparative data easily accessible as well. The Algonquian languages present an interesting test case for this, since their family relationships have been fairly well worked out and there are active projects that could make use of the information. The most compact way of making proto-Algonquian roots searchable would be to add a field to an electronic lexicon. For those texts and datasets that do not have an associated lexicon, however, this would have limited utility. Though redundant, it may be advisable to make this information available as a separate data stream in any of the major data types so that they can be searched easily whether a lexicon is involved or not. As with other fields, it would be best if the display of this root field could be toggled on and off. Making it available in as many Algonquian texts as possible would greatly aid the reconstruction of languages from historical records, especially in making sound files easily accessible.


1. INTRODUCTION

The Endangered Language Fund is beginning an archive of Algonquian material on the web, the Endangered Language Fund Internet Algonquian Languages Archive (ELFIALA). The archive will serve several purposes. One is to provide a permanent home for recorded material that might otherwise languish in desk drawers. This is particularly urgent for tapes made in the 1960s, when a relatively volatile form of plastic was used for the tapes; in a sense, some languages are going extinct again, as the recordings of their last speakers disintegrate. A second purpose is the obvious one of making the individual recordings available to heritage language learners, linguists, and other interested parties. A somewhat unusual purpose, though, is the aim of making reflexes of Algonquian roots immediately accessible, regardless of language and dataset. These reflexes show the clear relationships among the various Algonquian languages (e.g., Goddard, 1979), and the changes from the original forms are fairly predictable. Knowing these rules and relationships allows for the reconstruction of the roots and the projection of those roots into the daughter languages.

Many of the Algonquian languages have already become extinct, but some are in the process of being revived, and the availability of material from related languages is essential to these efforts. The Mashantucket Pequots, the Mohegans (Granberry, 1998), and the Mashpee Wampanoags (Littledoe, 1998) have all begun this process and have made varying amounts of progress. All of these languages have a fair amount of historical material available, but any revival effort needs to fill in gaps in the record. The Algonquian languages are fortunate in having many extant, related languages with well-defined genetic relationships. This allows us to infer what a form should have been in a particular daughter language, and this way of filling gaps may be more satisfying than simply borrowing a word from a related (or even unrelated) language. In order to make that process easier, it would be helpful to have as many reflexes of a particular Algonquian root as possible available directly.

The best way of providing this capability is on the web, where publishing costs are minimal and access is immediate and universal to anyone with a web link. The mark-up techniques for individual language datasets are being developed both at this workshop and elsewhere. Our plan for making the language family information accessible is to create an extra database field within the transcription of the individual language texts that lists the roots that can be found for the various words in the texts. The addition of this field can be done in advance of the actual creation of a program that would make use of it. We anticipate having a program that would link the various datasets containing these fields to a catalog of roots. Searching for a root would then bring up a list of reflexes in any of the daughter languages for which there is an instance on the web. The value to the heritage language and scholarly communities will be substantial.
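
As an illustration only, the catalog lookup described above might be sketched in Python as follows. The record layout (`Token`), the function names, and all language, dataset, and root labels are hypothetical placeholders rather than part of any existing ELFIALA software; the sketch assumes only that each transcribed word carries a (possibly empty) list of proto-roots.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One transcribed word plus its root field (hypothetical layout)."""
    language: str  # daughter language of the text
    dataset: str   # identifier of the text or dataset
    form: str      # the word as transcribed
    roots: tuple   # proto-roots listed in the root field; may be empty


def build_root_index(tokens):
    """Map each proto-root to every token, in any dataset, that lists it."""
    index = defaultdict(list)
    for tok in tokens:
        for root in tok.roots:
            index[root].append(tok)
    return index


def reflexes(index, root):
    """Return (language, dataset, form) for every recorded reflex of root."""
    return sorted((t.language, t.dataset, t.form) for t in index.get(root, []))


# Placeholder data: these are labels, not real Algonquian forms.
tokens = [
    Token("LangA", "text-01", "formA1", ("*ROOT1",)),
    Token("LangB", "text-07", "formB1", ("*ROOT1", "*ROOT2")),
    Token("LangB", "text-07", "formB2", ()),
]
index = build_root_index(tokens)
```

Calling `reflexes(index, "*ROOT1")` would then list the matching words from both placeholder languages, regardless of which dataset they came from.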

2. IMPLEMENTATION CONSIDERATIONS

The most compact way of listing these proto-Algonquian roots is in an associated lexicon. Indeed, the Indiana Dictionary Database program already has a field for this (Parks et al., 1999). But making this available does not do the work that we have in mind. For many of our texts and other datasets, there will be no dictionary that accompanies the language. Thus there would be no word for the proto-Algonquian roots to attach to. Further, unless the search engine were made sophisticated enough to trace the entry from the dictionary into all of the texts of that language, the immediacy of the search would be lost. It would then be necessary to generate a set of forms in each language that might have the root being searched and perform a secondary search on those forms. Indeed, the multiplicity of derived forms makes this search problematic in itself. It may be more important to see the reflexes of these roots in a wide variety of inflections and derivations, rather than having to provide a search engine that can generate all the morphological forms of a daughter language. As with most issues in storage, this comes down to the trade-off between a compact representation that must be expanded at run time and a distributed, redundant representation that takes more room but allows for a simpler search. Unless we plan to force every language to have a unique electronic lexicon associated with it, the distributed, redundant version strikes us as the more immediately implementable.
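
The trade-off can be made concrete with a small Python sketch. Everything in it is assumed for illustration: the lexicon is a plain headword-to-root mapping, the text is a list of (form, root) pairs, and the forms and roots are placeholder labels. The point is only that a lexicon lookup without a morphological generator finds citation forms, while a root field on each token also finds inflected and derived forms.

```python
# Compact representation: one lexicon entry per headword (hypothetical).
lexicon = {"stemA": "*ROOT1"}

# Distributed representation: every token in the text carries its own
# root field (None where no root has been identified).
text = [
    ("stemA", "*ROOT1"),
    ("prefix-stemA-suffix", "*ROOT1"),
    ("other", None),
]


def search_via_lexicon(root, lexicon, text):
    """Find tokens whose exact form is a headword listed under root."""
    headwords = {head for head, r in lexicon.items() if r == root}
    return [form for form, _ in text if form in headwords]


def search_via_root_field(root, text):
    """Find tokens whose root field names root directly."""
    return [form for form, r in text if r == root]
```

Here `search_via_lexicon("*ROOT1", lexicon, text)` returns only the citation form, whereas `search_via_root_field("*ROOT1", text)` also returns the affixed form, at the cost of storing the root once per token.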

An alternative to this approach is to follow the English glosses throughout the datasets, but this is prone to errors of over- and under-inclusion. Many times, the most appropriate English gloss will change with different derivational forms, and so instances of a root that exist in a text could be missed. Similarly, there are times when a single English gloss covers a number of roots, and these would have to be sorted out in some other way by the search program. Given the complexity of the relationships that might be involved, this seems unlikely to be the way to go. The programming involved would require fairly detailed knowledge of the language involved, and perhaps even a recreation of the evolution of the language from the proto-language. While this would be a worthwhile thing to have, it is unrealistic to think that it would be ready at the same time that the annotated texts could be made ready. Thus it would seem wisest to allow for a layer within any of the data types, including especially those with associated audio signals, that represents the putative proto-Algonquian root directly.

Not everyone who wants to use these texts will be interested in this information, so it is essential that its display and searching be optional. This is likely to be true of virtually any stream within the data types. For example, it may be useful to have a schematic representation of the audio waveform with the associated transcription and translation of a text. For printing purposes, however, it should be possible to list only the text, only the translation, or just the two text versions, without having to print out the pictures of the waveforms as well. Thus the present proposal does not seem to require anything extraordinary in the presentation of the data.
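
A minimal Python sketch of such optional display, assuming each record is a simple mapping from stream names to content. The stream names (`text`, `translation`, `root`, `waveform`) and the sample values are hypothetical, not a proposed standard.

```python
def render(records, show=("text", "translation")):
    """Format only the requested streams of each record, in the order given."""
    lines = []
    for rec in records:
        for stream in show:
            if stream in rec:
                lines.append(f"{stream}: {rec[stream]}")
        lines.append("")  # blank line between records
    return "\n".join(lines).rstrip()


# Placeholder record; the values are labels, not real language data.
records = [
    {
        "text": "formA1 formA2",
        "translation": "gloss one",
        "root": "*ROOT1",
        "waveform": "wave01.png",
    }
]
```

With the default setting, `render(records)` omits both the root field and the waveform; passing `show=("text", "root")` turns the root field on without touching the stored data.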

The search capabilities should be straightforward as well. There should be ways of limiting a search to one of the fields of a data type, and this would simply be one more of those. The kinds of characters that will be included in this field will be no more complex than those used in other fields, and so any solutions adopted there will apply equally to the proposed field.
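
Limiting a search to one field could look like the following Python sketch. The record layout and field names are assumptions made for the example, and the matching is a plain case-sensitive substring test rather than anything a real search engine would be restricted to.

```python
def search(records, query, field=None):
    """Return indices of records containing query.

    If field is given, only that field is examined; otherwise every
    field of the record is searched.
    """
    hits = []
    for i, rec in enumerate(records):
        fields = [field] if field is not None else list(rec)
        if any(query in str(rec.get(f, "")) for f in fields):
            hits.append(i)
    return hits


# Placeholder records; forms and roots are labels, not real data.
records = [
    {"text": "formA1", "translation": "root of a plant", "root": "*ROOT1"},
    {"text": "formB1", "translation": "gloss", "root": "*ROOT2"},
]
```

In this toy data, the unrestricted query `search(records, "root")` matches the English translation of the first record, while `search(records, "ROOT1", field="root")` consults the root field alone.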

One issue that we have not made concrete decisions about is whether to try to represent morphological affixes as well as lexical roots. The morphology of the Algonquian languages is complex, and tracking the development of the affixes is one of the most intriguing aspects of work in this family. But the frequent changes that occur within some paradigms (and not others), the various ways that forms are extended or contracted in use and meaning, and the various ambiguities in some of the reconstructions make it difficult to do this work for every form. In particular, it is not always clear whether users would prefer to have just those forms indicated that are completely reliably derived from a proto-form, or if it would be more useful to have possible cases reported. Even cases where it is clear that some proto-form has been replaced with something else in a particular daughter language might be most easily traced by looking at the root itself. This is an issue that will continue to receive attention as the database grows.

The growth of the database and the changes that such growth requires in the representation raise another general issue that we hope will be addressed in the database standards. It should be possible to have the changes that are implemented be used by both the programs and the various humans involved in the process. If all decisions had to be made beforehand, there would be very little that could be done at all. Only by using the databases and seeing what is lacking or what could be improved will we ever move beyond the theoretical stage. Thus we need to ensure that there will be a way of incorporating changes in the type of data and the way it is accessed as we become more sophisticated in the use of the programs. Although this is a general problem with all databases, the distribution of the forms across datasets for the historical forms may present some unusual difficulties. For example, if evidence for a better reconstruction becomes available, we would want to put that improved version into all the datasets where it occurs. Yet the decision about which form to use should, presumably, be left up to the maintainer of each dataset. It may be that some form of automatic notification of proposed changes could be devised so that individual owners could make these decisions. Certainly, if there were a single lexicon where the forms were stored and these were then propagated throughout the texts, this problem would not arise. But, again, it seems unlikely that such a procedure would be effective in the short term.
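
One way such notification might work is sketched below in Python. This is an assumption-laden illustration, not a description of an existing mechanism: the `Dataset` record, the `pending` queue, and the maintainer labels are all hypothetical. A proposed revision is queued for every dataset that uses the old reconstruction, and each maintainer then accepts or declines it independently.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    maintainer: str
    roots: list  # reconstructions used in this dataset's root field
    pending: list = field(default_factory=list)  # undecided proposals


def propose_change(datasets, old, new):
    """Queue the proposal wherever old is used; return who was notified."""
    notified = []
    for ds in datasets:
        if old in ds.roots:
            ds.pending.append((old, new))
            notified.append(ds.maintainer)
    return notified


def decide(ds, old, new, accept):
    """Record one maintainer's decision on a pending proposal."""
    ds.pending.remove((old, new))
    if accept:
        ds.roots = [new if r == old else r for r in ds.roots]


# Placeholder datasets and reconstructions.
a = Dataset("text-01", "maintainer-a", ["*ROOT1", "*ROOT2"])
b = Dataset("text-07", "maintainer-b", ["*ROOT1"])
c = Dataset("text-09", "maintainer-c", ["*ROOT3"])
notified = propose_change([a, b, c], "*ROOT1", "*ROOT1b")
decide(a, "*ROOT1", "*ROOT1b", accept=True)
decide(b, "*ROOT1", "*ROOT1b", accept=False)
```

After these calls the first dataset carries the revised reconstruction, the second keeps the original, and the third was never contacted because it does not use the form in question.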

The algorithmic nature of the computer holds the promise that datasets such as these will make testing of suggested reconstructions almost automatic. If we choose to have the search programs do something like a reconstruction of what the string of morphemes would have been, then proposing a new form and searching for it through the database will be a potentially easy way of testing the validity of the reconstruction. Forms could be found by virtue of meaning and some of the phonological content. If the exact form does not match what the reflex indicates should have been there, we may have reason to think that the proposed reflex is wrong. As always, of course, the rules proposed for the derivation may themselves be wrong. But the thought that these processes might one day be given over to a "bot" that would search all available language data is not terribly far-fetched, and would make this aspect of linguistic research easier in some ways and more intriguing in others. The thought that the rules should be explicit enough for a computer program, for example, is assumed but often not implemented. Trying to force this issue would be of great value to historical linguistics.
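
To show what "explicit enough for a computer program" could mean, here is a toy Python sketch in which each daughter language has an ordered list of rewrite rules, a proposed proto-form is run through them, and the prediction is compared with the attested form. The rules, the proto-forms, and the attested strings are invented for the example and are not Algonquian data; real sound changes would need a far richer rule formalism than regular-expression substitution.

```python
import re


def derive(proto, rules):
    """Apply ordered (pattern, replacement) rules to a proto-form."""
    form = proto.lstrip("*")  # drop the reconstruction asterisk
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form


def check_reconstruction(proto, rules_by_language, attested):
    """Return {language: (predicted, attested, match?)} for a proposal."""
    report = {}
    for lang, form in attested.items():
        predicted = derive(proto, rules_by_language[lang])
        report[lang] = (predicted, form, predicted == form)
    return report


# Invented toy rules and forms, for illustration only.
TOY_RULES = {
    "LangA": [(r"a$", ""), (r"p", "b")],  # final a is lost, then p > b
    "LangB": [(r"t", "s")],               # t > s
}
attested = {"LangA": "bat", "LangB": "pasa"}

report_good = check_reconstruction("*pata", TOY_RULES, attested)
report_bad = check_reconstruction("*pate", TOY_RULES, attested)
```

In this toy case the proposal `*pata` predicts both attested forms, while `*pate` predicts neither. A mismatch of that kind would prompt a second look at the proposed form or, equally, at the rules themselves.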

But the value to those who are trying to recreate their languages is probably the most important aspect of ELFIALA. While the Algonquian languages are blessed with a richer historical record than many other families, the record must always fall short of what we need. Being able to integrate results from many related languages will make the work in any one of them far richer and more successful than it would have been otherwise. The extremely difficult job of putting primary material on the web would then have a much more commensurate reward: Work in many languages would be advanced by work on individual languages. All of this together will help push the field of linguistics and the work of language recovery to new heights in the new millennium.

3. REFERENCES

Goddard, I. (1979). Comparative Algonquian. In L. Campbell & M. Mithun (Eds.), The languages of Native America (pp. 70-132). Austin: The University of Texas Press.

Granberry, J. (1998). Learning Mohegan: A short grammar of modern Mohegan (2nd ed.). Uncasville: The Council of Mohegan Tribal Elders.

Littledoe, J. (1998). Wampanoag language reclamation project: First steps to healing the circle. Paper presented at the Thirtieth Algonquian Conference, Boston, MA.

Parks, D. R., Kushner, J., Hooper, W., Francis, F., Yellow Bird, D., & Ditmar, S. (1999). Documenting and maintaining Native American languages for the 21st century: The Indiana University model. In J. Reyhner, G. Cantoni, R. N. St. Clair, & E. P. Yazzie (Eds.), Revitalizing indigenous languages (pp. 59-83). Flagstaff, AZ: Northern Arizona University.