Metadata for Linguistic Documentation Archives

Gary Holton
Alaska Native Language Center
gary.holton@uaf.edu

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.


Abstract. A system of metadata for the description of resources in linguistic documentation archives is proposed. The system is based on an extension of the Dublin Core metadata initiative, which is fast becoming the standard for electronic cataloguing. Several new elements are proposed in order to adequately describe linguistic resources. A set of best practice guidelines is developed both for existing elements and proposed additional elements. The most crucial descriptor needed for describing linguistic data is a set of elements referencing the Target Language, i.e., the language described in the resource. This element must reference a standard database of language names in order to facilitate identification of documents. However, it must be sufficiently flexible and extensible to permit identification of languages which are not included in existing language name databases.


1 INTRODUCTION

This paper proposes a system of metadata for the description of language documentation resources. While the system described here should be sufficient for any linguistic resource, it is motivated by the specific ongoing need to describe linguistic documentation materials contained in the Alaska Native Language Center (ANLC) Archive. Particular attention is paid to description of first-hand documentation materials such as field notes, grammatical notes, and phonological descriptions, many of which currently exist only in written form. Existing resources are in the process of being digitized, and new digital resources continue to be acquired.

The Alaska Native Language Center houses a collection of manuscripts and publications in or about Alaska Native languages and related languages in Russia, Canada, and Greenland. The collection presently contains more than ten thousand items. While much of the material consists of original manuscripts of archival quality, the collection also includes published materials and materials existing in other archival collections, duplicated in whole or in part. The ANLC Archive thus combines both archival and library functions. An historical overview of the collection is found in Krauss & McGary 1980.

Many of the published materials which are included could not otherwise be easily identified as containing material on Alaska Native languages. It is their inclusion in the collection, and their associated metadata descriptions, which identifies them as Alaska Native language resources. For example, typical metadata descriptions of Wrangell 1839 fail to capture the important linguistic content of this published work. In fact, this publication includes the earliest recorded word list of the Tanacross language, written in Cyrillic orthography with German glosses and identified as "Copper River Kolchan". While this vocabulary occupies only three pages, its linguistic value is inestimable. And its existence can be made known only by supplying appropriate metadata descriptions which reference Tanacross.

This brings up an important point, addressed in some detail in the final section of this paper. Metadata for linguistic documentation resources require a flexible and extensible system for accurately identifying the language described in a particular resource. Improper identification of the language inhibits sharing of data by potential collaborators and may even obscure important resources.

The system proposed here builds on the Dublin Core metadata standard, proposing several additional elements necessary for the description of linguistic documentation resources. Should additional metadata descriptors be necessary for other subfields of linguistics, a union set of metadata descriptors can be developed in cooperation with other researchers.

2 PROPOSED USES FOR EXISTING DUBLIN CORE ELEMENTS

Many of the existing DC elements can be used in their existing form, provided a recommended best practice is established for their use by the linguistic documentation community. These elements are discussed in this section. Where no special use of an existing DC element is made, that element is not discussed here.

2.1 Creator

For linguistic documentation there is potential ambiguity as to whom the Creator element should refer. Semantically this could be either the linguist or the speaker. A recording or text might have the speaker as creator. But what about field notes recorded by one linguist from several (possibly unidentified) speakers? For the purposes of standardization it is recommended that the DC Creator element refer to the entity responsible for creating the resource in its final form. Additional elements can then be used to refer to the speaker and interviewer.

2.2 Format

Since many existing archival materials are still in the process of being digitized, there is a need to develop metadata which will apply equally to digital and non-digital resources. The existing MIME data types may be sufficient for describing digital resources, but non-digital resources require an additional vocabulary. A suggested minimum vocabulary would include: manuscript, reel-to-reel recording, cassette recording, DAT recording, CD recording.
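As an illustration only, the distinction between digital and non-digital Format values might be checked as in the following Python sketch. The function name classify_format, the set name, and the simplified MIME pattern are assumptions made for this example and are not part of the proposal; the non-digital terms are the suggested minimum vocabulary listed above.

```python
import re

# Suggested minimum vocabulary for non-digital resources (from the text above).
NON_DIGITAL_FORMATS = {
    "manuscript",
    "reel-to-reel recording",
    "cassette recording",
    "DAT recording",
    "CD recording",
}

# Simplified "type/subtype" pattern; an assumption, not a full MIME grammar.
MIME_PATTERN = re.compile(r"^[A-Za-z0-9][\w.+-]*/[A-Za-z0-9][\w.+-]*$")


def classify_format(value: str) -> str:
    """Return 'non-digital', 'digital', or 'unrecognized' for a Format value."""
    if value in NON_DIGITAL_FORMATS:
        return "non-digital"
    if MIME_PATTERN.match(value):
        return "digital"
    return "unrecognized"


print(classify_format("cassette recording"))
print(classify_format("audio/x-wav"))
print(classify_format("wax cylinder"))
```

A value such as "wax cylinder" falls outside both vocabularies in this sketch, which is one way an archive could detect that the non-digital vocabulary needs to be extended.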

2.3 Identifier

For published materials, existing resource identifiers, such as ISBN, may be used to identify the resource. For internet-published materials, web addresses may serve as sufficiently unique identifiers, though some caution is advised due to the transient nature of such addresses. For unpublished resources, the Identifier element should refer to proprietary resource identifiers, preferably prefixed with a string identifying the entity responsible for assigning the proprietary identifier. The entity responsible for assigning the proprietary identifier should supply an explanation of its cataloging scheme. In most cases this entity will be the same as the location of the resource. For example, the string ANLC TC997H1998a refers to the ANLC catalog number TC997H1998a. ANLC would then be responsible for maintaining this resource identifier system.
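The prefixed identifier convention can be sketched in Python as follows. This is a minimal illustration under the assumption that the prefix and the local catalog number are separated by whitespace, as in the ANLC example; the names ResourceIdentifier and parse_identifier are hypothetical.

```python
from typing import NamedTuple, Optional


class ResourceIdentifier(NamedTuple):
    authority: Optional[str]  # assigning entity, e.g. "ANLC"; None if no prefix
    local_id: str             # identifier within that entity's catalog


def parse_identifier(value: str) -> ResourceIdentifier:
    """Split 'AUTHORITY LOCALID' on the first run of whitespace."""
    parts = value.strip().split(None, 1)
    if len(parts) == 2:
        return ResourceIdentifier(parts[0], parts[1])
    # No prefix present: keep the whole string as the local identifier.
    return ResourceIdentifier(None, parts[0] if parts else "")


ident = parse_identifier("ANLC TC997H1998a")
print(ident.authority, ident.local_id)
```

Keeping the authority separate from the local number lets a catalog look up the assigning entity's explanation of its scheme without having to interpret the number itself.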

2.4 Language

As stated in the DC element descriptions, the language element refers to the language of the intellectual content of the resource. The language which is being described will be referred to with an additional element, called Target Language. The 2-letter codes found in RFC1766 may be sufficient for identifying the Language, but it would be better practice to use the same codes employed for the Target Language element.

2.5 Source

In many cases identical (or near identical) resources will exist in multiple locations. This is particularly true of resources which are relevant to multiple collections. The DC Source element can be used to identify the original resource from which a given resource is derived.

2.6 Publisher

For archival resources the DC Publisher element can be used to identify the archive responsible for holding and maintaining the resource. This entity will in most cases be the same as the entity responsible for assigning the identifier in the DC Identifier element.

3 PROPOSED EXTENSIONS TO THE DUBLIN CORE ELEMENT SET

In order to permit effective use of the DC element set by the linguistic documentation community, several additional elements are proposed. All of these proposed elements can be repeated.

3.1 Target Language

Since the DC Language element must refer to the language of "intellectual content", a distinct element is necessary to refer to the language described in the resource. The proposed Target Language element identifies the language which is documented in the resource. The target language is identified both by a code chosen from a defined namespace, such as Ethnologue, and by a textual name, which may or may not refer to the textual name given in the namespace. In order to effectively implement the target language element using the name/value pairs of the DC format, a bundle of related elements is required, as described in the appendix. Recommended best practice for the use of this element is discussed at length in the following section.

3.2 Target Dialect

In many cases the Target Language element will not sufficiently identify the speech variety represented in a particular resource. In these cases the Target Dialect element can be used to further differentiate the speech variety. The Target Dialect element is structurally parallel to the Target Language element except that it does not employ a namespace for unique identification. Since the identification of dialects represents a degree of specialized knowledge, recommended best practices for naming dialects should be established by sub-communities specializing in particular language areas.

3.3 Speaker

The Speaker element identifies the person responsible for producing the target language in the resource. In the case of a spoken text or recording this is the person who speaks. In the case of field notes this element refers to the consultant who is being interviewed.

3.4 Interviewer

The Interviewer element refers to the person responsible for collecting the data contained in the resource. In many cases this person will be the same entity as that referred to by the Creator element. However, there is often a need to distinguish between the Creator and the Interviewer. For example, a linguist may provide transcription and interlinear glosses for a text which was collected by another person from a speaker. In this case the Creator, Interviewer, and Speaker would all refer to different entities. As another example, a single investigator may oversee collection of recordings by several different recorders. In this case the principal investigator would be the Creator for each resource while the recorder of each resource would be the Interviewer.

3.5 Genre

The Genre element describes the form of the interaction recorded in the resource, such as conversation, interview, personal narrative, folk tale, etc. Recommended best practice is to use a limited vocabulary of genre types, such as that proposed by Lev Michael for the AILLA project.

4 RECOMMENDED BEST PRACTICE FOR LANGUAGE IDENTIFICATION

One of the most significant problems for linguistic documentation is the need for a consistent and reliable method of language identification for the purposes of assigning a value to the Target Language element. In order to make use of a given resource a researcher needs to be able to easily determine which language the data in that resource are derived from. Unfortunately, three types of problems complicate this issue: synonymy, language change, and the lack of clear defining criteria for distinguishing languages.

I discuss each of these problems below.

4.1 Synonymy

The first problem arises because not everyone uses the same name to refer to a language. For example, the names Kutchin and Gwich'in clearly refer to the same language spoken in northeast Alaska and northwest Canada. This type of issue can be readily addressed by simply selecting a standard label and providing a list of language name synonyms. However, in other cases the issue is not so clear. Consider the "Copper River Kolchan" vocabulary mentioned in the introduction above. Experts in the field have identified the lexical items in this resource as likely being of Tanacross origin, but this represents a secondary analysis. We have no ethnographic information about the speaker from whom the data was collected. So it remains possible that this data may refer to a distinct heretofore undocumented language. The equation of "Copper River Kolchan" with Tanacross is not nearly as easy to support as the equation of Gwich'in with Kutchin. By equating "Copper River Kolchan" with Tanacross we may be discarding important linguistic information. Any system of language naming should at least retain a reference to the original "Copper River Kolchan" label.
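The difference between a secure synonym and a tentative equation can be sketched in Python as follows. The table SYNONYMS, the class LanguageLabel, and the function resolve are hypothetical names introduced for this illustration; the only data used are the two examples discussed above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table: variant label -> standard label.
SYNONYMS = {"Kutchin": "Gwich'in"}


@dataclass(frozen=True)
class LanguageLabel:
    standard: str    # label used for indexing and search
    original: str    # label exactly as given in the resource
    tentative: bool  # True when the equation rests on secondary analysis


def resolve(original: str, proposed: Optional[str] = None) -> LanguageLabel:
    """Map a label to a standard name without discarding the original."""
    if original in SYNONYMS:
        return LanguageLabel(SYNONYMS[original], original, False)
    if proposed is not None:
        # An expert identification, recorded as tentative.
        return LanguageLabel(proposed, original, True)
    # No known equation: the original label stands on its own.
    return LanguageLabel(original, original, False)


print(resolve("Kutchin"))
print(resolve("Copper River Kolchan", proposed="Tanacross"))
```

In both cases the original label is retained, so a later reanalysis of the "Copper River Kolchan" data does not require returning to the source document to recover it.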

4.2 Language Change

The second problem affecting language identification arises because languages change. At some point we have to decide whether to include older and newer varieties of the "same" language under the same language name. For example, are Modern English and Old English both varieties of the same language? In this case we are likely to treat them as distinct languages, because we are familiar with these different labels. The answer is less clear for unfamiliar languages. Should we equate Biblical Gwich'in (known as Dagoo) with Modern Gwich'in? Old Javanese with Modern Javanese? Clearly some arbitrary lines will need to be drawn and allowances made for distinct language names for various historical periods of a single genetic language. These distinctions will have to be made by the relevant sub-communities rather than by a central naming authority.

Related to this problem is the issue of historical reconstruction. While a proto-language is technically a theoretical construct rather than an actual language, resources relating to proto-languages, such as lexicons and grammatical description, can be extremely useful to descriptive linguists working with daughter languages and could thus arguably be included in the list of language names. Since the progress of reconstruction is ongoing, the responsibility for devising names for proto-languages should be left to the relevant sub-communities rather than to a central naming authority.

4.3 Lack of Clear Defining Criteria

The third problem affecting language identification arises because of the lack of clear defining criteria for distinguishing languages. It is one of the ironies of linguistics that the primary object of linguistic study is so hard to define. There exist many competing sets of criterial definitions for delimiting language, including the widely-cited concept of mutual intelligibility. The problem with this and any other method of delimiting languages is that the act of defining a language represents a secondary level of analysis, rather than a primary, observable fact. Different researchers may come to different conclusions about whether or not two speech varieties represent the same language. This means that any global system for defining language names will fail to be entirely satisfactory.

While there are a number of generally accepted criteria for defining languages, in practice these are very difficult to operationalize. Consider for example the widely-accepted criterion of "mutual intelligibility". By this criterion two speech varieties are said to be distinct languages if a speaker of one variety cannot communicate with a speaker of the other variety without considerable practice. In theory this criterion is very attractive; in practice it is very difficult to apply. The difficulty arises from two sources. First, there is no established methodology for determining mutual intelligibility, so assessments of mutual intelligibility are ultimately subjective. The second, related difficulty arises from the phenomenon of bilingualism. It is not at all clear how one is to distinguish between mutual intelligibility and bilingualism. Many, if not most, speech communities are inherently multilingual, in that most of the members speak more than one language variety. For such communities the criterion of mutual intelligibility fails to differentiate the different speech varieties. Indeed, for such communities the criterion is inapplicable.

To illustrate these problems, we can consider some examples from the well-known classification of languages found in the Ethnologue (Grimes 1996). By singling out the Ethnologue here I intend no criticism of this work per se; rather, I merely wish to illustrate some difficulties inherent in all language enumeration schemes. Due to its familiarity within the language documentation community the Ethnologue has also been proposed as a standard vocabulary of language names by many authors (cf. Constable & Simons 2000).

First consider the definition of mutual intelligibility. In theory the determination of mutual intelligibility should rely on an assessment of comprehension between speakers of different speech varieties. In practice the determination is often based on cultural rather than linguistic features. This appears to be the case for the various speech varieties found on and around the island of Halmahera in eastern Indonesia. Several of these speech varieties are clearly members of the Papuan family, differing sharply from neighboring Austronesian languages. The most rigorous attempt at linguistic classification for the area is found in Voorhoeve (1988). Voorhoeve operationalizes the distinction between language and dialect based on a shared percentage of cognates on a one-hundred-item basic vocabulary list. The criterion of shared cognation percentage should of course not be considered equivalent to mutual intelligibility, but it does serve as a convenient cross-check. But for the Papuan languages of Halmahera, the two criteria appear to yield radically differing results.

The Ethnologue lists sixteen distinct Papuan languages in Halmahera where Voorhoeve lists only four. Some of this difference in numbers of languages can be accounted for in the difference between the defining criteria. However, at least some of the difference must be attributed to arbitrariness in the assessment of mutual intelligibility. For example, the Ethnologue lists Sahu and Waioli as distinct languages even though Voorhoeve records a shared cognation percentage of 98%. In this case it seems that the Ethnologue distinction must reflect a distinction between tribal group affiliation rather than a lack of mutual intelligibility. This notion is borne out by the labels Tobelo and Tugutil, also listed as separate languages in Ethnologue. According to local parlance, the term Tugutil refers to a group of unassimilated "forest people" who maintain a hunter-gatherer lifestyle in the mountainous interior of Halmahera. The term Tobelo is reserved for the settled villages of coastal Halmahera. However, in my own fieldwork in the area I have never observed Tobelo and Tugutil speakers having any difficulty communicating (see also Duncan 1998).

The second problem in determining mutual intelligibility can be illustrated by the Athabaskan languages of eastern Alaska. This region was historically populated by several small nomadic bands with a high degree of contact between the bands. While these bands are now settled in villages, contact between bands and concomitant bilingualism has been maintained. A speaker from the village of Tanacross can converse without much difficulty with a speaker from the neighboring village of Northway. Perhaps for this reason the Ethnologue lists both Tanacross and Northway under the single language label of Upper Tanana. However, most speakers from Tanacross have been exposed to speakers from Northway since birth. Yet in spite of this "mutual intelligibility", Tanacross speakers readily identify Northway speakers as speaking a different language. The speakers' assessments in this case agree with the commonly-accepted categorization of the languages based on linguistic criteria. For example, Tanacross has high tone where Northway has low tone, e.g., Tanacross k'étmàh vs. Northway k'àtbáh 'ptarmigan'. Real linguistic differences persist in spite of pervasive bilingualism.

4.4 Need for Extensibility

Linguists will of course continue to argue the merits of any proposed vocabulary of language names. Therefore, such a system must strive not for theoretical soundness but rather for maximum flexibility. A vocabulary of language names must be extensible in order to accommodate the needs of field workers who will provide linguistic data. Returning to the examples above, linguistic data from Northeast Halmaheran and Tanacross needs to be identified in a way which does not compromise its legitimacy.

On the other hand, the need for legitimate language identification must be balanced against the need for standardization of language names. A compromise can be achieved by allowing both a standard identifier from an established namespace such as Ethnologue and a readable name which is not restricted to a specified vocabulary. A similar structure is found in the ISLE proposal for the language subelement. We propose a modified structure as follows. (Note that this definition assumes a hierarchical metadata structure such as that embodied in the ISLE proposal but can also be readily modified to follow a DC element structure, as is done in the appendix below.)

Language         Language elements
  Id             Standard language identifier, e.g., Ethnologue code or ISO 639-2 code
  Namespace      Namespace authority
  Name           User-defined readable name
  Description    Elaborate description of language
  Source         Reference to source identifying this language name

Here the Id element can be chosen from any established namespace authority, such as the Ethnologue or ISO 639-2. The authority used is identified by the Namespace tag. The Name tag provides a space for a full readable and understandable language name. Recommended best practice is to use the primary name associated with the Id by the namespace authority. Thus, if using the Ethnologue namespace to identify Kutchin, the code KUC would be used with the name Gwich'in, which is the first name listed in the description of this code.

However, where the namespace is over-differentiated, deviations from best practice allow a user-defined name to be supplied for the Name tag. For example, one might use Northeast Halmaheran for the Name tag and choose an Id value from amongst the Ethnologue codes associated with Northeast Halmaheran. Similarly, where the namespace is under-differentiated, the Name tag can provide finer distinctions. The Ethnologue lists Tanacross as a dialect of Upper Tanana, so the Upper Tanana Id (TAU) can be used together with the Name value Tanacross. These choices can be elaborated in the Description tag and the Source tag.

This system can also handle the problem of "unidentified" languages such as the Copper River Kolchan, as well as historical reconstructions, i.e., proto-languages. This can be achieved simply by allowing the Id tag to be optional. In other words, where no appropriate Id code exists in any namespace authority, the language can be identified solely by the Name tag. This deviation from a standardized vocabulary may seem to invite chaos, but it is likely that individual sub-communities will establish informal best practice recommendations for these names.
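The Language structure with an optional Id might be modelled as in the following Python sketch. The class name TargetLanguage and its validation rules are assumptions made for illustration; in particular, the rule that an Id must be accompanied by a Namespace is one reading of the proposal rather than a stated requirement. The example values (KUC, TAU, Copper River Kolchan) are those discussed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetLanguage:
    name: str                          # user-defined readable name (required)
    id: Optional[str] = None           # code from a namespace authority (optional)
    namespace: Optional[str] = None    # authority the code is drawn from
    description: Optional[str] = None  # elaboration of the naming choice
    source: Optional[str] = None       # reference identifying this language name

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("Name is required")
        # Assumption for this sketch: a code is meaningless without its authority.
        if self.id is not None and self.namespace is None:
            raise ValueError("an Id requires a Namespace")


# Best practice: primary name associated with the code.
gwichin = TargetLanguage(name="Gwich'in", id="KUC", namespace="Ethnologue")

# Under-differentiated namespace: the Name supplies the finer distinction.
tanacross = TargetLanguage(
    name="Tanacross",
    id="TAU",
    namespace="Ethnologue",
    description="Listed in the Ethnologue as a dialect of Upper Tanana",
)

# No appropriate code exists: identified solely by Name.
kolchan = TargetLanguage(name="Copper River Kolchan", source="Wrangell 1839")

print(gwichin)
print(tanacross)
print(kolchan)
```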

5 OUTLOOK

The development of an electronic infrastructure for the sharing of linguistic data will greatly enhance cooperative research efforts in language documentation and description. In order to accomplish this task, not only must we develop effective means for cataloguing and storing new data, we must also provide a consistent mechanism for cataloguing existing collections of linguistic data. This latter step is necessary both in order to avoid duplication of descriptive effort and in order to facilitate collaboration between language workers. The system proposed here represents a preliminary step toward that goal. A system of metadata based on the DC initiative was chosen both because of the simplicity of the name/value pair format and the increasingly wide acceptance of the DC standard. However, the DC format has several drawbacks, particularly due to its flat non-hierarchical structure. Further work to refine the current proposal should consider both the strengths and weaknesses of this approach.

APPENDIX: ELEMENT DESCRIPTIONS

Element: Target Language ID
Name: Target Language ID
Identifier: Target Language ID
Definition: Code referring to the language name.
Comment: Recommended best practice is to use a code from a recognized namespace authority. See Target Language Namespace.

Element: Target Language Namespace
Name: Target Language Namespace
Identifier: Target Language Namespace
Definition: Reference to the namespace authority from which the language identification code is chosen.
Comment: Examples are ISO 639-2 and Ethnologue.

Element: Target Language Name
Name: Target Language Name
Identifier: Target Language Name
Definition: Full name of the target language.
Comment: Recommended best practice is to use the full name associated with the language code by the namespace authority. Where the namespace authority does not sufficiently identify the language, this element can be used for clarification.

Element: Target Language Description
Name: Target Language Description
Identifier: Target Language Description
Definition: Full description of the language name.
Comment:

Element: Target Language Source
Name: Target Language Source
Identifier: Target Language Source
Definition: Name of the source which identifies the name of the language.
Comment: Recommended best practice is to use the name of the namespace authority. Where a different authority is used to justify the chosen language name, a reference to the resource which identifies the name should be used.

Element: Target Dialect
Name: Target Dialect
Identifier: Target Dialect
Definition: Name of the dialect described in the resource.
Comment:

Element: Speaker
Name: Speaker
Identifier: Speaker
Definition: Name of the entity or person speaking in the resource or upon whose speech the resource is based.
Comment:

Element: Interviewer
Name: Interviewer
Identifier: Interviewer
Definition: Name of the entity conducting the interview or recording session.
Comment:

Element: Genre
Name: Genre
Identifier: Genre
Definition: Description of the communicative genre represented by the resource.
Comment: Recommended best practice is to use a restricted vocabulary of genre types.
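
The flattening of a resource description into DC-style name/value pairs can be sketched in Python as follows. The element names are those defined in this appendix; the function to_pairs, the dictionary layout, and the sample record are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple, Union

# Element names as defined in the appendix, in the order listed there.
ELEMENTS = [
    "Target Language ID",
    "Target Language Namespace",
    "Target Language Name",
    "Target Language Description",
    "Target Language Source",
    "Target Dialect",
    "Speaker",
    "Interviewer",
    "Genre",
]


def to_pairs(record: Dict[str, Union[str, List[str]]]) -> List[Tuple[str, str]]:
    """Flatten a record into (name, value) pairs; list values repeat the element."""
    pairs = []
    for element in ELEMENTS:
        values = record.get(element, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            pairs.append((element, value))
    return pairs


record = {
    "Target Language ID": "TAU",
    "Target Language Namespace": "Ethnologue",
    "Target Language Name": "Tanacross",
    "Genre": ["interview", "personal narrative"],
}
pairs = to_pairs(record)
for name, value in pairs:
    print(f"{name}: {value}")
```

Note that once a resource documents more than one target language, a flat list of pairs no longer records which ID belongs with which Name; this is one concrete form of the non-hierarchical drawback noted in the Outlook.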

BIBLIOGRAPHY

Constable, Peter & Gary Simons. 2000. Language identification and IT: Addressing problems of linguistic diversity on a global scale. SIL Electronic Working Papers 2000-001. [http://www.silewp.org/2000/001/]

Dublin Core Metadata Initiative. 1999. Dublin Core Metadata Element Set, Version 1.1: Reference Description. [http://purl.org/dc/documents/rec-dces-19990702.htm]

Duncan, Christopher. 1998. Ethnic identity, Christian conversion and resettlement among the Forest Tobelo of northeastern Halmahera, Indonesia. Ph.D. dissertation, Yale University.

Grimes, Barbara. 1996. Ethnologue, 13th edition. Dallas: Summer Institute of Linguistics. [http://www.sil.org/Ethnologue]

ISLE. 2000. ISLE Meta Data Elements for Session Descriptions Proposal. [http://www.mpi.nl/ISLE/documents/draft/ISLE_Metadata_2.0.pdf]

Krauss, Michael E. & Mary Jane McGary. 1980. Alaska Native Languages: A Bibliographical Catalogue: Part One: Indian Languages. (Alaska Native Language Center Research Paper 3). Fairbanks: Alaska Native Language Center.

Voorhoeve, C. L. 1988. The languages of the northern Halmaheran stock. Papers in New Guinea Linguistics, no. 26, 181-209. (Pacific Linguistics A-76). Canberra: Australian National University.

Wittenburg, P., D. Broeder & B. Sloman. 2000. EAGLES/ISLE: A proposal for a meta description standard for language resources. LREC Workshop, Athens.

Wrangell, Ferdinand Petrovich von. 1839. Statistische und ethnographische Nachrichten über die Russischen Besitzungen an der Nordwestküste von Amerika, ed. by K. G. von Baer & Gr. von Helmersen, 101-03, 259. Osnabrück: Biblio Verlag.