RFC: REQUIREMENTS ON THE INFRASTRUCTURE FOR
DIGITAL LANGUAGE DOCUMENTATION AND DESCRIPTION

Gary Simons and Steven Bird
Draft: 14 November 2000


About this document. This document has been prepared in conjunction with the workshop on Web-Based Language Documentation and Description, to be held in Philadelphia on 12-15 December 2000. It is a request for comment (RFC), and it is being circulated to workshop participants in advance of the meeting. The requirements it contains will be discussed and revised by the working groups at the meeting.


Table of Contents

1. INTRODUCTION
2. REQUIREMENTS
  2.1. User Requirements
  2.2. Creator Requirements
  2.3. Archivist Requirements
  2.4. Developer Requirements
  2.5. Sponsor Requirements
3. REQUEST FOR COMMENTS


1. INTRODUCTION

Recent years have witnessed dramatic advances in digital storage and digital publication technologies, making it possible to house virtually unlimited quantities of linguistic data online, and to disseminate this data in digital form on CD-ROM/DVD and over the web at negligible cost. The development of XML and Unicode greatly facilitates the interchange and reuse of structured multimodal and multilingual data and the development of interoperating software tools. These developments are having a pervasive influence on the way primary linguistic data are gathered, stored, analyzed and disseminated, as part of projects to document and describe languages, and they present major new challenges for modeling, creating, archiving and accessing this data.

So far, these challenges are being addressed by the "language documentation community" in a fragmentary manner. Given the scarcity of resources and the scale of the challenges, the best approach would seem to be one in which the whole community collaborated on designing and constructing a shared infrastructure.

In moving towards this shared infrastructure, we envision three stages of increasingly substantive agreement on the nature of the infrastructure:

  1. Requirements. What properties would the ideal digital infrastructure possess?
  2. State of the Art. What is the current state of the art, and how well does it meet these requirements?
  3. Best Practices. In light of the above, what are the recommended best practices for modeling, creating, archiving and accessing language documentation?

The present document focuses on the first of these, the requirements. We can identify at least five special interest groups that would want to levy requirements on the enterprise:

  users: The people who want to access language materials which have been stored away in archives.
  creators: The people who create and archive language materials.
  archivists: The people who manage the process of acquiring, maintaining, and accessing the information resources stored in archives.
  developers: The people who create data models, tools and formats for storing and manipulating digital language documentation.
  sponsors: The organizations that fund the creation of information resources and their maintenance in archives.

In this document we attempt to enumerate the requirements for each of these groups with respect to the total infrastructure required to support digital language documentation and description.

We have stated these requirements at a high level. Each requirement could itself be expanded into a set of more detailed requirements; however, this can be left to a later stage. Note that we provide both positive and negative versions of the requirements.

During the Philadelphia meeting, participants will be involved in working groups which will discuss and revise these requirements. We hope to come out of the meeting with a requirements document which represents the consensus of the community.

2. REQUIREMENTS

2.1. User Requirements

While online language archives hold the promise of unparalleled access to information, they also present the specter of unparalleled chaos as information resources pop up in every corner of the world-wide web. The following statements describe the state of the world that users of online language archives would like to see, as opposed to the contrasting chaotic state that might be more likely to be realized in the absence of deliberate efforts to prevent it.
  What users want / What users don't want

1. Want: There is a single site on the Web where any user can go to discover what language information resources are available, regardless of where they may be archived.
   Don't want: The only way to discover language resources on the Web is to visit all the individual archives or to hope that the resources one is interested in have been indexed in an intuitive way by one's favorite general-purpose search engine.

2. Want: All language resources (regardless of where they may be archived) are catalogued with a consistent set of metadata descriptions, so that the user can ascertain all the basic facts about a resource without having to download it.
   Don't want: The only way to get a good idea about what a resource contains, who is responsible for it, or what its terms of availability are is to retrieve it.

3. Want: Uniform metadata descriptions can be used to perform focussed searching of language resources by metadata categories regardless of where on the Web they may actually be archived.
   Don't want: There is no way to reliably search on metadata categories for language resources since there is no standardized framework for describing resources.

4. Want: All language resources (regardless of where they may be archived) are tagged in a consistent way to identify the languages they relate to, so that a single search for a particular language will retrieve all relevant resources on the Web.
   Don't want: The only way to find resources in or about a particular language is to depend on keyword searching. This will fail when the language a resource is in is not identified by a keyword, or when different submitters supply different names for the same language.

5. Want: When a user discovers the existence of a resource, full information is available on how to obtain the resource, and on any restrictions concerning the use of the resource. The resource can be obtained in a timely fashion.
   Don't want: The user requests the resource, and discovers after some delay that the owner of the resource is not prepared for this particular user to have the resource, or places previously undisclosed restrictions on the use of the resource.

6. Want: The archived information is self-documenting with respect to how it is electronically encoded.
   Don't want: There is no obvious way to find out what the encoded characters or the markup tags in a resource represent.

7. Want: When users obtain a resource, they see the same thing that the submitter originally saw.
   Don't want: The user cannot properly view the resource for lack of fonts, stylesheets, or the right rendering software.

8. Want: For any given resource, it is possible to find the software tools that are appropriate for querying it or for converting it to another format.
   Don't want: Users cannot do anything with resources they download since they are not in a format they can use.

9. Want: For any given resource, there is a unique and durable method for citing it.
   Don't want: Users may refer to a resource in a variety of ways, and none of these references is guaranteed to work indefinitely.
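To make user requirements 2 through 4 more concrete, the following is a minimal sketch of a union catalog in which every record carries the same metadata fields and a language code drawn from one agreed list. It is an illustration under stated assumptions, not a proposed design: the field names, the identifiers, the archive names, and the small name-to-code table are all hypothetical, and the three-letter codes merely stand in for whatever code list the community might adopt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One catalog entry. All field names are hypothetical."""
    identifier: str     # archive-independent identifier, usable for citation
    title: str
    language_code: str  # a code from an agreed list, not a free-text name
    archive: str        # where the resource is actually held
    access: str         # terms of availability, visible before download


# Hypothetical table mapping the names submitters might use to one code.
LANGUAGE_CODES = {
    "german": "deu",
    "deutsch": "deu",
    "welsh": "cym",
    "cymraeg": "cym",
}


def language_code(name: str) -> str:
    """Return the agreed code for a language name, ignoring case and padding."""
    return LANGUAGE_CODES[name.strip().lower()]


def search_by_language(catalog: list[Record], name: str) -> list[Record]:
    """Find every record for a language, whichever archive holds it."""
    code = language_code(name)
    return [record for record in catalog if record.language_code == code]


# Invented records from two invented archives.
catalog = [
    Record("a:1", "Wordlist", "deu", "Archive A", "open"),
    Record("b:7", "Texts", "cym", "Archive B", "restricted"),
    Record("b:9", "Grammar", "deu", "Archive B", "open"),
]

hits = search_by_language(catalog, " Deutsch ")
print([(r.identifier, r.archive, r.access) for r in hits])
```

Because the search runs on the code rather than on the name a submitter happened to type, a query for "Deutsch" and a query for "German" retrieve the same records from both archives, and the terms of access can be read from the record without retrieving the resource.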

2.2. Creator Requirements

While inexpensive and powerful computers have dramatically lowered the bar for would-be creators of digital language documentation, only limited software support is presently available for the special needs of this user community. The following statements describe the state of the world that the creators of digital language documentation would ideally like to see, in contrast to the frustrating state which is all too familiar to this community.
  What creators of language documentation want / What creators of language documentation don't want

1. Want: For each of the descriptive and analytical practices widely used in language documentation, inexpensive software is available which provides a suitable user interface and which ensures data integrity.
   Don't want: Unsuitable general-purpose software tools, such as word processors and spreadsheets, must be purchased at a significant cost. There is no built-in support for the particular task and data consistency must be checked manually.

2. Want: The language documentation created with the software is in a form suitable for immediate archiving, so that archiving digital data is essentially no more difficult than making a particular kind of backup.
   Don't want: Non-trivial reformatting is required before the language documentation can be archived, with the result that users often don't get around to archiving their work.

3. Want: Incomplete or uncertain information representing the state of the investigator's current knowledge about the language under study can be archived.
   Don't want: Language documentation can only be archived when it meets certain quality requirements.

4. Want: It is straightforward to reuse existing language documentation in the process of generating new documentation and new descriptive works. Conversion utilities support the import of data created by non-conformant software.
   Don't want: Creators of language data have to reformat and restructure the existing archived language documentation before they can use it in creating new documentation.

5. Want: Extensive up-to-date guidelines and recommendations on hardware, software, and formats are available, in an easy-to-use cookbook format.
   Don't want: There is a bewildering array of options available to would-be creators of digital language documentation, and there is minimal advice on the choices and combinations which are suitable for a given data gathering situation and budget.

6. Want: The creator of the language documentation has moral obligations to the speakers and/or the language community. When archiving the data, s/he can select from a range of contracts with the digital archive which establish access and usage rights in perpetuity.
   Don't want: The descriptivist is unable to constrain access or usage once the material is archived, and a subsequent undesired use has resulted in the descriptivist being denied access to the language community.

2.3. Archivist Requirements

While digital archiving holds the promise to archivists of offering unparalleled access to the information they curate, the range of issues involved in doing the job well is so wide, and for many archivists so technical, that it is virtually impossible for any one archive to master them all. The following statements describe the state of the world that archivists would like to see, as opposed to the contrasting state that might be more likely to be realized in the absence of deliberate efforts to prevent it.
  What digital language archivists want / What digital language archivists don't want

1. Want: There is a set of best-practice guidelines in use by the language archiving community that can be followed to ensure that the digital data stored in the archive will be encoded in such a way as to be maximally useful.
   Don't want: Each archive must study the issues of character encoding, data markup, and file formats and develop its own standards.

2. Want: There is a repository where the archivist can find an off-the-shelf software system for implementing the archive catalog. Such a system would allow for entering and maintaining catalog records, supporting a public-access catalog viewer on the Internet, and sharing of metadata records with services that provide union catalogs of multiple archives.
   Don't want: Every archive is on its own to find or develop the software needed to build and disseminate its catalog.

3. Want: There is a repository of off-the-shelf software tools that archivists can use to test items they accession for conformance to the best-practice encoding guidelines.
   Don't want: Each archive must hunt for tools that will help it maintain encoding standards in its collection. Failing that, it must either develop its own tools or do without.

4. Want: There is a standard that all archives can follow to devise and maintain a public identifier for each archived item that is guaranteed to be globally unique and globally persistent (even when the URL for where an item is stored may change).
   Don't want: Every archive must work out its own system for building unique identifiers and for guaranteeing persistence. The resulting identifier has meaning only within the context of the particular archive, but not on a global scale.

5. Want: There are community-wide best-practice guidelines all archives can follow in matters of ensuring informed consent, upholding intellectual property rights, and addressing other ethical issues, as these relate to contributing researchers, language communities, funding agencies and institutional review boards.
   Don't want: Each archive must study these issues on its own and develop its own guidelines. In some cases, institutional paranoia prevents any dissemination of digital holdings.

6. Want: There is a straightforward way to provide finding aids to local and remote users, and to other archives, by exporting metadata records from an in-house database to an external, widely used format.
   Don't want: Maintaining finding aids in a plethora of formats overburdens the resources of the institution.

7. Want: Linguists can deliver digital data and documents in a form that is immediately archivable and disseminable.
   Don't want: New materials require considerable manual processing before they are in a format that will be accessible in future, once current versions of the software used to create the data no longer exist.

8. Want: Technical support in the form of appropriate software and documentation is available for converting old archive holdings into digital form. The solution is economical, making it easier for archives to bid for funds to digitize their holdings.
   Don't want: Digitizing any resource requires new one-off solutions to be developed. Pitfalls with new equipment lead to costly delays and possible abandonment of the process.

9. Want: Archives can get useful and timely feedback and evaluation from remote users concerning archive services.
   Don't want: Archives learn indirectly that remote users are dissatisfied with the archive services, and are unable to respond effectively.

10. Want: Tools exist to convert parochial 8-bit character codings to Unicode, and to convert markup into the best practice formats.
    Don't want: Legacy data requires expensive and time-consuming manual processing.
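Archivist requirement 10 asks for tools that convert parochial 8-bit character codings to Unicode. As a rough sketch of what the core of such a tool might look like, the function below decodes each byte with a standard base encoding unless a project-specific override table says otherwise. The function name, the table name, and the byte assignments in the table are invented for illustration; a real legacy font would need its own table, established by inspecting the font.

```python
def to_unicode(data: bytes, overrides: dict[int, str], base: str = "latin-1") -> str:
    """Decode 8-bit data byte by byte.

    Bytes listed in `overrides` are replaced by the given Unicode string;
    all other bytes are decoded with the `base` encoding. The default,
    latin-1, defines every byte value, so decoding cannot fail.
    """
    return "".join(
        overrides.get(byte, bytes([byte]).decode(base)) for byte in data
    )


# Hypothetical override table for an imaginary field-work font that
# reused two upper-range byte values for phonetic symbols.
FIELD_FONT_OVERRIDES = {
    0xF1: "\u014b",  # LATIN SMALL LETTER ENG
    0xE5: "\u025b",  # LATIN SMALL LETTER OPEN E
}

legacy = b"a\xf1a \xe5 caf\xe9"
text = to_unicode(legacy, FIELD_FONT_OVERRIDES)
utf8_bytes = text.encode("utf-8")  # a form suitable for archiving
```

Keeping the override table as data, separate from the conversion logic, means the table itself can be archived alongside the converted files as documentation of how the legacy encoding was interpreted.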

(Questions: Does a distinction need to be drawn between digital holdings and digital surrogates of non-digital holdings, for the purposes of the infrastructure? Are there any requirements concerning other metadata standards, like the Encoded Archival Description (EAD)?)

2.4. Developer Requirements

Developers often work in relative isolation from the wider community, serving the needs of a particular descriptivist. While the descriptivist understands the linguistic domain, s/he typically has only limited understanding of data modeling and software development. The specification for the software tool may be too vague, or else too specific, thereby limiting the software in unnecessary ways.
  What developers want / What developers don't want

1. Want: Explicit, widely-accepted data models exist for the different types of documentation.
   Don't want: Each programming task requires a developer to investigate the full range of cases likely to be encountered, depending on a descriptivist who does not know which aspects of representation are likely to cause problems for a computational model.

2. Want: Re-usable low-level components exist for standard kinds of media display and data creation. Public application programming interfaces (APIs) make it easy to develop tools on top of these components.
   Don't want: Every component of the system must be assembled from scratch.

3. Want: Standard formats exist for data storage and interchange, and come with standard APIs.
   Don't want: For each data type it is necessary to craft a new format. For each pair of programs needing to exchange data it is necessary to create a format conversion tool.

4. Want: Data models exist for representing information which is partial, uncertain, evolving, or disputed.
   Don't want: Kludges and workarounds are necessary in order to represent incomplete data.

5. Want: Suitable web delivery mechanisms exist, including stylesheets, conversion tools, indexing methods, query languages, streaming media technologies, and so on, and these encompass the needs for security and privacy.
   Don't want: Language materials cannot be disseminated on the web, since the delivery and rendering methods are non-existent or ineffective.

6. Want: Developers can easily discover and obtain any existing data models, formats and tools that support a particular kind of language documentation.
   Don't want: The state of the art is undocumented, and developers waste resources reinventing pieces of the common infrastructure.
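Developer requirement 4 (and creator requirement 3) concerns information that is partial, uncertain, evolving, or disputed. One possible way to avoid kludges, sketched here purely as an assumption about how such a model could look, is to attach an explicit status and source to each analytical claim instead of to the record as a whole. The class names, the status values, and the sample entry (including its headword and glosses) are all invented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    """How firmly the investigator currently holds a claim."""
    CONFIRMED = "confirmed"
    UNCERTAIN = "uncertain"
    DISPUTED = "disputed"


@dataclass
class Claim:
    """A single analytical statement with its status and source."""
    value: str
    status: Status = Status.UNCERTAIN
    source: str = ""


@dataclass
class LexicalEntry:
    """A dictionary entry whose parts may be missing or contested."""
    headword: str
    glosses: list[Claim] = field(default_factory=list)
    part_of_speech: Optional[Claim] = None  # None means "not yet analyzed"

    def confirmed_glosses(self) -> list[str]:
        """Return only the glosses the investigator regards as settled."""
        return [c.value for c in self.glosses if c.status is Status.CONFIRMED]


# An invented entry: one settled gloss, one disputed, no part of speech yet.
entry = LexicalEntry(
    headword="kana",
    glosses=[
        Claim("fish", Status.CONFIRMED, "speaker A"),
        Claim("eel", Status.DISPUTED, "speaker B"),
    ],
)
print(entry.confirmed_glosses())
```

With the status carried in the data, an incomplete entry like this one can be archived as it stands, and a tool can decide for itself whether to show only settled claims or everything the investigator has recorded.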

2.5. Sponsor Requirements

While government funding agencies and non-governmental organizations have long sponsored language documentation projects, they face new challenges as they develop priorities and programs concerning digital language documentation. Funded initiatives need to be cost effective, and the resources they create need to be disseminated and reused by the community.
  What sponsors want / What sponsors don't want

1. Want: An institution that sponsors the creation of language resources may want to host those resources on its own website.
   Don't want: Any resource that is to be made available to the public in a uniform way through a common catalog must be hosted at a single site.

2. Want: The resources a sponsor has helped to develop are being widely used by the community.
   Don't want: The funding was in vain, since the community cannot discover that the resources exist, or if they happen to find them, they cannot use them for lack of documentation, proper encoding practice, fonts, software, and the like.

3. Want: Software tools which were developed under a funded project have been thoroughly documented and distributed with an open source license, and have been successfully adapted for new projects.
   Don't want: Expensive programmer time was partially wasted, since the software generated by the project was never made available, and new funded projects had to repeat the same work.

4. Want: There is an online peer-review process for language archives and documentation projects, concerning the quality and availability of the materials.
   Don't want: It is difficult to determine the extent to which the language documentation community values and uses the materials provided by language archives and documentation projects.

3. REQUEST FOR COMMENTS

The above list of requirements is a preliminary attempt to describe the state of the art that we are hoping to see in the future. Of course, there is no end to the detailed, low-level requirements that could be listed. Here we seek to collect general, high-level requirements which will be used to inform our collective work on the digital infrastructure and our quest for best practices.

We solicit comments on this document in advance of the Philadelphia meeting, especially in the form of refinements to these requirements, or new requirements that are formulated in the same way.

Please send comments to Gary Simons and Steven Bird by Friday 1 December 2000.