Plans for a Web-based Archive
of Indigenous Languages of Latin America

Lev Michael, University of Texas, Austin

Department of Anthropology, University of Texas
University of Texas, Austin TX 78752, USA
lmichael@mail.utexas.edu


Issues surrounding the planning and design of a Web-based language data archive, the Archive of the Indigenous Languages of Latin America (AILLA), will be discussed. The goal of the AILLA project, led by Joel Sherzer and Tony Woodbury, both of the University of Texas at Austin, is to create a database of audio and text files, equipped with a Web interface, that will make unpublished or difficult-to-obtain materials from the indigenous languages of Latin America available to indigenous peoples, scholars, and students in Latin America, North America, and elsewhere. The archive will store materials drawn from the full range of linguistic behavior, from phonetics to discourse, in the form of both primary data and analyses.

The principal desiderata in planning the archive have been: 1) maximal accessibility for individuals with widely varying levels of technical expertise in the use of computers and the Internet, as well as for individuals whose systems have widely varying bandwidths and who live in areas with communications infrastructure of differing quality; and 2) file formats and a database architecture that are usable at present by the greatest number of people and that will have the longest possible life, so that the need to migrate to new formats is minimized and so that data files stored on long-lived media, such as pressed compact disks, remain accessible far into the future. These considerations have led us to opt, whenever possible, for file formats that have substantial cross-platform compatibility and free Internet-downloadable readers, or for formats for which the archive itself can supply the components needed to make them widely usable, such as a Unicode font set for all archive materials. We have also found formats in wide commercial use, such as AIF files, attractive.

The project will initially focus on obtaining audio recordings and unpublished textual materials from linguists and anthropologists who have not archived their materials. These materials will be digitized and stored in the AILLA database, after which the originals will be sent to the Indiana University Archives of Traditional Music for archiving by traditional means. There are also plans to press compact disks periodically, in order to transfer the digital contents of the archive to a long-lived storage medium.

It is anticipated that audio materials will be digitized in two formats: one suitable for direct access over the Web, such as RealAudio or MP3 audio files, and the other appropriate for uses that require higher sample rates, such as AIF files. Textual materials will be stored as Unicode-encoded text files.

As the project progresses, it is anticipated that the focus will be broadened to include published or archived materials that are rendered largely inaccessible due to the sheer obscurity of the place of publication or site of the archives.

The AILLA website will serve as the interface between users and the AILLA database. It will be designed to maximize its accessibility internationally; at the outset, the site will be available in Spanish, Portuguese, and English versions, and versions in other languages will be incorporated as feasible. Initially, two basic search capabilities are planned: 1) searches over a detailed set of classifications by which all materials deposited in the archive will be classified, such as language, discourse type (e.g. conversation, religious chant, political oratory, linguistic elicitation), language of commentary or analysis, type of analysis (e.g. phonetic or discourse-oriented), and type of transcription or translation (e.g. morpheme-by-morpheme or free); and 2) searches for character strings in text files.
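
The two planned search types can be illustrated with a minimal sketch. It is not the AILLA design: the record fields, the function names, and the sample catalogue entries below are all hypothetical, chosen only to show a search by classification alongside a search for a character string in Unicode text.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical catalogue record; the field names are illustrative only."""
    item_id: str
    language: str
    discourse_type: str
    transcription_type: str
    text: str = ""


def _norm(s):
    # Normalize to NFC and casefold so that equivalent Unicode
    # spellings of the same string compare equal.
    return unicodedata.normalize("NFC", s).casefold()


def search_by_classification(items, **criteria):
    """Return the items whose fields equal every given criterion."""
    return [it for it in items
            if all(getattr(it, key) == value for key, value in criteria.items())]


def search_text(items, needle):
    """Return the items whose text contains the given character string."""
    target = _norm(needle)
    return [it for it in items if target in _norm(it.text)]


# Made-up sample entries, not real archive holdings.
CATALOGUE = [
    Item("a1", "Language A", "religious chant", "free", "sample chant text"),
    Item("a2", "Language A", "conversation", "morpheme-by-morpheme", "sample dialogue"),
    Item("a3", "Language B", "conversation", "free", "another dialogue"),
]

if __name__ == "__main__":
    hits = search_by_classification(CATALOGUE, language="Language A",
                                    discourse_type="conversation")
    print([it.item_id for it in hits])
    print([it.item_id for it in search_text(CATALOGUE, "Dialogue")])
```

In this sketch, classification criteria are combined conjunctively, so adding a criterion can only narrow the result set.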

We are very interested, however, in incorporating more sophisticated search capabilities. We will also be interested in implementing means for coordinating large multi-modal sets of files so that, for example, the appropriate audio file can be easily accessed from a transcription of that audio recording, or from any linguistic analysis in the archive that is based on or makes reference to either data file.
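
One possible way to coordinate such sets of files is a symmetric link table that is followed transitively, so that an analysis linked only to a transcription still leads to the underlying recording. The sketch below is an assumption about how this could work, not a description of the archive; the class name, method names, and file names are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Hypothetical index of 'is based on / refers to' links between files."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, file_a, file_b):
        # Links are stored in both directions so that either file
        # can be reached from the other.
        self._links[file_a].add(file_b)
        self._links[file_b].add(file_a)

    def related(self, file_id):
        """Return every file reachable from file_id, excluding file_id itself."""
        seen = {file_id}
        stack = [file_id]
        while stack:
            current = stack.pop()
            for neighbour in self._links.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return sorted(seen - {file_id})


if __name__ == "__main__":
    index = LinkIndex()
    index.link("recording.aif", "transcription.txt")   # transcription of the recording
    index.link("transcription.txt", "analysis.txt")    # analysis based on the transcription
    print(index.related("analysis.txt"))
```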

Another significant concern for us is the set of ethical and intellectual property rights issues surrounding the materials that will form the contents of the archive. We intend to implement a system of graded access that will protect the intellectual property rights associated with the materials in the archive by allowing scholars and indigenous people to decide on access and use criteria for materials they deem sensitive.
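
A graded access check could, under simple assumptions, be an ordered comparison between the level a depositor assigns to an item and the level granted to a user. The three levels and the function below are hypothetical placeholders; the actual grades and criteria would be decided by the depositing scholars and communities.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Hypothetical access grades, ordered from least to most restricted."""
    PUBLIC = 0
    REGISTERED = 1
    DEPOSITOR_APPROVED = 2


def may_access(user_level, item_level):
    """A user may open an item only if their grade is at least the item's grade."""
    return user_level >= item_level


if __name__ == "__main__":
    # An item its depositor has marked as sensitive.
    sensitive = AccessLevel.DEPOSITOR_APPROVED
    print(may_access(AccessLevel.REGISTERED, sensitive))
    print(may_access(AccessLevel.DEPOSITOR_APPROVED, sensitive))
```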


Linguistic Exploration Workshop