WORKSHOP REPORT


Workshop on Web-Based Language Documentation and Description
and the
Open Language Archives Community

J. Albert Bickford
SIL-Mexico and University of North Dakota
albert_bickford@sil.org

The Workshop on Web-Based Language Documentation and Description (December 12-15, 2000, University of Pennsylvania) brought together linguists, archivists, software developers, publishers and funding agencies to discuss how best to publish information about language on the internet. This workshop, together with the Open Language Archives Community that is developing out of it, seems especially important for making useful information about linguistics and less commonly studied languages available to both scholars and the wide general audience found on the web. I hope that this report will be useful in understanding these new developments in linguistic publishing and archiving.

The aim of the workshop was to establish an infrastructure for electronic publishing that simultaneously addresses the needs of users (including scholars, language communities, and the general public), creators, archivists, software developers, and funding agencies. Such an infrastructure would ideally meet a number of requirements important to these different stakeholders, such as:

The workshop was organized by Steven Bird (University of Pennsylvania) and Gary Simons (SIL International).1 It included approximately 40 presentations and several working sessions on a variety of topics.

There was general agreement among the participants that a system for organizing the wealth of language-related material on the internet is needed, and that an appropriate way to establish one is by following the guidelines of the Open Archives Initiative (OAI) [http://www.openarchives.org]. (These guidelines provide a general framework for creating systems like this for specific scholarly communities.) An OAI publishing and archiving system contains the following elements:

In the case of linguistics, the system will be known as the Open Language Archives Community (OLAC). The LINGUIST List [http://www.linguistlist.org] has agreed to act as the system's primary service provider, that is, the main source that people will use to find materials through the system. Further information about OLAC can be found at [http://www.language-archives.org]. The agreement to establish OLAC is probably the most important accomplishment of the workshop.

This agreement was solidified through working sessions which met during the workshop and started the process of working through the details in various areas, such as:

These and other issues will continue to be discussed on email lists in the coming months, culminating in recommendations for "best practice" in each area, together with a preliminary launch of the whole system, hopefully within a year. (Prototypes of the system are available now at the OLAC address above, along with various planning documents.)

There were also a number of conference papers, which provided a foundation for making the working sessions productive. Rather than list or review all the presentations here, I will summarize them, since they are all available on the conference website [http://www.ldc.upenn.edu/exploration/expl2000]. The topics covered included the following:

One insight that I gleaned from these presentations was a better understanding of glossed interlinear text. Interlinear text is not a type of data, but rather just one possible way of displaying an annotated text. The annotations on a text can consist of many types of information: alternate transcriptions, morpheme glosses, word glosses, free translations, syntactic structure (and possibly several alternative tree structures for the same text), discourse structure, audio and video recordings, footnotes and commentary on various issues, etc. What ties them all together is a "timeline" that proceeds from the beginning to the end of a text, to which different types of information are anchored. Aligned interlinear glosses are one way of displaying some of this information, but not the only way, and not even the most appropriate way for some types of information. The traditional arrangement of Talmudic material, for example, with the core text in the center of the page and commentary around the edges, is another possible display of annotated text, in which the annotations are associated more with whole sentences and paragraphs than with individual morphemes. There are also some sophisticated examples available for presenting audio alongside interlinear text. (For example, check out the LACITO archive [http://lacito.archivage.vjf.cnrs.fr]!)
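The timeline idea can be sketched as a small data structure. What follows is only a minimal illustration in Python, not a format proposed at the workshop: the class names, the tier names, and the sample words and glosses are all invented for the example. It stores a text once as a sequence of units, anchors every annotation to a span of that sequence, and then produces two different displays from the same data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One piece of information anchored to a span of the timeline."""
    start: int   # index of the first timeline unit covered
    end: int     # index one past the last unit covered
    value: str


@dataclass
class AnnotatedText:
    """A text as a timeline of units plus named tiers of annotations.

    Hypothetical structure for illustration only.
    """
    units: list[str]
    tiers: dict[str, list[Annotation]] = field(default_factory=dict)

    def annotate(self, tier: str, start: int, end: int, value: str) -> None:
        if not 0 <= start < end <= len(self.units):
            raise ValueError("annotation span falls outside the timeline")
        self.tiers.setdefault(tier, []).append(Annotation(start, end, value))

    def interlinear(self, tier: str) -> str:
        """Display a tier as glosses aligned under single units."""
        glosses = [""] * len(self.units)
        for a in self.tiers.get(tier, []):
            if a.end - a.start == 1:   # only unit-sized annotations align
                glosses[a.start] = a.value
        widths = [max(len(u), len(g)) for u, g in zip(self.units, glosses)]
        top = " ".join(u.ljust(w) for u, w in zip(self.units, widths))
        bottom = " ".join(g.ljust(w) for g, w in zip(glosses, widths))
        return top.rstrip() + "\n" + bottom.rstrip()

    def commentary(self, tier: str) -> list[str]:
        """Display a tier as notes attached to whole spans of text."""
        return [
            f"[{' '.join(self.units[a.start:a.end])}] {a.value}"
            for a in self.tiers.get(tier, [])
        ]


# Invented sample data, not from any real language.
text = AnnotatedText(units=["wa", "tiku", "so"])
text.annotate("gloss", 0, 1, "the")
text.annotate("gloss", 1, 2, "dog")
text.annotate("gloss", 2, 3, "sleeps")
text.annotate("translation", 0, 3, "The dog is sleeping.")

print(text.interlinear("gloss"))
print(text.commentary("translation"))
```

The aligned display suits the word-level gloss tier, while the free translation, which spans the whole sentence, reads more naturally in the commentary-style display. Both are views of the same anchored annotations, which is the point made above.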

Throughout, it was very clear that those at the conference had a great deal in common with each other:

Finally, the conference pointed out several trends that will be increasingly important in future years.

All in all, it was a workshop that was both stimulating and practical, and one that is likely to have an unusual amount of influence in the months and years to come.




Footnotes:

1 Funding was provided by the Institute for Research in Cognitive Science (IRCS) of the University of Pennsylvania, the International Standards in Language Engineering Spoken Language Group (ISLE), and Talkbank.

2 Since XML's development has been closely associated with the World Wide Web Consortium [http://www.w3.org/XML/], it has been widely regarded as the successor to HTML for web pages. However, this is just a small part of its usefulness; it is a general-purpose system for representing the structure of information in a document or database, which can be customized for a myriad of purposes. Many software tools are currently available for creating and manipulating data in XML, with more being created all the time. One of these, Extensible Stylesheet Language Transformations (XSLT) [http://www.w3.org/TR/xslt], can do complex restructuring of XML data.

