Towards a Model for Web-based Language Documentation and Description: Some Contributions from Digital Libraries and Humanities Computing Research

Susan Hockey
School of Library, Archive and Information Studies, University College London
s.hockey@ucl.ac.uk

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.


Abstract. Creating a Web-based infrastructure for archiving language documentation and linguistic resources is a challenging task because of the role of linguistics as a metadiscipline, and the range of research and theoretical views in the discipline. It should be possible to draw on traditional models in library and archive science to describe and classify some aspects of the resources, thus providing access in the library sense, but the infrastructure could also provide access in terms of analysing and manipulating the resources. The Text Encoding Initiative header is examined in some detail as an example of a metadata structure for electronic textual material. Further parallels might also be found in the archival community, both in traditional archival practice and in archiving of electronic datasets.


1. INTRODUCTION

The proposed web-based infrastructure for collecting, storing and disseminating language documentation and description is an interesting and much needed initiative. It brings together the components of a digital library of linguistic data, and, with its aim to address the complete range of human languages, it has the potential to encompass a very wide range of activity. The infrastructure could provide a model for many other groups who work with textual material and other electronic resources and datasets, but the linguistic data community is not alone in attempting to build such an infrastructure. Where appropriate, it could also draw on methodologies developed in other disciplines, notably library and information studies, and humanities computing and data archiving.

This initiative is particularly interesting because linguistics is a metadiscipline. It impacts on almost everything that is done in our daily lives. What is developed as a result of this workshop may have implications throughout the scholarly community and beyond, and, in my view, it does need somehow to be capable of linking to other work. There is also a vast amount of other material on the Internet that could be used for linguistic analysis. Some of this is on individuals' web pages; other material is delivered by publishers as subscription services, and this includes primary source texts and, perhaps of most interest to this community, reference works such as dictionaries. In this short paper I will attempt to situate the initiative within a broader perspective and to comment on work done elsewhere that might be adapted as part of the initiative.

2. AUDIENCE

Simons and Bird (2000) identify the primary audience as "the people who want to access language materials which have been stored away in archives". This is probably intended to mean scholars and researchers, but in linguistics even this can be very broad. The research culture varies from work done collaboratively in large and well-funded research groups in computational linguistics to individual scholars who examine specific texts and data in great detail. In some areas of this research, the emphasis is on methodology; in others it is much more on the interpretation of results. There is a wide range of theoretical approaches in linguistics, all of which may need to be accommodated. Linguistics can perhaps be described as an interdisciplinary subject straddling the sciences and the humanities. It also incorporates a historical dimension with studies of ancient languages and of the development of language. As the Web reaches a wider audience in schools, homes and in more countries around the world, the infrastructure has the potential to reach communities well beyond the primary audience. It could impact on teaching at all levels, on lifelong learning, and on cultures that have evolved in a different way from those in Europe and North America.

3. MODELS OF RESEARCH

In the traditional model of research there is a distinction between infrastructure for locating information and infrastructure for analysing and manipulating that information. Most commonly a researcher will begin a project with a literature search using a library catalogue and bibliographic databases. He or she will use the information derived from the literature search as a basis for further research. For computer-aided linguistic research, field work or experimentation or other data collection methods will then generate a dataset or resource which is manipulated and analysed by a combination of software tools and human judgement. In this model I think that the only real link between the information derived from the literature search and the analysis of the data is the intellectual rationale for the project. This is something that is not usually publicly expressed in writing until the research is complete and the results are published.

As more and more resources become available in electronic form, we look to the availability of electronic and Web-based tools to help us find them. We are now in the situation where there are a number of different repositories of linguistic and textual data, each providing an infrastructure that is described and documented by its own catalogues or finding aids. These tend to provide access, in the library sense of locating information, to linguistic resources. The researcher takes the resource and does whatever he or she wants to do with it independently of the repository. If this activity results in an enhanced version of the resource, the new version is often returned to the repository. In this case something has been done to the intellectual content of the resource and the intellectual rationale for whatever has been done needs to be documented in some metadata associated with the resource. It becomes more closely connected with the resource than a journal article describing what has been done. But one fundamental question I would like to ask is: can we build a repository structure that incorporates not only access in the library sense, but also access in the sense of manipulation and analysis of the resources? What would be needed to bring these two different infrastructure functions together and what are the key metadata requirements needed to underpin it?

3. DIGITAL LIBRARIES AND LINGUISTIC RESOURCES

What we are working towards seems to be a digital library of linguistic resources. For locating linguistic resources which can, as in a traditional library, be on any topic, it should be possible to draw on tools developed in the library and information science community where there are many years of experience in knowledge organization and the development of classification schemes and thesauri. Given the need to file books serially on shelves, many of these schemes are linear or hierarchic. Notable among these are the Library of Congress classification scheme, used very widely in North America, and the Dewey Decimal Classification. But some more sophisticated faceted classification schemes do exist (notably the Bliss Bibliographic Classification and recent revisions of the Universal Decimal Classification (http://www.udcc.org)), which have obvious applications in the electronic environment. See Broughton (2000) for a discussion of the role of these classification schemes in Web searching. Recent discussions on the Dublin Core dc-general list have centred around the role of classification schemes and thesauri for subject terms. The HILT (High-Level Thesaurus) Project (http://hilt.cdlr.strath.ac.uk/Sources/thesauri.htm) is studying the problem of cross-searching and browsing by subject across a range of communities, services, and service or resource types, and it has compiled a list of classification schemes and thesauri. Some of these are for specific communities and some are more general, but I do think we need to look at what we have in common with general approaches to web searching.

The library community also has a very well developed system of name authority files which deals with different spellings and forms of personal and corporate names, also geographic names of political and civil jurisdictions (http://lcweb.loc.gov/cds/name_aut.htm). It also has a set of standard names for languages at http://lcweb.loc.gov/marc/languages used for MARC cataloguing. I am by no means suggesting that these tools should be applied to the infrastructure without a more detailed analysis and evaluation, but I do want to emphasize that some of this landscape has been well-covered already by knowledge organization professionals and that it makes sense to see whether this can be an appropriate starting point.
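The idea behind an authority file, in which many variant forms resolve to one standard form, can be illustrated with a minimal sketch in Python. The table below is made up for this illustration and is not an extract from any Library of Congress file; the three-letter codes follow the style of the MARC language code list, and the function names are hypothetical.

```python
# Illustrative table only: standard code -> name forms that should resolve to it.
# The entries were chosen for this sketch and are not copied from a real authority file.
LANGUAGE_FORMS = {
    "eng": ["English", "Anglais", "Englisch"],
    "fre": ["French", "Français", "Francais", "Französisch"],
    "ger": ["German", "Deutsch", "Allemand"],
}


def normalize(name):
    """Fold case and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())


def build_index(forms_by_code):
    """Invert the table: every normalized variant form points at its standard code."""
    index = {}
    for code, forms in forms_by_code.items():
        for form in forms:
            index[normalize(form)] = code
    return index


INDEX = build_index(LANGUAGE_FORMS)


def standard_code(name):
    """Return the standard code for a language name, or None if the form is unknown."""
    return INDEX.get(normalize(name))
```

A catalogue built on such an index would let a search for "Deutsch" and a search for "German" retrieve the same resources. A real authority file would also record the source of each variant and handle forms that are ambiguous between languages, which this sketch does not attempt.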

4. THE TEXT ENCODING INITIATIVE

The Text Encoding Initiative (TEI) attempted to create a metadata framework for electronic texts. Developed in the early 1990s, the TEI header was, as far as I am aware, the first serious attempt to define metadata in the same format as the data which it describes. As the Oxford Text Archive can attest, before the TEI little was known about many existing electronic texts, even to the extent, in some cases, of not knowing what language they were in (Proud 1989). The TEI header consists of a set of SGML tags that provide not only bibliographic data for the electronic text, but also information to help the scholar who is using the material. The TEI header is intended to provide a chief source for cataloguing information and the International Standard Bibliographic Description (General) (ISBD(G)), the Anglo-American Cataloguing Rules (second edition) (AACR2) and the ANSI standard governing bibliographic references were all "influential in developing the content of the TEI header" (Sperberg-McQueen and Burnard 1994: 137). The header consists of four main sections. Giordano (1994) provides the rationale for the header from a librarian's perspective.

The file description section provides bibliographic details and is built on the assumption that the electronic text is a different intellectual object from the source from which it is derived. It is just about possible to map the bibliographic details on to some fields of a MARC record and many of the tags in the file description will be familiar to cataloguers. The TEI guidelines include a separate chapter on independent headers which are TEI headers that can function without the attached text in order to build catalogues and other metadata structures.

The encoding description includes tags that describe principles governing the transcription of the source and thus provides some parameters to processing programs. It has nine subdivisions:

Some of these specifications can be presented either as prose text or as a set of descriptor terms.

The profile description collects together other important pieces of metadata, notably:

The revision history provides a record of changes made to the text, who was responsible for the changes and why they were made. A special version of the TEI header can handle corpora, where there is one header with information that is relevant to the whole corpus and separate headers for each component of the corpus. Dunlop (1995) discusses the use of TEI headers in the British National Corpus.
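The four-part structure of the header described above can be sketched as follows. This is a simplified, hypothetical header written as XML and read with Python's standard library; the text content, the editor's name and the choice of sub-elements are assumptions made for illustration, and a real header would carry far more detail. The function name summarize is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A deliberately small, made-up header showing the four main sections.
# Element content is invented for this sketch.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>A sample text: an electronic transcription</title></titleStmt>
    <publicationStmt><p>Unpublished sample.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a hypothetical printed edition.</p></sourceDesc>
  </fileDesc>
  <encodingDesc>
    <editorialDecl><p>Spelling has not been normalized.</p></editorialDecl>
  </encodingDesc>
  <profileDesc>
    <langUsage><language>English</language></langUsage>
  </profileDesc>
  <revisionDesc>
    <change>
      <date>2000-12-01</date>
      <respStmt><name>A. Editor</name><resp>proofreading</resp></respStmt>
      <item>Corrected against the source.</item>
    </change>
  </revisionDesc>
</teiHeader>
"""


def summarize(header_xml):
    """Map each top-level header section to the names of its direct children."""
    root = ET.fromstring(header_xml.strip())
    return {section.tag: [child.tag for child in section] for section in root}


SUMMARY = summarize(SAMPLE_HEADER)
```

Here SUMMARY lists the file description, encoding description, profile description and revision history in document order, each with the elements it contains. Note that the prose inside elements such as the editorial declaration is carried along but cannot be interpreted by a program.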

The TEI has been widely adopted for many applications, but I think it is also fair to say that people who do not have expertise in knowledge organization or library and information studies have not found headers easy to deal with. Some of the most successful headers are in digital library projects such as the Humanities Text Initiative at the University of Michigan (http://www.hti.umich.edu) or the electronic text centre at Virginia (http://etext.lib.virginia.edu/) or the Victorian Women Writers Project at Indiana University (http://www.indiana.edu/~letrs/vwwp/index.html). These services do provide a simple digital library environment, where the user locates a text using information derived from the header and then, using the same interface, goes on to search the actual text (with Open Text as the backend search engine), but only word (string) searches are provided. A simple set of concordance-like entries is returned and the user can click on a reference citation to go to the complete text, and also to see the full details in the header.

The Oxford Text Archive (OTA) (http://ota.ahds.ac.uk/), which is now part of the UK Arts and Humanities Data Service, also uses TEI headers as its main method of data description. Chapter 6 of Morrison, Popham and Wikander (n.d.) discusses the TEI header in relation to metadata for the OTA texts and the OTA site has a page showing mappings between TEI header elements, USMARC fields and the Dublin Core elements.
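A mapping of this kind can be expressed as a simple crosswalk table. The sketch below, in Python, pairs a few header elements with Dublin Core element names and applies the table to a small made-up header. The pairings are assumptions chosen for illustration and should not be read as the OTA's published mapping; the sample content and the function name to_dublin_core are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: path within the header -> Dublin Core element name.
# These pairings are illustrative assumptions, not an official mapping.
CROSSWALK = {
    "fileDesc/titleStmt/title": "Title",
    "fileDesc/titleStmt/author": "Creator",
    "fileDesc/publicationStmt/publisher": "Publisher",
    "profileDesc/langUsage/language": "Language",
}

# Made-up header content for the sketch.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A sample text: an electronic transcription</title>
      <author>Example, Anne</author>
    </titleStmt>
    <publicationStmt><publisher>Example Text Archive</publisher></publicationStmt>
    <sourceDesc><p>Transcribed from a hypothetical printed edition.</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language>English</language></langUsage>
  </profileDesc>
</teiHeader>
"""


def to_dublin_core(header_xml):
    """Collect the text of each mapped header element under its Dublin Core name."""
    root = ET.fromstring(header_xml.strip())
    record = {}
    for path, dc_name in CROSSWALK.items():
        values = [el.text.strip() for el in root.findall(path)
                  if el.text and el.text.strip()]
        if values:
            record[dc_name] = values
    return record


RECORD = to_dublin_core(SAMPLE_HEADER)
```

Only elements with a counterpart in the table survive the conversion; the source description in the sample, for instance, is silently dropped. That loss is typical when a richer description is mapped on to a simpler element set.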

The TEI header is intended to be both human readable and machine processable but in my view it has several problems which tend to place it somewhere between these two. Some of the key elements, especially in the encoding and profile descriptions, are intended to describe the intellectual rationale for the encoding. The intellectual rationale is obviously a key aspect for the user, but it is not easy to express it other than in prose text and the free form permitted by the header of course loses processability. The elements which encode the bibliographic details were defined by the librarians and archivists on the header work group, but few of these people at that time had experience of actually using electronic texts. It was thus difficult to make a comprehensive assessment of the suitability of the library model, which is intended for fixed format physical objects, for a new and much more flexible and dynamic form of object. But that does not mean to say that we should discard all the library science know-how. Rather, this know-how needs to be adapted to a different environment.

The TEI makes a clear distinction between metadata and data. For example the header has bibliographic details of a work but the front matter in the body of the text also includes tags for the information on the title page. Exactly what is data and what is metadata is not at all clear in some cases. Scholars who have tried to use the TEI header for manuscript description have found it difficult to decide what descriptors to put in the header and what are more appropriate for the description and transcription of the manuscript itself. The MASTER Project, which is defining a European Standard for Manuscript Description, proposes a new element <msDescription> which may appear either within the body or the header of a TEI-conformant document (Burnard and Robinson 1999).

The TEI also developed a method of encoding partial and uncertain features (Sperberg-McQueen and Burnard 1994: Chapter 17). Discussions on how to deal with these problems went into some considerable depth. Uncertainty is not a binary issue but a gradation of description. What do you mean when you say that you are 10% certain this is X? What happens when new information comes to light later on? The TEI went to some lengths to provide ways of representing multiple conflicting views of data, but the practicalities of doing this really inhibit its use. The TEI does, however, provide a resp attribute for some tags, giving a simple way of saying who is responsible for this particular interpretation, also a <respStmt> element giving a statement of responsibility for some of the header elements, and a more complex element <respons> to specify more detailed responsibility for different aspects of the interpretation of a feature.
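The general problem of holding several conflicting interpretations of one feature, each with a responsible party and a degree of certainty, can be sketched as a small data structure. The Python below is not the TEI mechanism itself; the class names, field names and the 0-to-1 certainty scale are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interpretation:
    value: str        # the proposed reading or analysis
    resp: str         # who is responsible for this interpretation
    certainty: float  # assumed scale: 0.0 (pure guess) to 1.0 (certain)
    note: str = ""    # free prose rationale, not machine-interpretable


@dataclass
class Feature:
    locus: str  # where in the text the feature occurs
    readings: List[Interpretation] = field(default_factory=list)

    def add(self, value, resp, certainty, note=""):
        """Record a further interpretation without discarding earlier ones."""
        if not 0.0 <= certainty <= 1.0:
            raise ValueError("certainty must lie between 0.0 and 1.0")
        self.readings.append(Interpretation(value, resp, certainty, note))

    def by(self, resp):
        """All interpretations for which the given party is responsible."""
        return [r for r in self.readings if r.resp == resp]

    def preferred(self) -> Optional[Interpretation]:
        """The interpretation with the highest stated certainty, if any exist."""
        if not self.readings:
            return None
        return max(self.readings, key=lambda r: r.certainty)
```

Because new interpretations are appended and nothing is overwritten, information that comes to light later can be added alongside earlier views. The sketch does not settle what a certainty of 0.1 actually means, which is exactly the difficulty raised above.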

5. ARCHIVES AND DATASETS

The proposed repository is also an archive of data and so it might also be worth examining models for data description in the archival community. ISAD(G) (http://www.ica.org/ISAD(G)E-pub.pdf) is an international standard for archival description, intended initially for a paper-based system. It includes a standard set of descriptors with emphasis on providing enough contextual information to enable researchers to use an archive or special collection of papers, photos and the like. The Encoded Archival Description (EAD) (http://lcweb.loc.gov/ead/) is an SGML (now XML) application for archival finding aids which can be mapped on to ISAD(G). EAD is very widely used and there are experiments in linking it to transcriptions or digital images of the material it describes with tools to help users work with the material. See for example the William Elliot Griffis Collection Online Prototype at Rutgers University (http://www.ceth.rutgers.edu/projects/griffis/project.htm) where a finding aid takes the user directly to searchable electronic editions of the documents. The EAD was modelled to some extent on the TEI. An EAD finding aid has a header followed by the archival description as the body of the document. The hierarchic nature of the EAD probably precludes its use for linguistic resources, but it might be useful to examine some of the intellectual rationale incorporated in the EAD as it is essentially a metadata structure.

Methods for data description are also very advanced in the social science and historical data archiving communities, where contextual material to enable the user to make sense of the data elements is crucial (ICPSR 1999; UK Data Archive n.d.). Data is described in the form of codebooks for which an XML application is under development. The UK Data Archive at Essex University includes some linguistic material in its holdings and has categories for linguistic data in its Humanities and Social Science Electronic Thesaurus (HASSET). Furthermore, links are already being made between ISAD(G) and electronic archival datasets. Shepherd and Smith (2000) explore the application of ISAD(G) to archival datasets, redefining some elements and adding some new ones.

6. WHERE NEXT?

There has not been space in this short paper to examine other more general metadata initiatives, but it seems as if the Resource Description Framework (RDF) (http://www.w3c.org/RDF) is going to be important in the future. The Dublin Core has been promoted as a metadata tool but it provides only a simple element set requiring extension for specific disciplines and applications. Miller and Greenstein (1997) provide details of a practical implementation of the Dublin Core for the humanities intended to underpin the Arts and Humanities Data Service gateway.

For this initiative, I think that there are two major issues. Firstly, whatever description mechanism is chosen or developed must be capable of being mapped automatically on to other standards. Without this the proposed repository will become more and more isolated. I think this is particularly important for a metadiscipline, which impinges on so many other disciplines, but it is also more problematic because of the need for semantic interoperability across very many topics and subject areas. Secondly, there should be integration between data and software. In a world where the Web is increasingly functioning as a desktop and an operating system, it does not make sense to me to have one infrastructure for locating resources and another one for carrying out research on what has been located. I would like to see these functions brought together within a single operating environment.

REFERENCES

Vanda Broughton. (2000). "Structural, linguistic and mathematical elements in indexing languages and search engines; implications for the use of index languages in electronic and non-LIS environments". Dynamism and Stability in Knowledge Organization; Proceedings of the Sixth International ISKO (International Society for Knowledge Organization) Conference 10-13 July 2000, edited by Clare Beghtol, Lynne C. Howarth and Nancy J. Williamson. Würzburg: Ergon, 206-212.

Lou Burnard and Peter Robinson. (1999). Towards a European Standard for Manuscript Description: the MASTER project. http://www.hcu.ox.ac.uk/TEI/Master/Hermes/front.htm.

Dominic Dunlop. (1995). "Practical Considerations in the Use of TEI Headers in a Large Corpus". Computers and the Humanities, 29: 85-98.

R. Giordano. (1994). "The Documentation of Electronic Texts using Text Encoding Initiative Headers: An Introduction". Library Resources and Technical Services, 38: 389-401.

ICPSR (Inter-university Consortium for Political and Social Research). (1999). ICPSR Guide to Social Science Data Preparation and Archiving. http://www.icpsr.umich.edu/ICPSR/Archive/Deposit/dpm.html.

Paul Miller and Daniel Greenstein. (1997). Discovering Online Resources Across the Humanities: A Practical Implementation of the Dublin Core. Bath: UK Office for Library Networking. http://ahds.ac.uk/public/metadata/discovery.html.

Alan Morrison, Michael Popham and Karen Wikander. (n.d.). Creating and Documenting Electronic Texts: A Guide to Good Practice. http://ota.ahds.ac.uk/documents/creating/.

Judith K. Proud. (1989). The Oxford Text Archive. Oxford: British Library Research and Development Report.

Elizabeth Shepherd and Charlotte Smith. (2000). "The Application of ISAD(G) to the Description of Archival Datasets". Journal of the Society of Archivists, 21: 55-86.

Gary Simons and Steven Bird. (2000). RFC: Requirements on the Infrastructure For Digital Language Documentation and Description. 14 November 2000. http://www.ldc.upenn.edu/exploration/expl2000/requirements.html

C. M. Sperberg-McQueen and Lou Burnard (eds.). (1994). Guidelines for the Encoding and Interchange of Electronic Texts. TEI P3. Chicago and Oxford: ACH, ACL and ALLC. http://www.uic.edu/orgs/tei/p3/doc/p3.html.

UK Data Archive. (n.d.). Good Practice in Data Documentation. http://www.data-archive.ac.uk/creatingData/goodPractice.doc.