Data Design for Endangered Languages:
Increasing the "Linguistic Bandwidth"

David Nathan, AIATSIS

Australian Institute of Aboriginal and Torres Strait Islander Studies
GPO Box 553, Acton, ACT 2602 Australia
djn@aiatsis.gov.au | www.aiatsis.gov.au | www.dnathan.com


Recently there has been a remarkable convergence between contemporary issues for Australianist linguistics and advancements in information technology. Along with increased interest in language maintenance and/or revival, especially within indigenous communities, we now have opportunities provided by multimedia and Internet technologies. I will show prototype tools that facilitate the development of structured multimedia and hypertextual linguistic resources that are platform-independent and use a standard format (XML).

With colleagues, I have been researching and implementing tools to meet two major requirements for contemporary work with endangered languages: richer documentation of linguistic knowledge and events; and stronger support for producing materials for language maintenance in the communities.

In the past, our projects have focussed on modelling data and processes, and implementing them in structured formats such as databases [3][4]. Following the growth of electronic networks, the value of such data orientation has been highlighted, because it allows data to be restructured, repurposed, compared, or combined with other data. What has been missing, however, is:

Peter Austin and I have recently proposed a process called the Australian Linguistic eXchange (ALX), which will establish draft standards for encoding and exchanging lexicographic, interlinear, text, and sound data.

In the presentation I will discuss new approaches in the context of software tools for presenting sound, text, lexicon and linguistic analysis. A platform for producing and presenting linked audio, video, text, and linguistic description, originally developed in collaboration with Dr Eva Csató at Tokyo University of Foreign Studies for the endangered Turkic language Karaim, provides a generalised template architecture and has proved adaptable for other languages of similar typologies [2]. Transparent import formats allow it to support collaborative, iterative development of resources. We have demonstrated the use of the system for several languages including Sasak and Yolngu-Matha.

Fig 1: Karaim platform, main screen features.

This platform can be regarded as a multimedia browser for richly linked linguistic data. The next step is to build a complementary tool for transparently authoring this kind of material. This phase has begun with the development of a system for annotating real-time media (sound, video). When completed, this system will allow sound and video to be linked to transcriptions, a lexicon, morphological analysis, or any other user-specified description, and then output to structured XML files. The new resources can then be archived, printed, viewed using a multimedia browser such as a web/XML browser or the Karaim platform, or published so that others can restructure and re-use them for different purposes. The collaborative process has been illustrated by accessing an XML-encoded dictionary via the Internet, extracting text and references from it, and then inserting them into interlinear annotations that combine the lexicographer's data, the recorded linguistic event data, and the linguistic description/analysis. The resultant annotations are exported in an explicit XML format, so that the linguistic description and analysis can refer to the original linguistic performance.

Fig 2: Annotation tool, accessing remote dictionary.
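The annotation workflow described above can be illustrated with a minimal Python sketch. It is not the annotation tool itself: the element and attribute names (annotation, segment, word, entry, gloss), the file name recording.wav, and the placeholder word forms are all assumptions made for illustration, and the dictionary is an inline string rather than a resource fetched over the Internet. The sketch reads an XML-encoded dictionary, links time-aligned transcription segments to dictionary entries, and exports the result as XML that still points back to the recording.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-encoded dictionary. Headwords and glosses are placeholders,
# not real language data, and the schema is an assumption for this sketch.
DICTIONARY_XML = """
<dictionary>
  <entry id="e1"><headword>wordA</headword><gloss>gloss-A</gloss></entry>
  <entry id="e2"><headword>wordB</headword><gloss>gloss-B</gloss></entry>
</dictionary>
"""


def load_lexicon(xml_text):
    """Map each headword to a tuple of (entry id, gloss)."""
    root = ET.fromstring(xml_text.strip())
    return {
        entry.findtext("headword"): (entry.get("id"), entry.findtext("gloss"))
        for entry in root.iter("entry")
    }


def annotate(media_href, segments, lexicon):
    """Build an annotation tree linking timed segments to dictionary entries.

    segments is a list of (start_seconds, end_seconds, transcription) tuples.
    """
    root = ET.Element("annotation", media=media_href)
    for start, end, transcription in segments:
        seg = ET.SubElement(root, "segment",
                            start=f"{start:.2f}", end=f"{end:.2f}")
        ET.SubElement(seg, "transcription").text = transcription
        for token in transcription.split():
            word = ET.SubElement(seg, "word")
            word.text = token
            # Words found in the dictionary carry a reference to their entry;
            # unknown words are kept but left unlinked.
            if token in lexicon:
                entry_id, gloss = lexicon[token]
                word.set("entry", entry_id)
                word.set("gloss", gloss)
    return root


lexicon = load_lexicon(DICTIONARY_XML)
tree = annotate(
    "recording.wav",
    [(0.0, 1.5, "wordA wordB"), (1.5, 2.8, "wordC")],
    lexicon,
)
xml_out = ET.tostring(tree, encoding="unicode")
print(xml_out)
```

Because each word element keeps both its time-aligned segment and its dictionary entry reference, a consumer of the exported file can move from an analysis back to the recorded performance, or outward to the lexicographer's data. A real exchange format would need agreed element names and identifiers, which is the kind of decision the ALX process is intended to address.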

It can be seen that there are methodological implications of such approaches, as they make linguistic description and analysis (cf [1]):

Thus, by making explicit the relationships between language performances and the recordings, descriptions, and analyses derived from them, the derived descriptions, analyses etc. become representations and implementations of pathways through the linguistic knowledge domain, anchored ultimately to the performances of language speakers. In turn, the formalised relationships become a framework for co-operation, not only between linguists, but also between linguists, language teachers, and the communities who are increasingly pressing for our contribution.

References

[1] Bird, S. (1999). Multidimensional exploration of online linguistic field data. NELS 29: 33-50.
[2] Nathan, D. (in press). The spoken Karaim CD: Sound, text, lexicon and "active morphology" for language learning multimedia. In Proceedings of the Ninth Annual Conference on Turkish Linguistics (Oxford, 1998).
[3] Nathan, D. (1996). Caught in a Web of Murri Words: Making and Using the Gamilaraay Web Dictionary. Library Automated Systems Information Exchange 27(4) (December 1996): 35-42.
[4] Nathan, D. & Austin, P. (1992). Finderlists, Computer-generated, for bilingual dictionaries. International Journal of Lexicography 5(1): 32-65.


Linguistic Exploration Workshop