Two Perspectives: Europe, and Newly Literate Languages

Nicholas Ostler
Linguacubun Ltd, and
Foundation for Endangered Languages
nostler@chibcha.demon.co.uk

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.


Abstract. I briefly review a number of current European initiatives aimed at establishing standards for representation of linguistic data in digital form. These programmes have characteristic strengths as multilingual common standards, but they also have major drawbacks as a basis for web-based documentation that is closer to universal. From a different direction, I also consider the requirements on web-documentation from languages which have only recently become literate. There is an unresolved conflict between the intrinsically contentious progress towards a practical representation standard that will be acceptable to a whole community, and the process of more recondite argument and analysis which may converge on a script preferable on academic grounds.


1. European Background

1.1. The Projects

The search for appropriate standards for electronic language documentation and description has quite a long history in the language research and engineering efforts of the European Union. Since the mid-1980s a number of projects have been supported. The major ones are listed here:

Besides these more broadly-based actions, a number of EU-funded projects have attempted to pioneer representation standards of their own. Two examples, with which I happen to have been associated, are MULTILEX, which attempted to set a common formalism for multilingual lexica, and GRAMLEX, which proposed a representation for morphological analysis of highly inflected, and of agglutinative, European languages.

1.2. Retrospect and Evaluation

The strengths of these projects have been substantial.

  1. They involved simultaneous work undertaken from a multilingual perspective that encompassed all the major official languages of Europe.
  2. They received substantial funding in whole (or more usually, in part) from the European Union, sometimes for up to a decade.
  3. And the concerted actions in particular (SPEECHDAT, LE-PAROLE, EAGLES) were undertaken by teams that encompassed the whole breadth of the European Union.

Their limitations, however, need to be borne in mind before accepting their outcomes as prototype standards for multilingual resources more universally.

  1. They only involved the majority languages of Western (and to an increasing extent Eastern) Europe. (Irish is partly represented in LE-PAROLE - with a corpus but no lexicon - and Welsh is present within SPEECHDAT.)
  2. Large-scale multilingual corpora collections remain prey to access problems, often because copyright issues remain unresolved. (ELAN is not fully functional.)
  3. The textual resources accumulated under these standards are available as an infrastructure, but no applications for them are yet evident.
  4. By and large the public have not been involved in the compilation of these textual resources, which have been put together by academics and (to a lesser extent) corporations and public bodies.
  5. There has been relatively little involvement by theoretical linguists in these standards or in use of the corpora: in general, only computational linguists have participated.
  6. Although the projects have usually received some funding from participating speech and language engineering companies, there has been little industrial exploitation of the resulting resources. (Speech data have been much more used than textual data.)

2. Newly Literate Languages

2.1. Two Conferences

In two recent conferences, there has been considerable attention devoted to the requirements of newly literate communities on their language representations. It seems possible to draw some conclusions.

The two conferences are: Foundation for Endangered Languages IV, held in September 2000 in Charlotte, NC, USA, whose theme was Endangered Languages and Literacy; and Endangered Languages of the Pacific Rim (ELPR), held in November 2000 in Kyoto, Japan. Here in particular Willem Adelaar spoke of the problems of standardizing newly described languages (such as Quechua) in disregard of indigenous traditions and attitudes; and Terrence Kaufman gave some general recommendations on effective methods to document endangered languages.

2.2. Points Emerging

The early stages of a literate tradition are inevitably contentious, especially because the choices implicit in setting up a standard heighten awareness of, and rivalry between, different dialects. Although the contentions can be resolved, this inevitably takes time, usually an unpredictably long period of time. These essentially political delays may subvert the schedule of a precisely defined project.

Furthermore, the preferred option of a language community may, in various respects, tell against the option chosen by linguistic analysts. Communities seeking easy access for their members to a new system are likely to set a high value on familiarity, so that (e.g.) anomalous features of a spelling system from a local metropolitan language may be carried over into representing the newly literate language.

This underlines the fact that arriving at an acceptable set of norms for a language is more a matter of patient negotiation than incisive analysis.