Two Perspectives: Europe, and Newly Literate Languages
Nicholas Ostler
Linguacubun Ltd, and
Foundation for Endangered Languages
nostler@chibcha.demon.co.uk
Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.
Abstract.
I briefly review a number of current European initiatives aimed at
establishing standards for representation of linguistic data in
digital form. These programmes have characteristic strengths as
multilingual common standards, but they also have major drawbacks as
a basis for web-based documentation that is closer to universal.
From a different direction, I also consider the requirements on
web-documentation from languages which have only recently become
literate. There is an unresolved conflict between the intrinsically
contentious progress towards a practical representation standard that
will be acceptable to a whole community, and the process of more
recondite argument and analysis which may converge on a script
preferable on academic grounds.
1. European Background
1.1. The Projects
The search for appropriate standards for electronic language
documentation and description has quite a long history in the
language research and engineering efforts of the European Union.
Since the mid-1980s a number of projects have been supported. The
major ones are listed here:
- Text Encoding Initiative (TEI), which developed
guidelines for the preparation and interchange of electronic texts,
both for scholarly research and for a broad range of uses by the
language industries.
- Expert Advisory Group for Language Engineering Standards
(EAGLES), which has formulated standards for speech, text
representation, and tools.
- Speech Databases for Creation of Voice Driven Tele-services
(SPEECHDAT), provided for 16 European languages (including Welsh).
- Preparatory Action for Linguistic Resources Organization
for Language Engineering (PAROLE), which built corpora and lexica
for 12 languages, the official European languages (plus Catalan, and
a corpus for Irish).
- Trans European Language Resources Infrastructure (TELRI, and
TELRI-II), building up corpus resources and common tools, primarily
in the languages of Eastern European and former Soviet republics.
- European Language Activity Network (ELAN), intended
to provide a common web-based access framework for the PAROLE and
TELRI resources.
- European Language Resources Association (ELRA), which provides a
catalogue of linguistic resources (speech databases, text corpora and
lexica, terminologies) in a variety of languages on a payment basis.
Besides these more broadly-based actions, a number of EU-funded
projects have attempted to pioneer representation standards of their
own. Two examples, with which I happen to have been associated, are
MULTILEX, attempting to set a common formalism for multilingual
lexica, and GRAMLEX, which proposed a representation for
morphological analysis of highly inflected, and of agglutinative,
European languages.
1.2. Retrospect and Evaluation
The strengths of these projects have been substantial.
- They involved simultaneous work undertaken from a
multilingual perspective that encompassed all the major official
languages of Europe.
- They received substantial funding in whole (or more
usually, in part) from the European Union, sometimes for up to a
decade.
- And the concerted actions in particular (SPEECHDAT,
LE-PAROLE, EAGLES) were undertaken by teams that encompassed the
whole breadth of the European Union.
Their limitations, however, need to be borne in mind before accepting
their outcomes as prototype standards for multilingual resources more
universally.
- They only involved the majority languages of Western (and
to an increasing extent Eastern) Europe. (Irish is partly
represented in LE-PAROLE - with a corpus but no lexicon - and Welsh
is present within SPEECHDAT.)
- Large-scale multilingual corpus collections remain prey
to access problems, often because copyright issues remain unresolved.
(ELAN is not fully functional.)
- The textual resources accumulated under these standards
are available as an infrastructure, but have no evident applications.
- By and large the public have not been involved in the
compilation of these textual resources, which have been put together
by academics and (to a lesser extent) corporations and public bodies.
- There has been relatively little involvement by
theoretical linguists in these standards or in use of the corpora: in
general, only computational linguists have participated.
- Although the projects have usually received some funding
from participating speech and language engineering companies, there
has been little industrial exploitation of the resulting resources.
(Speech data have been much more used than textual data.)
2. Newly Literate Languages
2.1. Two Conferences
In two recent conferences, there has been considerable attention
devoted to the requirements of newly literate communities on their
language representations. It seems possible to draw some conclusions.
The two conferences are:
Foundation for Endangered Languages IV, held in September 2000 in
Charlotte, NC, USA, whose theme was Endangered Languages
and Literacy, and
Endangered Languages of the Pacific Rim (ELPR),
held in November 2000 in Kyoto, Japan. At the latter in particular,
Willem Adelaar spoke of the problems of standardizing newly described
languages (such as Quechua) in disregard of indigenous traditions and
attitudes, and Terrence Kaufman gave some general recommendations on
effective methods to document endangered languages.
2.2. Points Emerging
The early stages of a literate tradition are inevitably contentious.
This is especially so because the choices implicit in setting up a
standard heighten awareness of, and rivalry between, different
dialects. Although the contentions can be resolved, this takes time,
usually an unpredictably long time. These essentially political
delays may subvert the schedule of a precisely defined project.
Furthermore, the preferred option of a language community may, in
various respects, tell against the option chosen by linguistic
analysts. Communities seeking easy access for their members to a new
system are likely to set a high value on familiarity, so that (e.g.)
anomalous features of a spelling system from a local metropolitan
language may be carried over into representing the newly literate
language.
This underlines the fact that arriving at an acceptable set of norms
for a language is more a matter of patient negotiation than incisive
analysis.