Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.
As texts in a diversity of languages are created, shared, and archived electronically, the need to describe the writing system of these texts becomes readily apparent. How can one know what is represented by the graphic symbols used in a text? Or how can various computer processes treat a string of characters in a little known language?
A writing system description formally describes the characteristics of each of the units of writing in a given language. It also describes the relationship of these units to each other and to linguistic units. Such a description allows the machine and the user to share information about a writing system and enables a computerized process to intelligently interact with a person.
This paper introduces a formal model for writing system descriptions that captures the semantics of writing systems for both humans and computers.
The advent of the computer, ushering in the current information age, has brought the ability to create, store, and manipulate large amounts of information. Much of this information is textual in nature, representing writing from numerous languages. Furthermore, each language is written differently.
As texts in a diversity of languages are created, shared, and archived electronically, the need to describe the writing systems used by these texts becomes readily apparent. How can one know what is represented by the graphic symbols used in a text? Similarly, how can various computer processes “know” how to treat a string of characters in a little known language?
It is the premise of this paper that the solution to these problems lies in electronic writing system descriptions that can be read by people or computational processes to gain access to the information contained within. The advent of computer technology has created an altogether new medium for writing: digital writing, where writing is retained not on paper or clay tablets but in a series of invisible electronic bits and bytes. The fact that computers themselves are able to store writing means that the relationship between graphic units and computational units must be described.
This need for documentation has not gone unnoticed. Gary Simons and Steven Bird reference this need as a user requirement (6) for a digital infrastructure of language documentation and description, stating that “archived information [must be]… self-documenting with respect to how it is electronically encoded” (Simons and Bird 2000:§2.1).
The Text Encoding Initiative (TEI) sought to address this problem. It introduced Writing System Declarations (WSD) as auxiliary documents referenced by materials that used the TEI standard for text encoding and interchange (Sperberg-McQueen and Burnard 1999). Although they treat key issues relating to the encoding and rendering of text, TEI WSDs do not provide linguistic information about the writing systems and unfortunately do not enjoy widespread usage (Birnbaum, Cournane and Flynn 1999:50).
Writing system descriptions are the key to understanding written materials. Just as legends provide vital information for map reading, so also writing system descriptions provide vital information for archived written materials. There is enough overlap between the information about writing systems that is required by computers and that required by people to merit electronic writing system descriptions that can address the concerns of both parties.
There are several components to consider when exploring the workings of a writing system. What are the properties and interrelationships of the characters, the atomic units of writing systems? What linguistic elements are expressed in graphic form? How are the sounds (or other linguistic elements) expressed in the writing system? What are their names? Are there special contexts that determine variant graphic forms of the same character? What is the order into which the characters are conventionally arranged?
In addition to these details which a person needs to know about writing systems, the computer must also be able to recognize characters from electronic storage and take in their properties. It must be able to sort character strings. If it is to insert hyphenation points or check the spelling or grammar of a text, it must be able to identify syllables, words, and sentences. If it is to perform searches for abstract patterns, it must be able to determine whether a given character is a member of a given class of characters. In order for a computerized process to intelligently interact with someone concerning a writing system, the machine must share the same knowledge as the user.
When computer programs use electronic writing system descriptions as the source for information about the writing system, tasks that are writing system dependent can be handled in a general, efficient way for any language that has an electronic writing system description. Some of these tasks include breaking text strings into units such as words or sentences, hyphenation, patterned searches, and sorting. Since the electronic writing system descriptions exist separate from the computer programs that use them, many programs can share a single description and new descriptions can be created without modifying the program's code.
This paper presents a model for writing system descriptions that captures this type of information and can be used by both humans and computers. It is motivated by theory from the study of writing systems and informed by what is actually attested in existing writing system descriptions.
This model is designed to be functional, that is, it is designed to be useful. Linguists have recently begun to question the utility of the products of linguistic research: “To what extent do linguists' descriptions serve ‘consumers’ in domains beyond the discipline of linguistics?” (Butt 1996:xv–xvi). The intent of this research is to create a model that will serve not only linguists, but extend into a number of complementary domains.
It is the goal of this paper to encourage the description of writing systems and the dissemination of the knowledge contained therein. As such, this paper has a dual audience and a dual approach. First, it is aimed toward researchers who want to describe the writing systems of the world or who are involved in the creation of writing systems for as yet unwritten languages. It provides the groundwork for a repository of electronic writing system descriptions that would provide a resource for comparison, which may serve as a guide for possible solutions to difficult problems linguists face while creating writing systems. Thus, it contributes toward the larger goals of documenting and describing linguistic information for languages around the world such as would be needed for the proper documentation of archived electronic texts. Second, this paper is aimed at computer programmers who want to write generalized software that is truly multilingual and extensible.
Although the intent of this research is to design a method that is general enough to handle the wide variation attested in the world's writing systems, to completely validate that such a method can accurately handle any of the world's writing systems would involve the impossible task of describing every writing system in the world. However, Gary F. Simons has demonstrated that the prominent diversity of the world's writing systems is governed by simple processes, considering that “while there is tremendous diversity in the graphic symbols…, there is not very much conceptual diversity” (1989:539) in the writing systems of the world. Thus, if we determine the conceptual processes that operate, and provide a means of describing writing systems in terms of these similarities, the wide variation can be managed.
Let's take a look at how these writing system descriptions can serve as resources for users to determine what the graphic forms represent.
The current Prime Minister of Fiji is Laisenia Qarase. Without an understanding of Fiji's writing system, however, an attempt to pronounce his name would likely be misguided. A quick look at an excerpt from an electronic writing system description for Fijian (Figure 1) reveals that the grapheme ⟨q⟩ corresponds not to a voiceless velar plosive /k/, as we might be led to believe from other Roman-based writing systems, but to the prenasalized voiced velar plosive /ŋg/.
Figure 1 Fijian Electronic Writing System Description Screen Shot
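The kind of lookup a reader performs against Figure 1 can be sketched in a few lines of Python. This is only an illustration, not the format of the electronic description itself: the table name and the function `transcribe` are hypothetical, only the correspondence for ⟨q⟩ is taken from the text above, and the remaining entries are assumptions based on commonly cited conventions of Fijian orthography.

```python
from typing import Dict

# Partial, illustrative grapheme-to-phoneme table for Fijian.
# Only <q> -> /ŋg/ comes from the paper; the other entries are assumptions
# added for illustration and do not form a complete description.
FIJIAN_GRAPHEME_TO_PHONEME: Dict[str, str] = {
    "q": "ŋg",  # prenasalized voiced velar plosive (from the text)
    "b": "mb",  # assumed: prenasalized voiced bilabial plosive
    "d": "nd",  # assumed: prenasalized voiced alveolar plosive
    "g": "ŋ",   # assumed: velar nasal
    "c": "ð",   # assumed: voiced dental fricative
}


def transcribe(word: str) -> str:
    """Replace each grapheme with its phoneme; unlisted graphemes pass through."""
    table = FIJIAN_GRAPHEME_TO_PHONEME
    return "/" + "".join(table.get(ch, ch) for ch in word.lower()) + "/"


print(transcribe("Qarase"))
```

A one-to-one table of this kind is the simplest possible case; it ignores context and multi-character graphemes, both of which a full description would need to handle.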
While Roman-based writing systems are prevalent, and may lead some to the fallacy that every writing system represents language using the same conventions, non-Roman writing systems such as Burmese (Figure 2) obviously require interpretation.
Figure 2 Burmese Writing
Even exotic writing systems can be deciphered when a description is provided. If we look at the first unit of writing in the Burmese example in Figure 2, we can then compare this to those provided in Figure 3.
Figure 3 Burmese Electronic Writing System Description Screen Shot
Such analysis should lead us to the conclusion that the first unit of writing in Figure 2 corresponds to the grapheme named top-indented ba. Further analysis can then determine the corresponding linguistic unit. Figure 4 provides two rules that link top-indented ba with phonemes. Since the first of these rules requires further analysis to discover whether the second unit of writing in Figure 2 is a member of the class vowel diacritic or the class medial consonant, the search is not complete here. However, with continued analysis, it would be simple to determine that Figure 2 represents the syllable sequence /ba/ with creaky tone (tone 1) and /ma/ with low tone (tone 2).
Figure 4 Burmese Electronic Writing System Description Screen Shot
As is shown by these examples, writing system descriptions provide a simple means to accurately decipher written texts. By making such writing system descriptions available over the Internet, users can easily access the information they need.
There are potentially many theoretical perspectives on the study of writing systems. The framework for describing writing systems must accommodate more than one perspective. Thus, this system tends to be more general than specific.
M. A. K. Halliday suggested that there should be no distinction between “describing some feature and relating it to other features: describing anything consists precisely of relating it to everything else” (1996:21). Thus, if we assume that each feature must first be declared or named so that we might reference it, a description consists entirely of declarations and relations. Specifically, then, for writing system descriptions, the description for each feature consists of a declaration of the feature, and a series of relations to other features.
At first glance, this may seem to contradict Peter T. Daniels and William Bright's claim that in order to describe a writing system, “the characters of each writing system must be inventoried, and their use and interpretation ascertained” (1996:1). However, if the use of characters is described using relations as Halliday has suggested, then the interpretation of these characters could be surmised from these relations. Thus, Daniels and Bright's requirement for the description of writing systems can be framed in terms of declarations and relations.
The most basic information associated with any description of a writing system is the fundamental assumption that writing is built up from discrete segments. It is around these graphic units that the description is structured.
From a linguistic perspective, writing is a “device for expressing linguistic elements by means of visible marks” (Gelb 1963:13). Therefore, a writing system uses graphic forms to represent some level of linguistic units, such as words, syllables, or sounds, to name a few. Many linguists have found it helpful to acknowledge the existence of graphemes that “denote the minimal functional distinctive unit of any writing system” (Henderson 1984:15 (note)). Thus, graphemes are expressed by surface graphic forms that potentially vary given a particular context. This is illustrated in Figure 5 where the Greek grapheme sigma has two forms: one for word initial and medial contexts and another for word final context.
Figure 5 Contextually Variant Forms
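The selection of a contextual form for sigma can be sketched as follows. This is a minimal illustration in Python under simplifying assumptions: it handles lowercase text only, treats "word final" as "not followed by a word character", and the function name `render_sigma` is hypothetical.

```python
import re

SIGMA_NONFINAL = "σ"  # U+03C3, word-initial and word-medial form
SIGMA_FINAL = "ς"     # U+03C2, word-final form


def render_sigma(text: str) -> str:
    """Choose the graphic form of each sigma grapheme from its context."""
    # Normalize both graphic forms to the single underlying grapheme first.
    text = text.replace(SIGMA_FINAL, SIGMA_NONFINAL)
    # A sigma is word-final when no word character follows it.
    return re.sub(SIGMA_NONFINAL + r"(?!\w)", SIGMA_FINAL, text)


print(render_sigma("σοφοσ"))
```

The point of the sketch is that the stored text needs only one grapheme for sigma; the choice between graphic forms is derived from context.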
The linguistic perspective, as indicated in Figure 6, requires the declaration of three sets of components that are related such that there is a mapping between linguistic units and graphemes, and there is a mapping between graphemes and graphic forms. The process of writing is described by the mapping from linguistic units through graphemes to graphic forms, while the process of reading is the reverse mapping, from graphic forms through graphemes to linguistic units.
Figure 6 Linguistic Perspective
Joseph D. Becker was the first to describe the three aspects of text processing from a computational perspective:
There must be a way for text to be represented in the memory of a computer; there must be a way for text to be typed at the keyboard of a computer; there must be a way to present text to the typist. I shall refer to these realms as encoding, typing, and rendering. By rendering I mean both the display of text on the screen of a computer and the printing of text on paper. (1984:96)
Figure 7 shows the process by which the input from the keyboard is mapped into the appropriate encoding which is then mapped into the appropriate rendering.
Figure 7 Computational Perspective
The result of the computational perspective, the rendering, is the same as the graphic forms of the linguistic perspective. These two perspectives could be combined as they have been in Figure 8. This model demonstrates the disparity between the two perspectives. Although achieving the same results, the processes are far from integrated.
Figure 8 Combined Perspective
This combined perspective does not accurately model the intent of the person who is typing text. When a person types a key on the keyboard, does he understand that this is mapping to an encoding? Perhaps, but his intent is to indicate the grapheme that he wants. By giving precedence to the grapheme, and relating all the other aspects to it, we get a better model for the integration of computational and linguistic perspectives of writing systems.
Figure 9 indicates an integrated perspective, which retains the original features of the linguistic perspective but couples each of the computational processes directly to the grapheme. This model more accurately represents the intent of the computational processes while providing the linguistic grounding for these processes.
Figure 9 New Perspective
In addition to providing a more accurate model of the intent of the computational processes, this new perspective has decoupled the computational processes, allowing for multiple instances of each of these computational mappings. This is the model that is used for the electronic writing system descriptions.
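One way to picture the grapheme-centered model is as a record in which each computational mapping hangs directly off the grapheme and may occur more than once. The Python sketch below is an assumption about how such a record could look, not the format used by the electronic descriptions; the class `Grapheme`, its field names, the keyboard layout name, and the font name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Grapheme:
    name: str
    linguistic_units: List[str]  # e.g. phonemes the grapheme expresses
    graphic_forms: List[str]     # contextual surface forms
    # Each computational mapping is keyed by name, so several can coexist.
    codes: Dict[str, bytes] = field(default_factory=dict)     # encoding -> coded character
    keystrokes: Dict[str, str] = field(default_factory=dict)  # keyboard -> typed keys
    glyphs: Dict[str, str] = field(default_factory=dict)      # font -> glyph identifier


SIGMA = Grapheme(
    name="sigma",
    linguistic_units=["s"],
    graphic_forms=["σ", "ς"],
    codes={enc: "σ".encode(enc) for enc in ("utf-8", "iso8859_7")},
    keystrokes={"hypothetical-greek-layout": "s"},
    glyphs={"hypothetical-font": "sigma"},
)


def grapheme_for_code(inventory: List[Grapheme], encoding: str,
                      coded: bytes) -> Optional[Grapheme]:
    """Find the grapheme that a coded character stands for in one encoding."""
    for grapheme in inventory:
        if grapheme.codes.get(encoding) == coded:
            return grapheme
    return None
```

Because the two encodings are independent entries on the same grapheme, a process reading either one arrives at the same linguistic information without any mapping between the encodings themselves.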
Each of the relationships indicated by arrows in Figure 9 requires a description. Although some of these relationships involve a one-to-one mapping, often the relationships are more complex, requiring many-to-many mappings based on the context.
The best way to represent these relationships is to use rules that allow a many-to-many mapping that is context sensitive. Mapping rules have been used extensively in the study of writing systems as well as to represent the relationships between characters and linguistic units (Sproat 2000, Carney 1994, Derwing, Priestly and Rochet 1987, Haas 1983). Rules are commonly used in many areas of linguistic description and this method for representing rules should extend to other linguistic domains.
Joseph D. Becker (1984) described the need for regular rules to define the correspondences between characters and graphic forms. Gary F. Simons demonstrated that these rules can handle a wide variety of rendering problems: “a generalized implementation of writing systems would allow users to describe new writing systems by expressing the mapping from characters to graphs as rewrite rules. The computer could take over from there to compile that description into a very efficient implementation as a finite state transducer” (1989:545). A finite state transducer is among the simplest and most efficient computational devices.
The rules apply based on a precedence operation such that the more specific the rule, that is, the more context provided for the rule, the sooner it applies.
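The precedence principle can be sketched as follows, assuming that "more context" is measured simply by the combined length of the matched unit and its left and right contexts. This is an illustrative Python sketch, not the rule engine of the model: the class `Rule`, the function `apply_rules`, and the toy rule set are hypothetical, and contexts are restricted to literal strings rather than classes of units.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    source: str      # non-empty grapheme sequence being mapped
    target: str      # unit it maps to
    left: str = ""   # required preceding context (literal)
    right: str = ""  # required following context (literal)

    def specificity(self) -> int:
        # Assumption: more characters of context means a more specific rule.
        return len(self.left) + len(self.source) + len(self.right)

    def matches(self, text: str, pos: int) -> bool:
        end = pos + len(self.source)
        return (text.startswith(self.source, pos)
                and text[:pos].endswith(self.left)
                and text.startswith(self.right, end))


def apply_rules(text: str, rules: List[Rule]) -> List[str]:
    """Map text left to right, trying the most specific rules first."""
    ordered = sorted(rules, key=Rule.specificity, reverse=True)
    out, pos = [], 0
    while pos < len(text):
        for rule in ordered:
            if rule.matches(text, pos):
                out.append(rule.target)
                pos += len(rule.source)
                break
        else:
            out.append(text[pos])  # no rule applies: pass the unit through
            pos += 1
    return out


# Hypothetical toy rules, loosely modeled on English <c>.
RULES = [
    Rule("c", "k"),
    Rule("c", "s", right="e"),
    Rule("c", "s", right="i"),
    Rule("ch", "tʃ"),
]

print(apply_rules("cache", RULES))
```

The general rule for ⟨c⟩ acts as a default; it applies only where none of the more specific rules matches.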
Two special sets of rules have been defined. One is for ordering relationships, which define an “alphabetic” order for the graphemes. The other is for graphotactic constraints, which define constraints on the co-occurrence of graphemes.
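Both kinds of special rule can be illustrated with a small Python sketch. The grapheme inventory, its order, and the forbidden pairs below are invented for the example and describe no real language; the function names are likewise hypothetical. The sketch assumes that text is divided into graphemes by longest match, so that ⟨n⟩ followed by ⟨g⟩ is always read as the digraph ⟨ng⟩.

```python
from typing import List, Tuple

# Hypothetical ordered inventory: the digraph <ng> sorts after <n>.
ORDER = ["a", "b", "d", "e", "g", "i", "k", "n", "ng", "o", "t", "u"]
RANK = {grapheme: index for index, grapheme in enumerate(ORDER)}

# Hypothetical graphotactic constraints: pairs that may not be adjacent.
FORBIDDEN_PAIRS = {("ng", "ng"), ("k", "g")}


def split_graphemes(word: str) -> List[str]:
    """Divide a word into graphemes, preferring the longest match."""
    units = sorted(ORDER, key=len, reverse=True)
    out, pos = [], 0
    while pos < len(word):
        for unit in units:
            if word.startswith(unit, pos):
                out.append(unit)
                pos += len(unit)
                break
        else:
            raise ValueError(f"no grapheme matches at position {pos} in {word!r}")
    return out


def sort_key(word: str) -> List[int]:
    """Key that sorts words by the declared grapheme order."""
    return [RANK[g] for g in split_graphemes(word)]


def violations(word: str) -> List[Tuple[str, str]]:
    """Return adjacent grapheme pairs that break a graphotactic constraint."""
    units = split_graphemes(word)
    return [pair for pair in zip(units, units[1:]) if pair in FORBIDDEN_PAIRS]


print(sorted(["nga", "nu", "na"], key=sort_key))
```

Sorting by code point would place "nga" before "nu"; sorting by the declared order places it last, because ⟨ng⟩ is a single unit that follows ⟨n⟩.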
While rules are used to express relationships, we still need to declare and reference the components that must be related. For a writing system description, we naturally need to declare the graphemes. In addition, there may be writing system units other than the grapheme that fit into a writing system hierarchy and need to be described. Figure 10 indicates one such hierarchy where sentences contain phrases, phrases contain words, words contain syllables, syllables contain graphemes, and graphemes contain features. Since these are categories of analysis, the actual hierarchy would depend on the particular writing system and the theoretical analysis. Rewrite rules are used to define these units.
Figure 10 Hierarchy of Writing System Units
Rather than attempt to declare all the linguistic units, I assume that these have been defined elsewhere. A phoneme is better defined in a phonology description than in a writing system description. Therefore, what is required of the writing system description is a mechanism to link between that definition and any reference we might make within the description.
The computational units, that is, the typing characters, coded characters, and glyph identifiers, are referenced by value (or optionally by name in the case of the glyph identifier).
Sometimes we want to allow a choice between different units, or to define optional or repeated units. This is accomplished with the general-purpose group, which can define either a sequence of units or a choice between units. In addition, groups can exclude units.
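A general-purpose group of this kind could be sketched as below. This is a hypothetical Python illustration, not the notation of the model: the class `Group`, its field names, and the toy syllable structure are assumptions, graphemes are taken to be plain strings, and matching is delegated to Python regular expressions. Exclusion is interpreted here as "the group may not begin with an excluded unit".

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Group:
    kind: str                            # "sequence" or "choice"
    members: List[Union[str, "Group"]]   # graphemes or nested groups
    at_least: int = 1                    # 0 makes the group optional
    at_most: Optional[int] = 1           # None allows unbounded repetition
    exclude: Tuple[str, ...] = ()        # units the group may not begin with


def to_regex(unit: Union[str, Group]) -> str:
    """Compile a grapheme or group into a regular expression fragment."""
    if isinstance(unit, str):
        return re.escape(unit)
    parts = [to_regex(member) for member in unit.members]
    body = "|".join(parts) if unit.kind == "choice" else "".join(parts)
    if unit.exclude:
        banned = "|".join(re.escape(x) for x in unit.exclude)
        body = f"(?!(?:{banned}))(?:{body})"
    if (unit.at_least, unit.at_most) == (1, 1):
        return f"(?:{body})"
    upper = "" if unit.at_most is None else str(unit.at_most)
    return f"(?:{body}){{{unit.at_least},{upper}}}"


def matches(unit: Group, text: str) -> bool:
    return re.fullmatch(to_regex(unit), text) is not None


# Hypothetical toy structure: (C) V (C), where the final C may not be <s>.
CONSONANT = Group("choice", ["p", "t", "k", "m", "n", "s"])
VOWEL = Group("choice", ["a", "i", "u"])
ONSET = Group("choice", [CONSONANT], at_least=0)
CODA = Group("choice", [CONSONANT], at_least=0, exclude=("s",))
SYLLABLE = Group("sequence", [ONSET, VOWEL, CODA])
WORD = Group("sequence", [SYLLABLE], at_most=None)

print(matches(SYLLABLE, "pat"), matches(SYLLABLE, "pas"))
```

Nesting groups inside groups is what lets a single construct express the higher units of a hierarchy, such as syllables built from graphemes and words built from syllables.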
This paper has presented the need for writing system descriptions as integral elements to the documentation of any written resource. A general model for writing system descriptions has been introduced, based on the integration of linguistic and computational notions of writing systems. This model is formalized and electronic and thus can be used by computational processes. The system for electronic writing system description is being contributed to the web-based infrastructure of linguistic description and documentation both as a resource for documenting how texts are electronically encoded, as well as a descriptive resource in its own right.
Becker, Joseph D. 1984. “Multilingual word processing.” Scientific American Volume 251. Issue 1. pp. 96–107.
Birnbaum, David J., Mavis Cournane and Peter Flynn. 1999. “Using the TEI Writing System Declaration (WSD).” Computers and the Humanities Volume 33. Issue 1–2. pp. 49–57.
Butt, David. 1996. “Theories, Maps and Descriptions: An Introduction.” in Hasan, Ruqaiya, Carmel Cloran and David Butt, eds. Functional Descriptions: Theory in Practice Amsterdam, Philadelphia: John Benjamins Publishing. Current Issues in Linguistic Theory Volume 121. pp. xv–xxxv.
Carney, Edward. 1994. A Survey of English Spelling London and New York: Routledge.
Daniels, Peter T. and William Bright, eds. 1996. The World's Writing Systems New York: Oxford University Press.
Derwing, Bruce L., Tom M. S. Priestly and Bernard L. Rochet. 1987. “The Description of Spelling-to-sound Relationships in English, French and Russian: Progress, Problems and Prospects.” in Luelsdorff, Philip A., ed. Orthography and Phonology Amsterdam, Philadelphia: John Benjamins Publishing. pp. 31–52.
Gelb, I. J. 1963. A Study of Writing Revised edition, first published 1952. Chicago: University of Chicago Press.
Haas, William. 1983. “Determining the Level of a Script.” in Coulmas, Florian and Konrad Ehlich, eds. Writing in Focus Berlin, New York and Amsterdam: Mouton. Trends in Linguistics. Studies and Monographs Volume 24. pp. 15–29.
Halliday, M. A. K. 1996. “On Grammar and Grammatics.” in Hasan, Ruqaiya, Carmel Cloran and David Butt, eds. Functional Descriptions: Theory in Practice Amsterdam and Philadelphia: John Benjamins Publishing. Current Issues in Linguistic Theory Volume 121. pp. 1–38.
Henderson, Leslie. 1984. “Writing Systems and Reading Processes.” in Henderson, Leslie, ed. Orthographies and Reading: Perspectives from Cognitive Psychology, Neuropsychology, and Linguistics London, Hillsdale, NJ: Lawrence Erlbaum Associates. pp. 11–24.
Simons, Gary and Steven Bird. 2000. RFC: Requirements on the Infrastructure for Digital Language Documentation and Description Draft 14 November 2000. http://www.ldc.upenn.edu/exploration/expl2000/requirements.html.
Simons, Gary F. 1989. “The Computational Complexity of Writing Systems.” in Brend, Ruth M. and David G. Lockwood, eds. The Fifteenth LACUS Forum 1988 Lake Bluff, IL: LACUS. pp. 538–553.
Sperberg-McQueen, C. M. and Lou Burnard, eds. 1999. Guidelines for Electronic Text Encoding and Interchange Revised Reprint. Chicago, Oxford: Text Encoding Initiative.
Sproat, Richard. 2000. A Computational Theory of Writing Systems Cambridge: Cambridge University Press. Studies in Natural Language Processing.