Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.
Abstract. For archived linguistic and language data, the Unicode™ standard is the best choice for character encoding. In building an infrastructure for online archives of linguistic and language data, there are some aspects of Unicode that need to be understood, and some outstanding issues regarding implementation of Unicode that need to be resolved. This paper will provide an explanation of those aspects of the standard that are of particular relevance, and will discuss the outstanding issues that require further consideration.
Two important purposes for published electronic data archives are to maintain valued information on a long-term basis, and to make that information available to a given audience. Corresponding to this, it is important that the encoding mechanisms in terms of which data are represented be documented as part of the metadata of the archive so that the data can be interpreted long after it was created. It is also important that these encoding mechanisms can be interpreted by the various software tools used by the consumers of the archive so that the archive is usable. This is true of character encoding as well as any other aspect of data encoding.
In the past, the information technology (IT) industry has provided various character encoding standards based on 8-bit text processing technologies. Some of these made use of multi-byte sequences to represent characters so as to support large character repertoires. All of these legacy standards, however, were characterized by being limited in their repertoires: any one of them could support the writing systems of only a relatively small number of languages, and the combined character repertoire of all of them was sufficient to support most of the world's major languages, but not much more. In particular, many symbols used for phonetic transcription and many characters used for writing lesser known languages are not supported in any of the many legacy industry standard encodings.
On the whole, linguists have dealt with this limitation in a resourceful manner by creating their own fonts, customizing them to support the particular character repertoires that they need. The result, however, has been hundreds if not thousands of different fonts using incompatible, non-standard encodings, and large quantities of linguistic data with undocumented encodings. Such data is meaningful only when accompanied by specific matching fonts.
The Unicode™ character encoding standard provides a solution to encoding problems that is ideally suited for the needs of archived language and linguistic data. In relation to long-term documentation needs, Unicode is a well-documented international standard that is expected to become the dominant character encoding standard within IT for the foreseeable future. It is also the default encoding that is assumed in certain important IT standards, such as XML. Because it is becoming the dominant standard, a wide variety of software products are being developed that support it. As a result, it will be usable by the widest group of users. Furthermore, it aims to support a universal character repertoire that is sufficient to represent the writing systems of all languages, modern and ancient, which makes it especially well suited to the needs of linguistic and language data. For these reasons, Unicode is, without question, the best choice for character encoding in a coordinated archive of linguistic and language data.
Simply deciding to use Unicode does not resolve all of the character encoding issues that need to be considered, however. In order to begin building an infrastructure for a cooperative online archive, in order to begin building the software tools for working with that archive, and in order to begin creating the content to populate that archive, there are some aspects of the Unicode standard that need to be understood, and some outstanding issues regarding how best to implement Unicode that remain to be resolved.
The purpose of this paper is to provide a brief tutorial on Unicode, focusing on those aspects that have particular relevance for the purposes of this workshop, and to describe the outstanding issues that need to be addressed. In some cases, I will provide recommendations as to how these issues might best be resolved, though some will remain for further consideration.
In explaining some of the details of Unicode, my intent is to keep the discussion on a non-technical level as much as possible. Some minimal amount of technical detail will be necessary, however. I do not assume that the reader already has any in-depth technical understanding of Unicode, though I do assume at least basic familiarity. For detailed technical information regarding the standard, the best source is the standard itself [TUS 3.0].
Unicode provides a means for encoding plain text character data. In a sense, then, it represents the lowest level in a three-level hierarchy of text data representation:
| Level | Function | Examples |
| Metadata | Documents the content of a file and the minimal information required in order for a process to begin interpreting the contents. | file names, file extensions (e.g. ".DOC"), MIME type |
| Document encoding | Encodes the higher-level structure of a document and the various streams of content that comprise it. | XML, RTF, PDF, MS Word binary format |
| Character encoding | Provides a digital representation of character data contained in the document. | ASCII, ISO 8859-1, Unicode |
(For purposes of the remainder of this discussion, I will assume document encoding is done in terms of a markup language such as XML.) In a text processing system, there must be components that deal with information on each level, and appropriate standards are necessary to define mechanisms for representation and processing on each level. Unicode does this only for the level of character encoding.
In practice, there is not always a clean division between the document and character encoding levels. This is especially true for markup languages, such as XML, that utilize character sequences as mechanisms for document encoding. Because of these mechanisms, some characters require special representation when used within the textual content of a document that is encoded using such a markup language. For XML, these issues are documented in the XML specification [XML].
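The need for special representation can be illustrated with a short Python sketch. It uses the standard-library helper `xml.sax.saxutils.escape`; the sample strings are invented for illustration and are not taken from any archive.

```python
from xml.sax.saxutils import escape

# Characters that XML reserves for markup must be escaped in text content.
text = "if a < b & b > c"
escaped = escape(text)
print(escaped)

# A character outside a given encoding's repertoire can be written as a
# numeric character reference (here: U+01A1 in an ASCII-only file).
ncr = "\u01a1".encode("ascii", "xmlcharrefreplace")
print(ncr)
```

The first call rewrites the three markup characters as entity references; the second shows one way a non-ASCII character can survive in an ASCII-encoded XML file.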
In certain situations, some aspects of the semantic content of a document could potentially be encoded either at the character level, in terms of particular character sequences, or at the document encoding level, using some form of markup. For example, mathematical formulas use textual characters, but can also involve operators that require certain visual control for presentation purposes. It may not always be clear whether such semantics are best represented in terms of character sequences or in terms of markup.
To consider an example more closely related to languages and linguistics, some writing systems involve bidirectional text, in which portions are written right-to-left while other portions are written left-to-right. Writing systems based on Arabic script are typical examples: alphabetic characters are written right-to-left, but numbers are written left-to-right (when considering the most significant digit first). Furthermore, a multilingual document might contain runs of Arabic text embedded within spans of text in another language that is written consistently left-to-right. In certain situations, the directional properties of characters are not sufficient to provide the exact level of control of the text direction that a user may require. As a result, it is necessary to have some encoding mechanism to control the directional behaviour. In principle, this could be done using either character sequences or markup. Indeed, both Unicode and HTML provide mechanisms for this very purpose, and mixing both in a single document can result in ambiguous encoding states.
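The directional properties mentioned above can be inspected with Python's standard `unicodedata` module. The following is only a sketch: the particular characters were chosen for illustration, and the values reported depend on the version of the Unicode data bundled with the Python interpreter.

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u0628": "Arabic letter beh",
    "1": "European digit",
    "\u0661": "Arabic-Indic digit one",
    "\u202b": "right-to-left embedding control",
}

# Each character carries a bidirectional category that a rendering
# process consults when ordering text for display.
bidi = {ch: unicodedata.bidirectional(ch) for ch in samples}
for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {bidi[ch]}")
```

The last entry is one of the Unicode control characters that can override the default behaviour, which is the character-level counterpart of directional markup.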
Some important cases in which Unicode control characters and markup control mechanisms may conflict are discussed in Draft Unicode Technical Report #20 [DUTR 20]. For the purpose of linguistic and language data archives, there may be special situations in which a choice needs to be made between character versus markup mechanisms for encoding particular aspects of semantic content. I will return to this matter in sections 3.1 and 3.2.
We have seen that Unicode as a character encoding standard interacts with other standards that relate to higher levels of text data representation. If we focus only on the level of character data processing and plain text, Unicode is again but one of a collection of technologies that interact.
It is useful in this regard to consider a five-part text-processing model, consisting of encoding, input, rendering, analysis and conversion.
In this model, encoding corresponds to the memory representation and storage of text. Input and rendering correspond to the most fundamental of text-based processes: generating the text (typically using a keyboard), and viewing/displaying the text. Analysis represents a collection of many secondary processes for working with text: sorting, case mapping, hyphenation, morphological parsing, etc. Conversion represents a supporting process of transforming data between one character encoding and another.
The encoding component in this model has central importance since the processes in each of the other components are implemented in terms of the character encoding. If the character encoding is to be Unicode, then the various text processes must be implemented in terms of Unicode.
This has significant implications for text archives that use Unicode: it is necessary to have keyboards that generate Unicode-encoded data. Likewise, it is necessary to have fonts that are based on Unicode encoding. (The use of Unicode also introduces rendering requirements that relate to complex script support. This is discussed further in section 2.3.) Similarly, applications that process text and any supporting components for analytical text processes must be implemented using Unicode. In addition, for as long as users need to work with text encoded in legacy encodings, they require tools for encoding conversion that can map between Unicode and other encodings.
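As a rough sketch of what such an encoding conversion tool does, the following Python function maps bytes from a custom 8-bit font encoding to Unicode. The mapping table is entirely hypothetical; a real converter would use the documented code page of the actual legacy font.

```python
# Hypothetical mapping for a custom 8-bit font encoding
# (byte value -> Unicode string). These assignments are invented.
LEGACY_TO_UNICODE = {
    0x80: "\u014b",   # eng
    0x81: "\u0254",   # open o
    0x82: "c\u0303",  # c with tilde, as base letter + combining mark
}

def legacy_to_unicode(data: bytes) -> str:
    """Convert legacy-encoded bytes to a Unicode string."""
    out = []
    for b in data:
        if b in LEGACY_TO_UNICODE:
            out.append(LEGACY_TO_UNICODE[b])
        elif b < 0x80:
            out.append(chr(b))  # assume the lower half is plain ASCII
        else:
            raise ValueError(f"unmapped legacy byte 0x{b:02X}")
    return "".join(out)

print(legacy_to_unicode(b"si\x80"))
```

Raising an error on unmapped bytes, rather than guessing, keeps undocumented code points from silently entering an archive.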
It is important to see Unicode in the broader context of a complete text processing model. Practical implementation using Unicode requires all of these other pieces of the overall puzzle to be in place.
It is often assumed that Unicode is a uniform-width 16-bit encoding. While this was part of the early design goals, the practical requirements of implementation necessitated a multi-tiered character encoding model. This character model is described in detail in Unicode Technical Report #17 [UTR 17]. I will describe here only the relevant details.
At the most basic level, Unicode defines a set of characters, for example LATIN SMALL LETTER O WITH HORN. This set is referred to as an abstract character repertoire. At the next level, a coded character set, each character is assigned a unique integer value. For Unicode characters, these integers are referred to as Unicode scalar values, and are usually cited using the hexadecimal-based notation U+xxxx, for example U+01A1 LATIN SMALL LETTER O WITH HORN. Unicode scalar values come from a possible range of U+0000 to U+10FFFF, with space for over 1 million characters.
At this point, it is important to understand that the encoded representation of a Unicode character is not necessarily the same as its Unicode scalar value. The next level in the model involves mapping Unicode scalar values into computer data types of a fixed size, and three such mappings have been defined: one using 8-bit (byte) values, one using 16-bit values, and one using 32-bit values. This level in the model is referred to as character encoding form, and the three encoding forms defined by Unicode are known as UTF-8, UTF-16, and UTF-32.
In UTF-32, Unicode scalar values are mapped by identity to 32-bit integer code units of the same value. 8- and 16-bit code units cannot directly represent over a million characters, however. Therefore, the mappings into UTF-8 and UTF-16 are more complex, using sequences of code units. The UTF-16 encoding form uses sequences of one or two 16-bit code units to represent a Unicode scalar value, while UTF-8 uses sequences of one to four 8-bit code units to represent a Unicode character.
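The difference between the three encoding forms can be seen by counting code units for a few characters. This is a minimal Python sketch using the standard codecs; the helper name `code_units` and the sample characters are chosen for illustration only.

```python
def code_units(ch: str) -> dict:
    """Number of code units needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }

# An ASCII letter, o with horn, a Devanagari letter, and a scalar value
# above U+FFFF.
for ch in ["o", "\u01a1", "\u0915", "\U00010330"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the last character, whose scalar value lies above U+FFFF, needs two 16-bit code units in UTF-16 and four bytes in UTF-8.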
The reason for having multiple encoding forms is one of practicality in implementation. Originally, only a 16-bit encoding form was envisioned, but an 8-bit form was needed for integration with existing 8-bit implementations, such as Unix file systems. Each encoding form has different advantages over the others, and each is considered preferable in different contexts.
Because there are three different encoding forms to choose from, it will be necessary to make choices regarding which to use in the context of online data archives. I will return to this in section 3.4.
It is important to understand that Unicode distinguishes between characters and glyphs. This distinction is analogous to that between a morpheme and an allomorph. A character is a unit of textual information. It carries semantic information and properties, and also has a typical or nominal shape associated with it. A glyph, on the other hand, is a particular image that is used to represent a character.
There is not always a one-to-one relationship between characters and glyphs. A single character may appear in different shapes in different contexts (analogous to phonemes and allophones). So, for example, most Arabic letters appear in one of four shapes according to their position within a word: initial, medial, final or isolate.
[Table: an Arabic character and its four contextual glyphs (initial, medial, final, isolated); images not reproduced.]
In some cases, different characters can also be displayed using the same glyph. For instance, GREEK CAPITAL LETTER ALPHA (U+0391) can be displayed using the same glyph as LATIN CAPITAL LETTER A (U+0041), within certain limitations on the typeface (for example, a Fraktur glyph would not be appropriate for Greek alpha).
Also, a sequence of characters may merge into a single glyph; such glyphs are known as ligatures or conjuncts (analogous to portmanteau morphs). So, for example, the Devanagari syllable "ksha" is composed of a sequence of three characters, but is presented using a single glyph.
| Character sequence | Glyph |
| < U+0915 DEVANAGARI LETTER KA, U+094D DEVANAGARI SIGN VIRAMA, U+0937 DEVANAGARI LETTER SSA > | क्ष |
There can also be cases in which a single character corresponds to multiple glyphs (analogous to discontinuous morphemes). So, for example, the character BENGALI VOWEL SIGN O (U+09CB) corresponds to a pair of glyphs, one before and one following the character for the syllable-initial consonant.
| Character sequence | Glyph sequence |
| < U+0995 BENGALI LETTER KA, U+09CB BENGALI VOWEL SIGN O > | কো |
In general, the relationship between characters and glyphs is many-to-many.
As a rule, Unicode encodes characters but not glyphs. For reasons of backward compatibility with existing encoding standards, it was necessary to make some exceptions to this rule. So, for instance, the contextual forms required for Arabic presentation are all directly encoded in Unicode because they had been directly encoded in one or more legacy industry standards. These presentation forms are convenient for rendering, but they make many other processes more complicated and less efficient. Thus, even though these Arabic presentation forms are encoded in Unicode, their use is discouraged.
Linguists are typically very familiar with the notion of directly encoding various presentation forms of characters. This was necessary in the past because software did not provide any other mechanisms for rendering text where complex script behaviours were involved. So, for example, the SIL IPA93 fonts directly encode four positional variants of the acute accent (o- and i-width overstrikes for both lower and upper case base characters). But linguists also want to perform analytical processes on their phonetic transcription data, and these presentation-form distinctions make that more difficult.
Unicode assumes that distinctions of this sort that pertain to rendering do not belong in the character encoding. The standard was designed with the assumption that it would be implemented in software that would address the needs of rendering entirely within the rendering process. This requires what is often referred to as smart font rendering technologies. Such technologies are covered in other papers from this workshop, so I will not elaborate on them here.
The significance of the character-glyph model for purposes of linguistic and language data archives is this: it is recommended that data be encoded without directly encoding presentation forms. This assumes that those who contribute to and make use of the archives will have access to software that provides the necessary rendering support. This is a concern in the short term, but such technologies are beginning to become available, and are expected to be widely available within a few years.
In order to use Unicode as the character encoding for linguistic and language data archives, those creating the content that will go into the archive need to know how to encode the characters in their writing systems in Unicode. In this regard, it is important to understand that the Unicode character repertoire is not based on a direct representation of orthographies. Rather, it uses a slightly abstract notion of characters that is language- and writing-system neutral. Unicode uses the notion of abstract character, which is defined as a "unit of information used for the organization, control, or representation of textual data." [TUS 3.0, p. 40.] Thus, a grapheme that constitutes a functional unit within the writing system and orthography of some language will not necessarily correspond to a single, distinct Unicode character. In many cases, a grapheme may be represented by a combination of the representational units from Unicode's abstract character repertoire.
This distinction is important to understand. New users have often concluded that Unicode does not support certain characters that they need without understanding that these characters are supported in some way that the user wasn't aware of.
First, in this regard, Unicode allows dynamic and productive use of combining marks. For example, suppose the orthography for some language includes a letter "c with tilde". Unicode contains many precomposed Latin letter-with-diacritic combinations, but not this particular combination. Nevertheless, this grapheme is supported in Unicode using a character sequence:
| Grapheme | Unicode character sequence |
| c̃ | < U+0063 LATIN SMALL LETTER C, U+0303 COMBINING TILDE > |
As mentioned, use of combining marks is productive. It is possible to represent multiple diacritics, if needed:
| Hypothetical combination | Unicode character sequence |
| c̤̪̃̆́ | < U+0063 LATIN SMALL LETTER C, U+0324 COMBINING DIAERESIS BELOW, U+032A COMBINING BRIDGE BELOW, U+0303 COMBINING TILDE, U+0306 COMBINING BREVE, U+0301 COMBINING ACUTE ACCENT > |
Secondly, many requests have been made to encode digraph characters on the basis that "this is a separate letter in our language and has its own place in the sort order." A familiar example would be Spanish "ch": both "c" and "h" occur as separate graphemes in Spanish orthography, but "ch" is traditionally treated as a separate letter, and sorts between "c" and "d" (i.e. "cha" would sort after "cu" rather than before "ci"). Sorting is a text process that relies on but is distinct from encoding, however. It has been possible for years to sort Spanish words in traditional order without needing to encode a distinct "ch" character.
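The point that sorting is independent of encoding can be sketched in Python. The sort key below treats "ch" as a single collation unit without any dedicated "ch" character. The alphabet table is a simplified assumption (it also treats "ll" as a letter and ignores accented vowels), and the function name is hypothetical.

```python
import re

# Simplified traditional Spanish alphabet; digraphs are listed as letters.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

# Try the digraphs first, then fall back to any single character.
TOKEN = re.compile(r"ch|ll|.", re.DOTALL)

def traditional_key(word: str) -> list:
    """Sort key that ranks 'ch' after 'c' and 'll' after 'l'."""
    return [RANK.get(tok, len(ALPHABET)) for tok in TOKEN.findall(word.lower())]

words = ["cuna", "chico", "dedo", "cima"]
print(sorted(words))                       # code point order
print(sorted(words, key=traditional_key))  # traditional order
```

The data itself contains only ordinary "c" and "h" characters; the traditional ordering lives entirely in the comparison process.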
It is important that these matters be understood not only by those creating the content that makes up an archive, but also by those developing software for use with the archived data. For example, developers need to understand that counting "characters" (in the orthographic sense) is not merely a matter of counting Unicode characters. In general, there is a many-to-many relationship between Unicode characters and graphemes, and the relationship is potentially language-dependent.
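The gap between counting Unicode characters and counting orthographic units can be illustrated with a rough Python sketch. The helper below simply skips combining marks; it is a crude approximation introduced for illustration, not a full grapheme segmentation algorithm, and it knows nothing about language-specific units such as digraphs.

```python
import unicodedata

def count_without_marks(text: str) -> int:
    """Count code points that are not marks (general category M*)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

word = "c\u0303a"  # c with tilde (base + combining tilde), then a
print(len(word))                  # number of Unicode characters
print(count_without_marks(word))  # closer to what a reader would count
```

A language-aware count would need further rules on top of this, since the relationship between characters and graphemes can differ from one orthography to another.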
A third potential problem in recognizing how characters are to be encoded results from not understanding the intention of the character charts provided in the Unicode standard: the glyphs shown for each character are intended as representative glyphs. In some cases, two different languages can require different shapes for graphemes that would be represented using a single Unicode character. For example, two different shapes for upper case "eng" are preferred by different language communities in Papua New Guinea:
| Upper & lower "eng" (variant 1) | Upper & lower "eng" (variant 2) |
| (glyph image not reproduced) | (glyph image not reproduced) |
The character charts published by Unicode show only the second variant for the upper case eng. Someone looking for the first shape might conclude that a new character needs to be added to Unicode. Shape variations from one language to another are not a sufficient basis for character distinctions in Unicode, however. In both situations, the upper case letter is encoded in Unicode as U+014A LATIN CAPITAL LETTER ENG. The difference in glyphs is handled in the rendering process, possibly using different fonts or using language-based glyph selection (if supported by the software being used).
Finally, in a small number of instances, Unicode includes what might initially appear to be duplicate characters. In these cases, a user might easily choose the wrong character. For instance, the following characters have the same shape, but are different:
| ’ | U+2019 RIGHT SINGLE QUOTATION MARK |
| ʼ | U+02BC MODIFIER LETTER APOSTROPHE |
These two characters function differently: the first is a punctuation character, and is not word-forming, while the second functions as a word-forming letter. In Unicode, these two characters are distinguished by character semantics.
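The difference in character semantics can be observed directly from the character properties. The following Python sketch uses the standard `unicodedata` and `re` modules; the sample word is invented.

```python
import re
import unicodedata

quote = "\u2019"       # RIGHT SINGLE QUOTATION MARK
apostrophe = "\u02bc"  # MODIFIER LETTER APOSTROPHE

for ch in (quote, apostrophe):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), ch.isalpha())

# A word-matching pattern splits at the punctuation mark but not at the
# modifier letter.
split_at_quote = re.findall(r"\w+", "k" + quote + "aba")
kept_whole = re.findall(r"\w+", "k" + apostrophe + "aba")
print(split_at_quote, kept_whole)
```

Although the two characters can look identical on screen, a process that tokenizes words will treat them differently.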
Unicode character semantics refer not to how characters are interpreted linguistically in terms of the phonology of a given language, but to how characters relate to other characters and to how they behave in relation to text processing. Unicode defines a variety of properties for every character, many of which are provided to control how characters behave in relation to processes such as line breaking. These properties are an intrinsic part of Unicode characters, and to a large extent, it is the semantic properties that define a Unicode character.
It is important to understand in this that Unicode specifies character properties of two types: normative and informative. Normative properties are a formal part of the definition of the standard, and are mandatory for implementations that follow the standard. Informative properties, on the other hand, are helpful guides, but are not required to be followed, in some cases because they are not appropriate for every situation. For example, U+0069 LATIN SMALL LETTER I has the normative property of being a lower case letter. Another property for this character is an upper-case mapping to U+0049 LATIN CAPITAL LETTER I, but that property is an informative property.
| Character | U+0069 LATIN SMALL LETTER I |
| Category (normative) | lowercase letter |
| Upper case mapping (informative) | U+0049 LATIN CAPITAL LETTER I |
The reason is that different case mappings may be required for particular languages. Such is true in the instance of Turkish, for which the lower-case dotted "i" maps to an upper-case dotted "i", U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
| Typical case mapping | Turkish case mappings |
| i → I | i → İ (U+0130), ı (U+0131) → I |
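The language dependence of case mapping can be sketched in Python. The built-in `str.upper()` applies the default, language-neutral mapping, so a Turkish-specific mapping has to be layered on top. The function name `upper_turkish` is hypothetical and the sketch handles only the two letters at issue.

```python
def upper_turkish(text: str) -> str:
    """Upper-case text using the Turkish mappings for dotted and dotless i."""
    # Map the two special letters first, then apply the default mapping
    # to everything else.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

word = "diyarbak\u0131r"
print(word.upper())         # default mapping: dotted i becomes I
print(upper_turkish(word))  # Turkish mapping: dotted i becomes U+0130
```

Because the mapping depends on the language of the text, a process cannot pick the correct one from the character data alone; it needs to be told which language it is handling.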
Informative properties have secondary importance in defining a character. Thus, if a user is evaluating Unicode characters to determine how to represent particular graphemes, informative properties can help them to understand the typical use of a Unicode character, but if these properties do not match the behaviours of a given grapheme, that does not mean that this character cannot be used to represent the grapheme. The user should look particularly closely at the normative properties, however: if a normative property does not match the behaviour of a given grapheme, then that indicates that a different Unicode character is needed. (For more information on Unicode character properties, see chapter 4 of [TUS 3.0].)
We saw earlier that shape was not a sufficient basis for distinguishing Unicode characters, and now we have seen that it is also not necessary for distinguishing characters. A character distinction can be justified if two candidates are used and distinguished within a given writing system, or if they differ in their normative character properties, whether or not they co-occur in any single writing system.
We have seen that Unicode allows for dynamic and productive composition of characters using combining marks. It was also mentioned that Unicode aims not to encode presentation forms. Taking these points together, one may find it surprising that Unicode includes a number of precomposed base character-plus-diacritic combinations. In principle, if a rendering system can deal with dynamic composition, then it is not necessary for any precomposed combinations to be directly encoded. For reasons of backward compatibility with existing industry standards, however, it was necessary to include a number of precomposed combinations in Unicode.
In section 2.3, it was mentioned that several Arabic presentation forms were also encoded in Unicode for backward compatibility with existing standards. There is a significant difference, however, between an Arabic presentation form, such as U+FEEA ARABIC LETTER HEH FINAL FORM, and a precomposed base-diacritic combination, such as U+00E1 LATIN SMALL LETTER A WITH ACUTE. Note that the Arabic presentation form is not unconditionally interchangeable with its counterpart, U+0647 ARABIC LETTER HEH. For example, ARABIC LETTER HEH can occur in any word position, but not ARABIC LETTER HEH FINAL FORM. In contrast, the precomposed combination LATIN SMALL LETTER A WITH ACUTE is unconditionally interchangeable with its decomposed counterpart, the combination <LATIN SMALL LETTER A, COMBINING ACUTE ACCENT>. In cases in which two character sequences are unconditionally interchangeable in this manner, Unicode asserts that these sequences are fully synonymous and equivalent character representations.
In Unicode, this relationship of synonymy is referred to as canonical equivalence. This relationship is specified as a normative property of characters. For example, the character U+00E1 LATIN SMALL LETTER A WITH ACUTE has as one of its normative properties the canonical decomposition mapping to the sequence <U+0061 LATIN SMALL LETTER A, U+0301 COMBINING ACUTE ACCENT>. Because of this, U+00E1 is canonically equivalent to the sequence < U+0061, U+0301 >, and the two representations are considered to be fully synonymous and equivalent.
In a small number of instances, a Unicode character has a canonical decomposition of another apparently identical character. For example, U+037E GREEK QUESTION MARK has a canonical decomposition of U+003B SEMICOLON. Apart from the name and decomposition, these characters are identical. As with the previous cases, these have been included in Unicode to maintain backward compatibility with source industry standards in which these character distinctions existed.
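The canonical decomposition mappings described above can be read from the character database. This is a small Python sketch using the standard `unicodedata` module, which returns the mapping as a string of hexadecimal code points (empty when a character has no decomposition).

```python
import unicodedata

a_acute = unicodedata.decomposition("\u00e1")         # LATIN SMALL LETTER A WITH ACUTE
greek_question = unicodedata.decomposition("\u037e")  # GREEK QUESTION MARK
plain_a = unicodedata.decomposition("a")              # no decomposition

print(repr(a_acute), repr(greek_question), repr(plain_a))
```

The first mapping is a base letter plus a combining mark; the second is a singleton mapping to another single character.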
Given that Unicode allows for a text to have alternate, equivalent representations, there is a potential problem for processes that do any type of string comparison. For example, a user may be searching for LATIN SMALL LETTER A WITH ACUTE whereas the document may contain instances of the combination < LATIN SMALL LETTER A, COMBINING ACUTE ACCENT>. In order to deal with this problem, Unicode defines certain normalization forms that remove these artificial distinctions.
As described in Unicode Standard Annex #15 [UAX 15], artificial distinctions related to canonical equivalence are handled using either of two normalization forms: normalization form D (NFD) and normalization form C (NFC). The normalization for NFD specifies that the text be transformed into its maximally decomposed representation, so that no remaining character has a canonical decomposition. NFC is conceptually the opposite: text in normalization form C is represented using precomposed characters as much as possible. (See [UAX 15] for details regarding how each normalization form is generated.) Both normalization forms provide unique representations for a text—for any string, there is only one decomposed normal form and one precomposed normal form for that string. Thus, either can serve as a reference point for comparison with other strings.
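The search problem and its resolution through normalization can be sketched as follows in Python. The helper name `contains_normalized` and the sample text are assumptions made for illustration.

```python
import unicodedata

def contains_normalized(haystack: str, needle: str, form: str = "NFD") -> bool:
    """Substring test after bringing both strings to the same normalization form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))

document = "ma\u0301s"  # contains <a, COMBINING ACUTE ACCENT>
query = "\u00e1"        # LATIN SMALL LETTER A WITH ACUTE

print(query in document)                     # raw comparison
print(contains_normalized(document, query))  # comparison after normalization
```

Either NFD or NFC would serve here, provided both strings are brought to the same form before they are compared.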
When multiple combining marks co-occur on a base character, the combining marks may or may not interact typographically with one another. For example, if two marks occur above the base character, they must be positioned in some order relative to one another, and these differences in relative position can affect the meaning of the text. In Unicode, the relative order would be encoded in terms of the order of the character codes in the file:
| ẫ | < U+0061 LATIN SMALL LETTER A, U+0302 COMBINING CIRCUMFLEX ACCENT, U+0303 COMBINING TILDE > |
| ã̂ | < U+0061 LATIN SMALL LETTER A, U+0303 COMBINING TILDE, U+0302 COMBINING CIRCUMFLEX ACCENT > |
In contrast, if two combining marks do not interact typographically, there is no possibility of changing the meaning of the text by changing relative positioning. Nevertheless, the two combining characters must come in some relative order in the data stream. Thus, we have another situation in which different character representations can mean the same thing.
Processes need to know whether or not the relative ordering of combining characters within a data stream has any semantic significance. This is handled by assigning every combining mark to a particular combining class, which is a normative property of characters. In situations where the relative ordering of combining characters is not semantically significant, the artificial distinction reflected in the ordering difference is eliminated as part of the normalization process. This is true for each of the normalization forms defined by Unicode.
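The role of combining classes can be observed with a brief Python sketch using the standard `unicodedata` module. Marks in different classes are put into a fixed order by normalization, while marks in the same class keep the order in which they were entered.

```python
import unicodedata

# Combining classes: dot below attaches below the base, the others above.
classes = {f"U+{ord(m):04X}": unicodedata.combining(m)
           for m in "\u0323\u0306\u0302\u0303"}
print(classes)

# Different classes: the two orders are equivalent and normalize identically.
breve_then_dot = unicodedata.normalize("NFD", "a\u0306\u0323")
dot_then_breve = unicodedata.normalize("NFD", "a\u0323\u0306")

# Same class: the order is significant and is preserved.
circ_then_tilde = unicodedata.normalize("NFD", "a\u0302\u0303")
tilde_then_circ = unicodedata.normalize("NFD", "a\u0303\u0302")
print(breve_then_dot == dot_then_breve, circ_then_tilde == tilde_then_circ)
```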
Thus, a string may have a number of different equivalent representations that differ in terms of precomposed or decomposed representations and in ordering of combining characters. As mentioned above, however, there is a unique NFD representation, and a unique NFC representation. This is illustrated in the following example:
| Visual appearance | Ặ |
| Encoded character sequence | < U+1EA0 LATIN CAPITAL LETTER A WITH DOT BELOW, U+0306 COMBINING BREVE > |
| Alternate, equivalent representation | < U+0102 LATIN CAPITAL LETTER A WITH BREVE, U+0323 COMBINING DOT BELOW > |
| Second alternate, equivalent representation | < U+0041 LATIN CAPITAL LETTER A, U+0306 COMBINING BREVE, U+0323 COMBINING DOT BELOW > |
| Equivalent representation in NFD | < U+0041 LATIN CAPITAL LETTER A, U+0323 COMBINING DOT BELOW, U+0306 COMBINING BREVE > |
| Equivalent representation in NFC | U+1EB6 LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW |
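The equivalences in this table can be checked mechanically. The following Python sketch normalizes each of the listed representations and collects the distinct results; it assumes only the standard `unicodedata` module.

```python
import unicodedata

representations = [
    "\u1ea0\u0306",   # A WITH DOT BELOW + COMBINING BREVE
    "\u0102\u0323",   # A WITH BREVE + COMBINING DOT BELOW
    "A\u0306\u0323",  # A + BREVE + DOT BELOW
    "A\u0323\u0306",  # A + DOT BELOW + BREVE
    "\u1eb6",         # A WITH BREVE AND DOT BELOW
]

# Every representation should collapse to a single NFD and a single NFC string.
nfd_forms = {unicodedata.normalize("NFD", s) for s in representations}
nfc_forms = {unicodedata.normalize("NFC", s) for s in representations}
print(nfd_forms, nfc_forms)
```

Five different code point sequences thus reduce to one decomposed and one precomposed form, either of which can be used as the reference for comparison.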
Since it is possible for text to be represented in several different but equivalent ways, this raises an issue for data held in online archives. This matter will be discussed further in section 3.3.
While an Arabic presentation form such as U+FEEA ARABIC LETTER HEH FINAL FORM is not unconditionally interchangeable with the corresponding character U+0647 ARABIC LETTER HEH, there clearly is a close relationship. If the presentation form were to be replaced with the other character, there is some minor meaning lost that pertains to rendering, but there is also important meaning that is retained. Note that the information that relates to rendering would be redundant if adequate rendering support is available, but would be necessary otherwise. Here we have a case of near synonymy: in many situations, a user would not see any need to distinguish the two characters, but in some situations the distinction may be important.
There are 2,172 characters of this sort in Unicode: near-synonyms of other characters. These characters were all included for purposes of backward compatibility with source industry standards and are known as compatibility characters. The relationship between a compatibility character and its near synonym is captured in Unicode in the same manner as canonical equivalence: a normative decomposition mapping is provided for compatibility characters. This type of decomposition is known as a compatibility decomposition, as opposed to a canonical decomposition. Compatibility decompositions are distinguished from canonical decompositions in that the mapping includes some kind of non-character information, as shown in the following examples:
| Compatibility character | Decomposition |
| U+FEEA ARABIC LETTER HEH FINAL FORM | <final> 0647 |
| U+FF41 FULLWIDTH LATIN SMALL LETTER A | <wide> 0061 |
| U+00B9 SUPERSCRIPT ONE | <super> 0031 |
| U+02B0 MODIFIER LETTER SMALL H | <super> 0068 |
| U+210E PLANCK CONSTANT | <font> 0068 |
| U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING | <compat> 0061 02BE |
| U+0E33 THAI CHARACTER SARA AM | <compat> 0E4D 0E32 |
In the case of Arabic presentation forms, the non-character information has to do with contextual rendering behaviour. This is not always the case, however, as can be seen in the examples above. As with the Arabic presentation forms, the non-character information will not be important in many situations, but in certain situations it may be considered important.
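The decomposition mappings listed above are published in the Unicode character database and can be read programmatically. As a rough sketch, Python's standard `unicodedata.decomposition` returns the mapping as a string, with the tag (the non-character information) in angle brackets for compatibility decompositions and no tag for canonical ones. The helper name `split_decomposition` is hypothetical.

```python
import unicodedata

def split_decomposition(ch):
    """Return (tag, code points) for the decomposition mapping of ch.

    tag is None for a canonical decomposition (or when there is no
    decomposition at all), and a string such as "<super>" for a
    compatibility decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = fields.pop(0) if fields and fields[0].startswith("<") else None
    return tag, [int(field, 16) for field in fields]

# The compatibility characters from the table, plus one canonical case.
for ch in "\uFEEA\uFF41\u00B9\u02B0\u210E\u1E9A\u0E33\u1EB6":
    tag, mapping = split_decomposition(ch)
    kind = tag or "canonical"
    print(f"U+{ord(ch):04X}", kind, " ".join(f"{cp:04X}" for cp in mapping))
```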
When text is normalized into normalization forms D and C, the distinction between compatibility characters and their near synonyms is retained. Unicode provides two other normalization forms, however, in which these distinctions are eliminated: normalization form KD (NFKD) is a decomposed form without compatibility characters, and normalization form KC (NFKC) is a precomposed form without compatibility characters. As with forms NFD and NFC, any string has a unique NFKD or NFKC representation which can be used as a reference for comparison with other strings.
Since forms NFKD and NFKC do not contain any compatibility characters, normalizing a string into one of these forms will result in some loss of information unless that information is retained in some other way, such as markup or as out-of-band information. It is important that these normalization forms be used with care so as not to lose information that may be important in a given situation. This has some relevance for archived linguistic data. This matter will be considered further in section 3.1.
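The loss of information described here is easy to observe. The sketch below (again assuming Python's `unicodedata`; the sample strings are invented for illustration) compares NFC, which leaves compatibility characters alone, with NFKC, which replaces them by their near synonyms.

```python
import unicodedata

# Illustrative samples only; the labels are not from any standard.
samples = {
    "aspirated p": "p\u02B0",           # p + MODIFIER LETTER SMALL H
    "superscript digits": "ma\u00B2\u00B9",  # SUPERSCRIPT TWO, SUPERSCRIPT ONE
    "final heh": "\uFEEA",              # ARABIC LETTER HEH FINAL FORM
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    # NFC keeps the compatibility characters; NFKC folds them away.
    print(f"{label}: unchanged by NFC={nfc == text}, unchanged by NFKC={nfkc == text}")
```

After NFKC, the aspirated stop is indistinguishable from the plain sequence "ph" unless the superscripting has been recorded somewhere else.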
In some situations, users may find that the characters they need are not supported in the current version of Unicode. This would be true in the case of less common scripts that have not yet been added to the standard, such as Vai, Silheti Nagri or Lanna, or in the case of uncommon characters from scripts that are already well covered in Unicode. There can also be situations in which a character is not a candidate for addition to Unicode, at least at the present time. This would be the case, for example, with novel Ethiopic characters that are being considered for use in a particular language community but that are not yet stabilized in usage. Whenever characters are candidates for addition to Unicode, the Unicode Consortium welcomes having input into the further development of the standard in the form of proposals for new characters. Processing such requests does take time, however, especially if sufficient information about the proposed characters is slow in being provided. In all of these situations, users may have a need to create custom character assignments. Unicode reserves space in its encoding known as Private Use Area (PUA) ranges specifically for this purpose.
There are two PUA ranges. The first is in the so-called basic multilingual plane (BMP), from U+E000 to U+F8FF. The second is in the so-called supplementary planes, from U+F0000 to U+10FFFF (planes 15 and 16). It is guaranteed that no characters will ever be assigned in these ranges as part of the standard, and individual users or corporations are free to use any of the codepoints in these ranges for proprietary purposes.
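An archive tool may want to flag private-use codepoints so that the accompanying documentation can be checked. The following is a minimal sketch; the function names are hypothetical. The supplementary ranges are written as ending at xxFFFD because the last two code points of planes 15 and 16 are noncharacters rather than private-use characters.

```python
import unicodedata

# Private-use ranges, inclusive.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
]

def is_private_use(ch):
    """True if the character lies in one of the private-use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

def find_private_use(text):
    """List (index, 'U+XXXX') for every private-use character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text)
            if is_private_use(ch)]

# Cross-check against the general category "Co" (private use).
probe = [0xDFFF + 1, 0xF8FF, 0xF900, 0xF0000, 0xFFFFD, 0x100000, 0x10FFFD]
agrees = all(is_private_use(chr(cp)) == (unicodedata.category(chr(cp)) == "Co")
             for cp in probe)
print(agrees, find_private_use("a\uE001b"))
```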
An important limitation of PUA characters is that their meaning is unknown apart from any specific, and separate, documentation. This is exactly the same problem that exists with custom fonts that use custom character sets. One particular concern is that applications may not have any way of knowing what PUA characters mean and so will not provide the behaviours and results that the user expects. These issues are discussed further in section 3.5.
There are some particular encoding issues that are relevant for establishing an online archive of language and linguistic data that require further consideration. These have to do with the representation of tone in phonetic and phonemic transcription, with superscript letters for phonetic and phonemic transcription, and with the use of compatibility characters in general. I'll consider each of these in turn.
In phonetic and phonemic transcription, IPA [IPA, pp. 13ff] allows for the use of either tone letters or tone diacritics. All of the tone letters are currently supported in Unicode. Note that tone contours are represented by sequences of these characters, with the assumption that a "smart font" rendering system will display these using ligature glyphs corresponding to the various contours.
| Tone letter | Unicode representation |
| ˥ | U+02E5 MODIFIER LETTER EXTRA-HIGH TONE BAR |
| ˦ | U+02E6 MODIFIER LETTER HIGH TONE BAR |
| ˧ | U+02E7 MODIFIER LETTER MID TONE BAR |
| ˨ | U+02E8 MODIFIER LETTER LOW TONE BAR |
| ˩ | U+02E9 MODIFIER LETTER EXTRA-LOW TONE BAR |
| ˧˨˥ | < U+02E7, U+02E8, U+02E5 > |
(Some prefer tone letters with vertical bars on the left. This is simply a font variation and does not require distinct encoding.) Some tone diacritics are currently supported in Unicode, but not all:
| Tone diacritic | Description | Unicode representation |
| ◌̋ | Extra high level | U+030B COMBINING DOUBLE ACUTE ACCENT |
| ◌́ | High level | U+0301 COMBINING ACUTE ACCENT |
| ◌̄ | Mid level | U+0304 COMBINING MACRON |
| ◌̀ | Low level | U+0300 COMBINING GRAVE ACCENT |
| ◌̏ | Extra low level | U+030F COMBINING DOUBLE GRAVE ACCENT |
| | High rising contour | (not supported) |
| | Low rising contour | (not supported) |
| | Rising-falling contour | (not supported) |
In addition, some like to use superscript numbers to indicate pitch levels. Unicode already supports a complete set of superscript digits, 0 to 9:
| Superscript digit | Unicode representation |
| 0 | U+2070 SUPERSCRIPT ZERO |
| 1 | U+00B9 SUPERSCRIPT ONE |
| 2 | U+00B2 SUPERSCRIPT TWO |
| 3 | U+00B3 SUPERSCRIPT THREE |
| 4 | U+2074 SUPERSCRIPT FOUR |
| 5 | U+2075 SUPERSCRIPT FIVE |
| 6 | U+2076 SUPERSCRIPT SIX |
| 7 | U+2077 SUPERSCRIPT SEVEN |
| 8 | U+2078 SUPERSCRIPT EIGHT |
| 9 | U+2079 SUPERSCRIPT NINE |
Tones could also be represented using these characters. One concern with the superscript digits, however, is that these are compatibility characters, and the fact that they are superscripted would be lost when data is normalized into forms NFKD and NFKC. I will return to the problem of compatibility characters later in this section.
One possible concern with having different representations for tones is that they would not be equated in string comparisons. If searching for certain tones in a text, a user would have to search for multiple representations. To avoid this problem, it could be decided to adopt one representation as a best common practice (BCP) for encoding of tones. The other alternative would be to define a character folding mapping that removes this distinction and persuade developers to incorporate support for this folding into their software.
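To make the idea of a folding concrete, here is a deliberately naive sketch. It pairs the five level-tone diacritics with the five tone letters from the tables above and rewrites the diacritics as tone letters before comparison. The mapping table and the function `fold_tones` are hypothetical, not an existing standard folding; the sketch leaves the tone letter where the diacritic stood rather than moving it to a syllable boundary, it ignores contour tones and superscript digits, and it would also fold accents that are used orthographically rather than for tone.

```python
import unicodedata

# Hypothetical folding: level-tone diacritic -> tone letter.
TONE_FOLD = {
    "\u030B": "\u02E5",  # double acute -> extra-high tone bar
    "\u0301": "\u02E6",  # acute        -> high tone bar
    "\u0304": "\u02E7",  # macron       -> mid tone bar
    "\u0300": "\u02E8",  # grave        -> low tone bar
    "\u030F": "\u02E9",  # double grave -> extra-low tone bar
}

def fold_tones(text):
    """Decompose text, then replace level-tone diacritics by tone letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(TONE_FOLD.get(ch, ch) for ch in decomposed)

# Precomposed, decomposed, and tone-letter spellings fold to one key.
spellings = ["m\u00E1", "ma\u0301", "ma\u02E6"]
print(len({fold_tones(s) for s in spellings}))
```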
If a single representation for tones is to be adopted as a BCP recommendation, the best choice for this would be the tone letters, U+02E5 to U+02E9. These are preferable for several reasons: these are already supported in Unicode and are sufficient for all tones, they are not compatibility characters, and they are specifically intended for representing tones and so are unambiguous. If it were considered important to be able to display data using one of the other representations, it would be possible to handle this by transforming to one of the other representations as data is served up from the archive and delivered to the user. It could also be handled in the rendering process, for example by using a font feature to alter the rendering behaviour of the font to show superscript numbers or tone diacritics even though the underlying characters are the tone letters.
In creating an infrastructure for a cooperative archive for linguistic data, it should be decided whether or not to make a BCP recommendation as to how tone should be represented. For the reasons described above, I would suggest that a consistent representation be adopted using the tone letters.
If such a BCP recommendation is made, consideration will have to be given as to whether it is necessary to present data using one of the other conventions for representing tone, and, if so, what mechanisms would be used to implement this. (If this were to be implemented using font features, it may be necessary to request that new features be defined for the various "smart font" formats—OpenType, AAT and Graphite—in order to handle this rendering transformation.)
If multiple representations for tone are to be allowed in data, it would probably be important to request that the tone diacritics for contour tones be added to the Unicode standard. (The linguistics community may want to do this even if it is decided not to use the tone diacritics in the online archive.) The matter of character folding for comparison purposes would also deserve further consideration.
The 1996 revision of IPA as specified in the IPA Handbook [IPA] is an important convention for phonetic and phonemic transcription, and most of the symbols are already supported in Unicode. There are a few symbols that are not currently supported, however, in addition to the contour tone diacritics mentioned above. Also, there are a number of other traditions for phonetic and phonemic transcription, and some of the symbols used in those various traditions are not currently in Unicode. Proposals can be submitted to add some or all of these symbols. I want to focus briefly on the use of superscript symbols in particular.
IPA allows for the use of various superscript letters, to indicate aspiration, for example. The following are already supported in Unicode:
| Superscript letter | Description | Unicode representation |
| ʰ | aspirated | U+02B0 MODIFIER LETTER SMALL H |
| ʲ | palatalized | U+02B2 MODIFIER LETTER SMALL J |
| ʷ | labialized | U+02B7 MODIFIER LETTER SMALL W |
| ˠ | velarized | U+02E0 MODIFIER LETTER SMALL GAMMA |
| ˤ | pharyngealized | U+02E4 MODIFIER LETTER SMALL REVERSED GLOTTAL STOP |
| ⁿ | nasal release | U+207F SUPERSCRIPT LATIN SMALL LETTER N |
| ˡ | lateral release | U+02E1 MODIFIER LETTER SMALL L |
| ˣ | voiceless velar fricative release | U+02E3 MODIFIER LETTER SMALL X |
Some superscript letters used in earlier revisions of IPA or in other traditions are also supported. A few examples follow:
| Superscript letter | Description | Unicode representation |
| ʱ | breathy voiced, murmured | U+02B1 MODIFIER LETTER SMALL H WITH HOOK |
| ʸ | palatalized | U+02B8 MODIFIER LETTER SMALL Y |
| ˀ | ejective | U+02C0 MODIFIER LETTER GLOTTAL STOP |
Unicode does not currently support two of the superscript letters in IPA. Various superscript letters used in other traditions are also missing. Superscript nasals used to show prenasalization are a familiar example:
| Superscript letter | Description |
| | voiceless dental fricative release |
| | mid central vowel release |
| | prenasalization |
(Note that some of these may also be used in transliteration traditions for various scripts. For example, the superscript schewa is used in transliterations of Hebrew to indicate voiceless (elided) Hebrew schewa pointing.) In some traditions, superscripts can also be used to indicate other secondary articulations, such as an oral stop onset to a nasal stop:
| Symbols | Description |
| | voiced bilabial nasal with oral onset |
Some traditions also use superscripts to reflect phonemic interpretations, for example to indicate that a vowel quality is an off-glide of a diphthong phoneme:
| Symbols | Description |
| | open-mid front unrounded vowel with near-close near-back rounded off-glide |
In principle, within such traditions, superscript versions of a large number of consonant and vowel symbols could potentially be used for purposes such as these. There is potential need, therefore, for a significant number of additional superscript letters to be supported in Unicode.
I highlight these potential character needs precisely because the existing superscript characters in Unicode are compatibility characters, so there is potential for the information encoded by superscripting to be lost under normalization. The linguistics community in general, and any cooperative defining encoding policies for a cooperative archive, need to decide which superscript phonetic symbols they need to be able to represent in Unicode-encoded documents, and what the best representation would be.
The immediately obvious means of encoding these superscript symbols is to add each of them to Unicode as a distinct character. This has the potential disadvantage that each would be a compatibility character, but the benefit of allowing for simple implementations, and of using a mechanism that is consistent with the existing recommendations in the IPA Handbook for encoding other superscript symbols.
One alternative would be to use a single control character that combines with another phonetic character to indicate that that character is to be superscripted. For instance, something along the lines of < "b", U+xxxx ZERO-WIDTH SUPERSCRIPT MODIFIER, "m" > could potentially be used to represent a nasal with oral onset. This solution has significant drawbacks, however: it requires a non-trivial and non-transparent implementation involving a new and ad hoc mechanism. There is also potential for implementations to abuse the mechanism, applying it to situations for which it was not intended. (Such a proposal would not likely be given serious consideration by the Unicode Consortium.)
Another possibility is simply to use formatting controls within an application. This should not be done, however. Formatting controls generally require the use of proprietary file formats, which are not good choices for archived data. Furthermore, such out-of-band information is not readily accessible to analytical processes, such as searches, and it is lost when text is exported to plain text.
Yet another possibility involves markup: given that "superscript" can be construed as non-character information, it might not be unreasonable to represent these semantics using markup. So, for example, something like the following might be a possible representation for a nasal with oral onset:
... oro<doublearticulation><secondary>b</secondary>m</doublearticulation>a...
This would be more appealing if similar markup representation were used for other elements of phonetic transcription, such as prosodic elements of an utterance. This might seem an awkward way to represent phonetic transcriptions to some, though perhaps not to anyone familiar with [MathML] markup.
There are disadvantages, though. The most obvious is that this is a somewhat awkward representation for something that is typically perceived as text that doesn't have particularly complex layout issues and has not generally been viewed as representing much in the way of internal structure. (Mathematical formulas are markedly different in both of these regards, which is why a rich markup language makes sense for that domain.) If linguists decided that they wanted to record phonetic transcriptions that included significant structural information of some sort, a markup language might be a good way to do this. For conventional phonetic transcription, however, there is no particular need for markup except, perhaps, as a means to handle the particular issue of superscript symbols.
Furthermore, even if markup were devised for phonetic and phonemic transcription, it is not clear how well this could be adapted for other uses, such as with Hebrew transliteration. Those uses will generally require very few superscript characters, which makes the use of markup even less appealing.
Additional disadvantages are that it would make it more difficult for users to enter data, and it would require a non-trivial and novel implementation that uses a different mechanism than is used for the existing recommendations in the IPA Handbook for encoding other superscript symbols (unless those were to be abandoned in favour of a markup mechanism).
In view of the pros and cons, the best solution for representing additional superscript symbols appears to be to request that they be added to Unicode as distinct characters, in spite of the fact that they would be compatibility characters with "superscript" treated in the decompositions as non-character information. This leaves us in a confusing situation, though, since compatibility characters are generally to be avoided (see below).
The question remains, though, as to just what additional superscript symbols are required for use in archived linguistic data. This issue needs to be resolved by the linguistic community that will be participating in the distributed online archives we are attempting to develop.
I have considered the need for superscript letters, which are a specific case in the more general class of compatibility characters. There are many other sub-classes of compatibility characters. Some of these might be acceptable for use in archived data, given the same caveats expressed for superscripts. In general, however, compatibility characters should not be used. I will return to this in section 3.3.
As mentioned in section 2.1, character encoding and document encoding are distinct levels of information representation, but there can be situations in which semantic content might reasonably be represented in either level, either in terms of markup or in terms of encoded characters. For example, paragraphs can be delimited using characters such as carriage return (U+000D) or U+2029 PARAGRAPH SEPARATOR, but this is typically done in markup languages using markup, such as <p>... </p>.
These issues are discussed in general in Draft Unicode Technical Report #20 [DUTR 20]. That document also makes recommendations for some specific situations in which markup-based control is to be preferred over character-based control. It is recommended that those recommendations be adopted for online archives.
I cannot anticipate all of the possible points in the development of document standards for archived linguistic data at which we may need to choose between character- or markup-based representation. I will discuss two points very briefly.
The first has to do with superscript characters. As mentioned in the previous section, one possible solution to concerns over the use of compatibility characters would be to devise markup to represent semantics that would otherwise be represented in terms of compatibility characters. Various disadvantages to such a solution were pointed out. I would point out here some additional observations drawn from [DUTR 20, section 2]. Encoding text as a sequence of characters results in a linear structure. In contrast, markup provides a hierarchical structure. Markup is more appropriate, therefore, for representing information that is hierarchical in nature, whereas characters are more appropriate for information that is linear in nature. In addition, markup is well suited where control over spans of information is required, but is not very efficient where very local control is needed. Indeed, markup that is used in local contexts can be intrusive and get in the way of processes that operate on the character data. Where local control is needed, character codes are, in general, preferable. Phonetic and phonemic transcription is essentially linear as opposed to being hierarchically structured. (This is true, at least, in the ways in which it has been done in the past and in which IPA has been designed.) Also, superscripting is applied in very local contexts. In view of the preceding observations, both of these factors argue in favour of character-based representation for semantic information embodied in superscript phonetic symbols, and against the use of markup.
The second has to do with annotated, interlinear text. Some may have observed that Unicode has three characters that are specifically intended for interlinear text annotations:
| U+FFF9 INTERLINEAR ANNOTATION ANCHOR |
| U+FFFA INTERLINEAR ANNOTATION SEPARATOR |
| U+FFFB INTERLINEAR ANNOTATION TERMINATOR |
Given linguists' interest in annotated, interlinear text, some might wonder if these characters are somehow intended for use with the kind of interlinear text that linguists are accustomed to dealing with. I simply wish to point out that these characters are not intended for this purpose. They are intended for a specific type of annotation text typically used for Japanese that is known as furigana. (These are short annotations beside kanji ideographs to indicate pronunciation.) Furthermore, these characters are specifically intended only for internal use by software processes, and not for storage or transmission of text. These characters should not, therefore, be used in archived data. (Note that this is one of the specific recommendations made in [DUTR 20].)
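A deposit-time check for these three characters is straightforward. The sketch below is only an illustration; the function name and the idea of reporting positions are assumptions, not part of any archive specification.

```python
# The three interlinear annotation characters, U+FFF9 to U+FFFB.
INTERLINEAR_ANNOTATION = {"\uFFF9", "\uFFFA", "\uFFFB"}

def annotation_positions(text):
    """Return the indices at which interlinear annotation characters occur."""
    return [i for i, ch in enumerate(text) if ch in INTERLINEAR_ANNOTATION]

# Schematic input: anchor, base text, separator, annotation, terminator.
sample = "\uFFF9kanji\uFFFAkana\uFFFB"
print(annotation_positions(sample))
```

A non-empty result would indicate that the data should be converted to a markup-based representation of the annotation before it is archived.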
Given that some character sequences have alternate, synonymous representations, we need to consider what the implications are for archived data, and whether a single, normalized representation should be used.
The decomposed and composed representations each have benefits in different situations. The decomposed representation is probably preferable for many types of analytical processing. For example, writing a search process that ignores diacritics is easier if the data can be assumed to be decomposed. Decomposed data also has an internal consistency in how it represents characters, which has a certain appeal. On the other hand, it is not difficult for an application developer to include a normalization step that precedes analytical processes that require decomposed data. In addition, many implementations exist that work with the composed representation for many Latin base-diacritic combinations. For example, the composed forms for European languages are used in Microsoft Windows.
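The point about diacritic-insensitive searching can be sketched briefly. The example assumes Python's `unicodedata` and uses hypothetical helper names. It includes the normalization step mentioned above, so it works whether the stored data is composed or decomposed, and it treats "diacritic" as meaning any nonspacing mark (general category Mn), which is a simplification.

```python
import unicodedata

def strip_marks(text):
    """Decompose text and drop nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

def contains_ignoring_marks(haystack, needle):
    """Substring search that ignores diacritics on both sides."""
    return strip_marks(needle) in strip_marks(haystack)

# The query has no diacritics; the data uses a precomposed character.
print(contains_ignoring_marks("na\u00EFve", "naive"))
```

If the data were guaranteed to be in NFD already, the call to `normalize` on the haystack could be dropped, which is the convenience the paragraph above refers to.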
There is an important argument in favour of using NFC: it is the recommendation made in a W3C working draft, Character Model for the World Wide Web [W3C CharMod, sections 4, 4.1]:
Character data interchange using W3C protocols and formats is based on the principle of early normalization, which defines the exact form to which text data has to be normalized, and the cases in which normalization must be applied... Text data is in normalized form according to this specification if all of the following apply:
- [Issue: Normalizing out escapings] All escapings that are not syntactically relevant and that are not needed because of the limitations of the encoding used are replaced by the actual characters.
- It is in Unicode Canonical Composition (Normalization Form C) according to [UTR #15]. Note: The cutoff version is Version 3.0 of the Unicode Standard (planned to be identical to the next edition of ISO/IEC 10646-1). Using Version 3.0 of the Unicode Standard as a cutoff version for Normalization Form C does not mean that characters not in Version 3.0 cannot be used in Web documents; it just means that precomposed characters added after Version 3.0 will have to be represented as decomposed.
- It does not include some strongly discouraged codepoints...
(Note: the characters that are discouraged from use are enumerated in [DUTR 20]. That document is still in draft status, so it is possible that the list of characters to be avoided may change. I think it is unlikely that any changes will be made in relation to characters already in version 3.0 of Unicode, however.) The intent regarding early normalization is that data should be normalized into NFC at the earliest possible opportunity. The current thinking is that any recipient should be able to assume that data is in NFC.
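In practice, early normalization amounts to a small step at the point where data enters a system. The following sketch assumes Python's `unicodedata`; the function `ingest` is hypothetical. Note also that Python normalizes according to the Unicode version bundled with the interpreter, whereas the working draft fixes Version 3.0 as the cutoff.

```python
import unicodedata

def is_nfc(text):
    """True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

def ingest(text):
    """Normalize incoming text to NFC and report whether it changed."""
    normalized = unicodedata.normalize("NFC", text)
    return normalized, normalized != text

# A decomposed sequence is composed on the way in.
stored, changed = ingest("A\u0323\u0306")
print(f"U+{ord(stored):04X}", changed)
```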
This recommendation is intended to apply to all protocols and file formats for the World Wide Web. This includes HTML and XML. It is addressed to the authors of those specifications, rather than to users and developers; that is, the expectation is that the authors of the specification for, say, XML, would add these requirements into that specification. This is still a working draft, and other specifications such as [XML] do not currently reference it. Thus, we could potentially disregard it. We should perhaps anticipate the possibility that it will become a requirement by other standards in the future.
We have considered the choice between decomposed and composed normalizations. The issue of compatibility characters remains. It would be an option to recommend not merely that NFC be used, but that the more restrictive NFKC be used. (Note that data in NFKC is, by definition, also in NFC.) In this regard, it should be noted that [TUS 3.0] generally discourages the use of compatibility characters. For example, "As with other compatibility characters, the preferred Unicode encoding is to use the nominal counterparts of these characters and use rich text font or style bindings to select the appropriate glyph size and width" (p. 274). Its statements are not entirely consistent, however:
Compatibility characters are those that would not have been encoded (except for compatibility) because they are in some sense variants of characters that have already been coded...
Identifying one character as a compatibility variant of another character implies that generally the first can be remapped to the other without the loss of any information other than formatting. Such remapping cannot always take place because many of the compatibility characters are in place just to allow systems to maintain one-to-one mappings to existing code sets. In such cases, a remapping would lose information that is felt to be important in the original set... Because replacing a character by its compatibility equivalent character or character sequence may change the information in the text, implementation must proceed with caution. A good use of these mappings may not be in transcoding, but rather in providing the correct equivalence for searching and sorting.
[TUS 3.0, p. 17, emphasis added.]
This is not entirely easy to interpret. The situation in which loss of information appears to be a concern is specifically when one-to-one mapping with certain legacy encoding standards is required. For our purposes, I will assume that legacy encodings are not an issue.
The specific situation in which compatibility characters are a concern for us is with superscript characters used in phonetic and phonemic transcription. In these situations, mapping to compatibility decompositions would result in a loss of information important to linguists. Thus, the suggestion that a "good use of these mappings may [be]... in providing the correct equivalence for searching and sorting" would be incorrect.
Apart from superscript characters, it appears that most or all other compatibility characters could be avoided in archived data. Unfortunately, it is not an easy matter to enumerate exactly which might be acceptable to use, and which are not. In general, a careful reading of the standard [TUS 3.0, chapters 6–13] is necessary. The following are some initial guidelines for archived language and linguistic data:
A limited number of individual relevant cases remain. For example, U+00A0 NO-BREAK SPACE; U+017F LATIN SMALL LETTER LONG S; U+0675 ARABIC LETTER HIGH HAMZA ALEF; U+0EB3 LAO VOWEL SIGN AM; U+0F77 TIBETAN VOWEL SIGN VOCALIC RR; or U+2011 NON-BREAKING HYPHEN. Space does not permit discussion of all of these here, however. Furthermore, the implications of using or not using some of these are not entirely clear to me at present. My impression, though, is that all of them (apart from the superscripts in question) can be avoided.
This leaves us in somewhat of an awkward position. In general, we can recommend that compatibility characters not be used, except in the case of the specific superscript characters (many of which have yet to be added to the standard) that are used for phonetic and phonemic transcription or for other linguistic purposes, such as transliteration. We have these alternatives:
I am not sure what the best resolution to this matter is. I suspect that it would not be easy to convince the Unicode Consortium to pursue the third option.
As mentioned in section 2.2, Unicode supports encoding forms based on 8-, 16- and 32-bit data types, known respectively as UTF-8, UTF-16 and UTF-32. In general, UTF-16 is considered the default encoding form for Unicode. The others are also available for use, however, and some software may support one but not the others. In creating an infrastructure for an online archive of language and linguistic data—establishing file formats and creating software tools—and in beginning to add content to the archive, some consideration must be given to the encoding form options.
Each encoding form has certain benefits. For most applications, UTF-16 is a good choice for the memory representation of characters since most of the characters that the largest group of users will be interested in are represented by a single 16-bit value. Maintaining a consistent encoding length allows for efficient processing, and efficient tests for surrogate code values (used to represent characters in the range U+10000 to U+10FFFF) are not difficult to implement. UTF-16 is a good compromise in terms of storage requirements: most characters that would typically be encountered all require 16 bits of storage, whereas in UTF-8 a large number of commonly-used characters (U+0800 to U+FFFF) require 24 bits of storage, and in UTF-32 all characters require 32 bits of storage. (For phonetic transcription data, UTF-8 would provide a very slight improvement in storage requirements over UTF-16.) Also, many application programming interfaces (e.g. Win32) are written assuming UTF-16 as the encoding form.
In many contexts, though, UTF-8 has key benefits. Some operating system environments, notably Unix, have significant components that have never been revised to use 16-bit or 32-bit encoding forms. More generally, there are many existing software implementations that will work with an 8-bit text encoding form, but not with 16- or 32-bit encoding forms. This is true, for example, of many email client applications. In these contexts, clearly, UTF-8 has advantages. UTF-8 is also more efficient with respect to storage requirements for texts that contain predominantly characters from the ASCII character set, since each of these characters requires only 8 bits of storage in UTF-8.
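The storage figures discussed in the last two paragraphs can be reproduced directly. This is a small sketch in Python; the function name is hypothetical, and the little-endian codec variants are used only so that no byte order mark is counted.

```python
def bits_per_character(ch):
    """Storage needed for one character in each encoding form, in bits."""
    return {
        "UTF-8": 8 * len(ch.encode("utf-8")),
        "UTF-16": 8 * len(ch.encode("utf-16-le")),
        "UTF-32": 8 * len(ch.encode("utf-32-le")),
    }

# ASCII letter, IPA letter, Thai character, supplementary-plane character.
for ch in ("a", "\u0254", "\u0E33", "\U00010400"):
    print(f"U+{ord(ch):04X}", bits_per_character(ch))
```

Characters below U+0080 take 8 bits in UTF-8, those from U+0080 to U+07FF (which includes the IPA Extensions block) take 16, and those from U+0800 to U+FFFF take 24.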
The XML 1.0 specification [XML, section 2.2] assumes Unicode as the default character set and states that "[a]ll XML processors must accept the UTF-8 and UTF-16 encodings." In practice, UTF-8 has been more commonly used in XML documents, and it is possible that some XML processors might not be fully conformant and support only UTF-8 (I am not fully abreast of the status of currently shipping XML processors in this regard). For HTML, the HTML 4.01 specification [HTML, sections 5.1 and 5.2] identifies the HTML document character set as Unicode, but it does not assume any encoding form as a default, nor does it specify that browsers must support certain encoding forms. It does note, however, that UTF-8 is commonly used on the web. In practice, most current versions of web browsers support UTF-8; fewer support UTF-16.
In general, UTF-8 is of particular importance for the Internet by virtue of policy established in RFC 2277, IETF Policy on Character Sets and Languages [RFC 2277, section 3.1]:
Protocols MUST be able to use the UTF-8 charset... Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track.
Thus, it is likely that we will see more protocols and software implementations for the Internet and the World Wide Web that support UTF-8 than that support UTF-16. For usage in these contexts, therefore, UTF-8 has definite benefits.
There are some specific contexts in which UTF-32 would be preferred over UTF-16 or UTF-8. If an application were to be created for use with texts containing characters predominantly from the supplementary planes of Unicode (i.e. U+10000 to U+10FFFF), UTF-32 would be a better choice for internal memory representation of text since no processing of surrogate pair code values would be required. Also, in programming interfaces that deal with individual characters (e.g. for passing characters generated by an input method), UTF-32 is preferable since such interfaces typically use 32-bit data types already, and thus it allows any character to be passed in a single call. (For example, the WM_UNICHAR system message that was recently introduced into Win32 uses UTF-32.)
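To make concrete the surrogate processing that UTF-32 avoids, the following Python sketch shows how a supplementary-plane scalar value is split into a UTF-16 surrogate pair, and how a code unit can be tested for being a surrogate. The function names are illustrative assumptions; the arithmetic is the standard UTF-16 algorithm.

```python
def is_surrogate(code_unit):
    """True if a 16-bit code unit is a high or low surrogate."""
    return 0xD800 <= code_unit <= 0xDFFF

def to_surrogate_pair(scalar):
    """Split a scalar value in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("not a supplementary-plane scalar value")
    offset = scalar - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0xF0000)   # first Plane 15 private-use code point
print(f"U+F0000 -> {high:04X} {low:04X}")
```

In UTF-32 the same character is simply the single 32-bit value 0x000F0000, which is why interfaces that pass one character per call are simpler with that form.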
In languages and protocols that are designed to allow users or implementers to describe character mappings where individual characters are referenced (e.g. for describing encoding conversions, transliteration processes, keyboard layouts, etc.), it is preferable to be able to refer to characters in terms of their Unicode scalar values. Generally, this will involve textual representation of hexadecimal values rather than a direct binary representation using one of the three encoding forms. For binary file formats that provide a compiled representation of such mappings, UTF-32 can be a good choice, though UTF-16 may also be a good choice depending on processing and file size considerations.
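As a small illustration of referring to characters by scalar value in text, the following Python sketch converts between characters and the conventional "U+XXXX" hexadecimal notation. The function names are hypothetical, and the notation is only one of several that a mapping description language might adopt.

```python
def format_scalar(ch):
    """Render a single character as U+ followed by at least four hex digits."""
    return f"U+{ord(ch):04X}"

def parse_scalar(token):
    """Parse a token such as 'U+014B' into the character it denotes."""
    if not token.upper().startswith("U+"):
        raise ValueError("expected a token of the form U+XXXX")
    value = int(token[2:], 16)
    # Surrogate code points are not scalar values; nothing lies above U+10FFFF.
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return chr(value)

print(format_scalar("\u0251"))       # a BMP character
print(format_scalar("\U000F0000"))   # a Plane 15 private-use character
```

Because the notation names the scalar value itself, the same mapping description can be compiled to UTF-8, UTF-16 or UTF-32 without change.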
In view of these considerations, decisions should be made regarding recommendations for best common practice as to which encoding forms are considered preferable for various contexts. Given the observations I have made, I would suggest the following as a starting point:
For XML documents intended for viewing in Web browsers, UTF-8 may be the safest choice. On the other hand, any XML processor that would be built into a browser is supposed to support UTF-16, and UTF-16 may be a better storage option for data that is to be used in a variety of applications. I leave this matter for further consideration.
Regardless of which encoding form is used, content developers should always make use of whatever mechanism is offered for a given file format to provide explicit indication of the encoding used. For instance, XML files should always begin with an encoding declaration. (This is expressed in the W3C working draft [W3C CharMod] as something that content developers "MUST" do, using the terminology of [RFC 2119].)
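The recommendation about encoding declarations can be sketched as follows, using Python's standard `xml.etree.ElementTree` module. The element name `entry`, its content, and the helper `declared_encoding` are illustrative assumptions; the helper is a simplified check that only works for ASCII-compatible encodings such as UTF-8, not a full implementation of XML encoding detection.

```python
import re
import xml.etree.ElementTree as ET

# A hypothetical one-element document containing phonetic text.
root = ET.Element("entry")
root.text = "l\u00e6\u014b\u0261w\u026ad\u0292"

# Serialize as UTF-8 with an explicit XML declaration.
# (The xml_declaration argument requires Python 3.8 or later.)
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)

_DECL = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def declared_encoding(document):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _DECL.match(document)
    return match.group(1).decode("ascii") if match else None

print(declared_encoding(data))
# The parser honours the declaration when reading the bytes back.
print(ET.fromstring(data).text == root.text)
```

An archive could apply a check of this kind to submitted files and reject, or at least flag, those that carry no explicit declaration.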
As observed in section 2.7, there are various reasons why users may have need to define custom characters in Unicode's PUA ranges. It should be expected that there will be occasions in which content creators will want to submit content to an archive, and will have needed to use PUA characters. This raises certain issues that require further consideration.
First, in order for archived text data to be useful, the semantics of the encoded characters used in the data need to be known. Therefore, there will need to be a metadata standard to document PUA characters that are used in archived data. This will have to include not only general information, such as the script and a description of the character and its usage, but also the specific information that would be needed to implement support for this character. This includes the character properties defined in Unicode, but may also need to include additional information, such as rendering behaviour.
Secondly, Unicode does not define character properties for PUA characters (though defaults are suggested based on what is anticipated would be most common usage). As a result, applications that implement support for Unicode may have unpredictable behaviour when processing PUA characters. Application developers creating tools for use with archived language and linguistic data should provide mechanisms whereby users can specify character properties for PUA characters that they use. Preferably, common protocols or file formats should be used for this such that a given user-created specification of PUA character properties is portable between applications and, preferably, between platforms.
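The kind of mechanism suggested here can be sketched in Python. The standard `unicodedata` module reports every private-use character with the general category "Co", so an application that wants more specific behaviour has to consult user-supplied data. The override table, its contents and the function names below are hypothetical; a real mechanism would load such data from a shared, portable file format.

```python
import unicodedata

def is_private_use(code_point):
    """True for code points in the BMP, Plane 15 or Plane 16 private-use areas."""
    return (
        0xE000 <= code_point <= 0xF8FF
        or 0xF0000 <= code_point <= 0xFFFFD
        or 0x100000 <= code_point <= 0x10FFFD
    )

# Hypothetical user-supplied properties for one PUA character.
USER_PUA_PROPERTIES = {
    0xE000: {"name": "EXAMPLE MODIFIER LETTER", "category": "Lm"},
}

def general_category(ch, overrides=USER_PUA_PROPERTIES):
    """Return the user-defined category for a PUA character, else the standard one."""
    code_point = ord(ch)
    if is_private_use(code_point) and code_point in overrides:
        return overrides[code_point]["category"]
    return unicodedata.category(ch)

print(general_category("\ue000"))  # category taken from the user's table
print(general_category("\ue001"))  # no user definition: standard "Co"
```

Only private-use code points are allowed to be overridden in this sketch, so a faulty table cannot change the behaviour of standard characters.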
This is not necessarily a trivial problem to resolve, and there may be several ways in which it could be done. It is also a general problem that needs to be solved throughout the IT industry (though it is not as pressing a concern in some sectors as it is in others). This does not mean, though, that we should not pursue a solution for use within the archives that we hope to create. If useful industry standards are developed for this purpose, it would make sense to begin supporting those standards. It is reasonable to expect that it would be possible to convert existing specifications between any standards we might develop internal to our community and the newly-established industry standards.
Thirdly, there may be value in considering cooperative use of the PUA ranges among the archive community. The idea would be that all participants in the archival effort would use common PUA character definitions. The implication of this is that individual users would need to ensure that content that they submit to the archive conforms to the PUA usage defined for the archive, and that any new PUA characters they create would have to be incorporated into the common definitions.
Again, this is not necessarily a trivial problem to resolve. It would require considerable cooperation among participants, and potentially require that some body of people be appointed to control the addition of characters into the common definition. This would be done to avoid any duplication, to ensure that encoding space is not taken up with presentation forms, to catch any ill-conceived implementations and to recommend alternatives in such situations, to encourage cooperation between parties that might otherwise devise competing and incompatible representations for characters of shared interest, and to ensure that adequate documentation is provided. The biggest concern for such a group would be to limit what they take on, and to avoid duplicating all of the work done by the Unicode Technical Committee and the corresponding ISO committee.
Within the context of SIL, these are all matters that we have decided we need to solve for our internal use. With personnel working in projects involving some 1,100 languages around the world, we expect there will be numerous situations in which our personnel will require PUA characters. In several cases, they will need to encode entire scripts as PUA characters. We certainly hope to see most or all of these eventually added to Unicode, and we hope to be able to contribute to making that happen. Until then, however, use of PUA will be needed.
For individual users, there may not be a felt need to document PUA characters (unless that is needed to make software work). It is needed for our internal archives, however. More generally, it is needed whenever our personnel want to exchange data with others outside of their local SIL context. We have considered our situation, and have decided on a plan that balances local need for flexibility with corporate need for common and documented usage. In this plan, a centralized group within our International Administration would assign characters that are thought to be of interest to many or all SIL language projects (for example, phonetic symbols) in the range U+F100 to U+F8FF. At the local level, our SIL field entities (like divisions in a large corporation that operate with some measure of independence) would be free to use PUA characters in the range U+E000 to U+EFFF in whatever manner they choose. (We have chosen to avoid the range U+F000 to U+F0FF since this is used by Microsoft for symbol-encoded fonts, and therefore there is the possibility of unpredictable behaviour in some applications.)
The latter convention allows for local flexibility, but not for inter-entity interchange or corporate archiving. To address these needs, we would do the following in addition: we request that field entities register any PUA character assignments they make with the centralized body, providing the required documentation. That body would then assign those characters to unique codepoints in the Plane 15/16 PUA ranges (U+F0000 to U+FFFFD and U+100000 to U+10FFFD). In doing so, we would unify these characters with identical characters that may have already been registered by another SIL entity. In the process, we would also provide recommendations back to the requesting entity regarding any encoding decisions that we felt might be problematic. Thus, every PUA character used within SIL would (we hope) have a single codepoint in Plane 15 or 16. Our documentation of the codepoint assignments in Plane 15 or 16 would be provided, at least in part, in the form of mapping tables between the Supplementary Plane PUA assignments used in the wider context and the BMP PUA assignments used within the particular SIL entity.
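A mapping table of the kind described could be applied as in the following Python sketch. The specific assignments in `ENTITY_TO_REGISTERED` are invented for illustration and do not reflect any actual SIL registration; the function names are likewise assumptions. The ranges are the local BMP range U+E000 to U+EFFF and the Plane 15 and 16 private-use areas.

```python
LOCAL_RANGE = range(0xE000, 0xF000)        # U+E000..U+EFFF (entity-local use)
PLANE_15_PUA = range(0xF0000, 0xFFFFE)     # U+F0000..U+FFFFD
PLANE_16_PUA = range(0x100000, 0x10FFFE)   # U+100000..U+10FFFD

# Hypothetical table for one field entity: local code point -> registered one.
ENTITY_TO_REGISTERED = {
    0xE000: 0xF0000,
    0xE001: 0xF0001,
}

def validate(mapping):
    """Check that a table maps local BMP PUA code points one-to-one into Planes 15/16."""
    for local, registered in mapping.items():
        if local not in LOCAL_RANGE:
            raise ValueError(f"U+{local:04X} is outside the local PUA range")
        if registered not in PLANE_15_PUA and registered not in PLANE_16_PUA:
            raise ValueError(f"U+{registered:04X} is not in Plane 15/16 PUA")
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two local characters share one registered code point")

def to_registered(text, mapping=ENTITY_TO_REGISTERED):
    """Convert entity-local PUA characters to their registered code points."""
    validate(mapping)
    return text.translate(mapping)

def to_local(text, mapping=ENTITY_TO_REGISTERED):
    """Convert registered code points back to the entity's local assignments."""
    validate(mapping)
    return text.translate({reg: loc for loc, reg in mapping.items()})

archived = to_registered("a\ue000b\ue001")
print([f"U+{ord(ch):04X}" for ch in archived])
```

Requiring the table to be one-to-one within an entity keeps the conversion reversible, so data can move between the local and the archived representation without loss.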
We have not yet really begun to implement this plan, and so cannot yet say how well it will work. To make it work, however, one of the problems we will have to solve is to define an appropriate schema (quite likely using XML) for documentation of PUA characters. Also, our users need to be able to use their PUA characters in our SIL language software. Thus, our software developers will need to define appropriate mechanisms to allow users to define PUA character properties, and to do so in such a way that definitions entered once are made available across all of our applications.
I mention the approach to these issues we are taking within SIL in the hope that at least some of these ideas may be useful in relation to our common archival effort. It is also possible that solutions we implement may be of use for the archival community.
I have covered a lot of material regarding Unicode, and have raised some specific concerns as well as some particular recommendations. I will summarize these here.
There are a number of aspects to Unicode that need to be understood by those creating content for a language and linguistic archive, and also by those developing implementations and standards for this archive. I will mention two in summary. First, it needs to be understood that Unicode encodes abstract units of textual information rather than directly encoding the graphemes used in individual orthographies. Secondly, the distinction between characters and glyphs must be understood, and also the fact that Unicode is intended to encode characters and not glyphs. It is recommended that archived data not be encoded using any presentation forms.
Consideration should be given regarding whether to allow multiple encoded representations of tone, or just one. Related to this, it may be desirable to propose that contour tone diacritics be added to Unicode.
Many superscript symbols needed for phonetic and phonemic transcriptions (and also for some transliteration conventions) are not yet supported in Unicode. There are alternatives for how these could be represented, though it is recommended that new characters be proposed for addition to Unicode.
Certain uses of markup should be deemed inappropriate. This probably applies to the specific possibility of using markup to represent meanings associated with superscripting in phonetic and phonemic transcriptions. More generally, recommendations made in [DUTR 20] should be adopted for use in this archiving project.
There is a concern particularly in relation to superscript characters in that these are compatibility characters in Unicode. The concern is that information essential for linguists would be lost if normalization using forms KD or KC is ever applied. Possible resolutions to this problem were considered, but it is not at all clear what the best solution is.
It is likely that, at some point in the future, W3C protocols and file formats will recommend or require early normalization of data to normalization form C. Accordingly, this should probably be recommended for purposes of this archive effort. More generally, all of the recommendations presented in [W3C CharMod] should likely be adopted.
Unicode allows for three different encoding forms for data (UTF-8, UTF-16 and UTF-32), each of which has advantages for different uses. Consideration must be given as to what recommendations are made for choosing among the various encoding forms for use within this archiving project. Whatever recommendations are made in this regard, content creators should always specify the encoding used for content.
A standard is needed for documenting Unicode private-use characters used in archived data. There is also a need for mechanisms in applications that are used in conjunction with the archive to allow character properties for private-use characters to be specified so that applications will know how to process those characters. Preferably, this would be done in a way in which such information is portable between applications. In addition, the archive community may wish to consider cooperating in using common definitions of Unicode private-use characters.
[RFC 2277] Alvestrand, H. 1998. IETF Policy on Character Sets and Languages. RFC 2277 / BCP 18, January 1998. Available online at <http://www.ietf.org/rfc/rfc2277.txt>.
[RFC 2119] Bradner, S. 1997. Key words for use in RFCs to Indicate Requirement Levels. RFC 2119 / BCP 14, March 1997. Available online at <http://www.ietf.org/rfc/rfc2119.txt>.
[XML] Bray, Tim; Jean Paoli; C. M. Sperberg-McQueen; and Eve Maler (eds.) 2000. Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation 6 October 2000. The current version of this W3C recommendation is available online at <http://www.w3.org/TR/REC-xml>.
[MathML] Carlisle, David; Patrick Ion; Robert Miner; and Nico Poppelier (eds.) 2000. Mathematical Markup Language (MathML) Version 2.0. W3C Candidate Recommendation 13 November 2000. The current version of this W3C recommendation is available online at <http://www.w3.org/TR/MathML2>.
[UAX 15] Davis, Mark, and Martin Dürst. 2000. Unicode normalization forms. Unicode standard annex #15. The current version of this Unicode technical report is available online at <http://www.unicode.org/unicode/reports/tr15/>.
[DUTR 20] Dürst, Martin, and Asmus Freytag. 2000. Unicode in XML and other markup languages. Draft Unicode technical report #20. W3C Working Draft 23-June-2000. The current version of this Unicode technical report is available online at <http://www.unicode.org/unicode/reports/tr20/>.
[W3C CharMod] Dürst, Martin J., and François Yergeau (eds.) 1999. Character Model for the World Wide Web. World Wide Web Consortium Working Draft 29-November-1999. The current version of this W3C working draft is available online at <http://www.w3.org/TR/charmod/>.
[IPA] The International Phonetic Association. 1999. Handbook of the International Phonetic Association: a guide to the use of the International Phonetic Alphabet. Cambridge: Cambridge University Press.
[HTML] Raggett, Dave; Arnaud Le Hors; and Ian Jacobs (eds.) 1999. HTML 4.01 Specification. W3C Recommendation 24 December 1999. The current version of this W3C recommendation is available online at <http://www.w3.org/TR/html401/>.
[TUS 3.0] The Unicode Consortium. 2000. The Unicode standard, version 3.0. Reading, MA: Addison-Wesley.
[UTR 17] Whistler, Ken, and Mark Davis. 2000. Character encoding model. Unicode technical report #17. The current version of this Unicode technical report is available online at <http://www.unicode.org/unicode/reports/tr17/>.