Unicode Character Encoding of Archived Linguistic Data

Peter Constable
SIL International
<peter_constable@sil.org>

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.


Abstract. For archived linguistic and language data, the Unicode™ standard is the best choice for character encoding. In building an infrastructure for online archives of linguistic and language data, there are some aspects of Unicode that need to be understood, and some outstanding issues regarding implementation of Unicode that need to be resolved. This paper will provide an explanation of those aspects of the standard that are of particular relevance, and will discuss the outstanding issues that require further consideration. 


1. INTRODUCTION

Two important purposes for published electronic data archives are to maintain valued information on a long-term basis, and to make that information available to a given audience. Correspondingly, the encoding mechanisms used to represent the data must be documented as part of the metadata of the archive, so that the data can be interpreted long after they were created. These encoding mechanisms must also be interpretable by the various software tools used by the consumers of the archive, so that the archive is usable. This is true of character encoding as well as any other aspect of data encoding.

In the past, the information technology (IT) industry has provided various character encoding standards based on 8-bit text processing technologies. Some of these made use of multi-byte sequences to represent characters so as to support large character repertoires. All of these legacy standards, however, were characterized by being limited in their repertoires: any one of them could support the writing systems of only a relatively small number of languages, and the combined character repertoire of all of them was sufficient to support most of the world's major languages, but not much more. In particular, many symbols used for phonetic transcription and many characters used for writing lesser known languages are not supported in any of the many legacy industry standard encodings. 

On the whole, linguists have dealt with this limitation in a resourceful manner by creating their own fonts, customizing them to support the particular character repertoires that they need. The result, however, has been hundreds if not thousands of different fonts using incompatible, non-standard encodings, and large quantities of linguistic data with undocumented encodings. Such data is meaningful only when accompanied by specific matching fonts.

The Unicode™ character encoding standard provides a solution to encoding problems that is ideally suited for the needs of archived language and linguistic data. In relation to long-term documentation needs, Unicode is a well-documented international standard that is expected to become the dominant character encoding standard within IT for the foreseeable future. It is also the default encoding that is assumed in certain important IT standards, such as XML. Because it is becoming the dominant standard, a wide variety of software products are being developed that support it. As a result, it will be usable by the widest group of users. Furthermore, it aims to support a universal character repertoire that is sufficient to represent the writing systems of all languages, modern and ancient, which makes it especially well suited to the needs of linguistic and language data. For these reasons, Unicode is, without question, the best choice for character encoding in a coordinated archive of linguistic and language data.

Simply deciding to use Unicode does not resolve all of the character encoding issues that need to be considered, however. In order to begin building an infrastructure for a cooperative online archive, in order to begin building the software tools for working with that archive, and in order to begin creating the content to populate that archive, there are some aspects of the Unicode standard that need to be understood, and some outstanding issues regarding how best to implement Unicode that remain to be resolved.

The purpose of this paper is to provide a brief tutorial on Unicode, focusing on those aspects that have particular relevance for the purposes of this workshop, and to describe the outstanding issues that need to be addressed. In some cases, I will provide recommendations as to how these issues might best be resolved, though some will remain for further consideration.

In explaining some of the details of Unicode, my intent is to keep the discussion on a non-technical level as much as possible. Some minimal amount of technical detail will be necessary, however. I do not assume that the reader already has any in-depth technical understanding of Unicode, though I do assume at least basic familiarity. For detailed technical information regarding the standard, the best source is the standard itself [TUS 3.0].

 

2. UNICODE OVERVIEW

2.1. The Place of Unicode in a Broader Technology Context

Unicode provides a means for encoding plain text character data. In a sense, then, it represents the lowest level in a three-level hierarchy of text data representation:

Level: Metadata
Function: Documents the content of a file and the minimal information required in order for a process to begin interpreting the contents.
Examples: file names, file extensions (e.g. ".DOC"), MIME type

Level: Document encoding
Function: Encodes the higher-level structure of a document and the various streams of content that comprise it.
Examples: XML, RTF, PDF, MS Word binary format

Level: Character encoding
Function: Provides a digital representation of character data contained in the document.
Examples: ASCII, ISO 8859-1, Unicode

(For purposes of the remainder of this discussion, I will assume document encoding is done in terms of a markup language such as XML.) In a text processing system, there must be components that deal with information on each level, and appropriate standards are necessary to define mechanisms for representation and processing on each level. Unicode does this only for the level of character encoding.

In practice, there is not always a clean division between the document and character encoding levels. This is especially true for markup languages, such as XML, that utilize character sequences as mechanisms for document encoding. Because of these mechanisms, some characters require special representation when used within the textual content of a document that is encoded using such a markup language. For XML, these issues are documented in the XML specification [XML].
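As a minimal sketch of this interaction between the two levels, the following Python fragment shows characters that XML reserves for markup being given a special representation inside textual content. The sample string is hypothetical, and the use of Python's standard xml.sax.saxutils module is an assumption of this illustration, not something prescribed by the XML specification.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical text content that happens to contain XML markup characters.
text = "if a < b & b > c"

# escape() replaces &, < and > with the predefined entity references.
escaped = escape(text)
print(escaped)  # if a &lt; b &amp; b &gt; c

# unescape() reverses the substitution, recovering the original content.
restored = unescape(escaped)

# Any character can also be written as a numeric character reference
# built from its Unicode scalar value (here U+01A1).
char_ref = "&#x{:04X};".format(ord("\u01a1"))
print(char_ref)  # &#x01A1;
```

The character data itself is unchanged by this step; only its representation within the markup differs.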

In certain situations, some aspects of the semantic content of a document could potentially be encoded either at the character level, in terms of particular character sequences, or at the document encoding level, using some form of markup. For example, mathematical formulas use textual characters, but can also involve operators that require certain visual control for presentation purposes. It may not always be clear whether such semantics are best represented in terms of character sequences or in terms of markup.

To consider an example more closely related to languages and linguistics, some writing systems involve bidirectional text, in which portions are written right-to-left while other portions are written left-to-right. Writing systems based on Arabic script are typical examples: alphabetic characters are written right-to-left, but numbers are written left-to-right (when considering the most significant digit first). Furthermore, a multilingual document might contain runs of Arabic text embedded within spans of text in another language that is written consistently left-to-right. In certain situations, the directional properties of characters are not sufficient to provide the exact level of control of the text direction that a user may require. As a result, it is necessary to have some encoding mechanism to control the directional behaviour. In principle, this could be done using either character sequences or markup. Indeed, both Unicode and HTML provide mechanisms for this very purpose, and mixing both in a single document can result in ambiguous encoding states.
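The directional properties mentioned here can be inspected directly. The sketch below assumes Python's standard unicodedata module and an arbitrary choice of sample characters; it reports the bidirectional class of a Latin letter, an Arabic letter, two kinds of digit, and two of the explicit directional control characters that Unicode provides.

```python
import unicodedata

# Sample characters: a Latin letter, an Arabic letter, a European digit,
# an Arabic-Indic digit, and two explicit directional control characters.
samples = ["a", "\u0647", "1", "\u0661", "\u202b", "\u202c"]

# Map each character name to its bidirectional class.
bidi_classes = {unicodedata.name(ch): unicodedata.bidirectional(ch) for ch in samples}

for name, bidi in bidi_classes.items():
    print(f"{bidi:4} {name}")
```

The Arabic letter is reported as right-to-left (AL) while both kinds of digit have number classes (EN, AN), which is why numbers run left-to-right inside Arabic text. The last two characters are the character-level counterpart of directional markup.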

Some important cases in which Unicode control characters and markup control mechanisms may conflict are discussed in Draft Unicode Technical Report #20 [DUTR 20]. For the purpose of linguistic and language data archives, there may be special situations in which a choice needs to be made between character versus markup mechanisms for encoding particular aspects of semantic content. I will return to this matter in sections 3.1 and 3.2.

We have seen that Unicode as a character encoding standard interacts with other standards that relate to higher levels of text data representation. If we focus only on the level of character data processing and plain text, Unicode is again but one of a collection of technologies that interact.

It is useful in this regard to consider a five-part text-processing model:

In this model, encoding corresponds to the memory representation and storage of text. Input and rendering correspond to the most fundamental of text-based processes: generating the text (typically using a keyboard), and viewing/displaying the text. Analysis represents a collection of many secondary processes for working with text: sorting, case mapping, hyphenation, morphological parsing, etc. Conversion represents a supporting process of transforming data between one character encoding and another. 

The encoding component in this model has central importance since the processes in each of the other components are implemented in terms of the character encoding. If the character encoding is to be Unicode, then the various text processes must be implemented in terms of Unicode.

This has significant implications for text archives that use Unicode: it is necessary to have keyboards that generate Unicode-encoded data. Likewise, it is necessary to have fonts that are based on Unicode encoding. (The use of Unicode also introduces rendering requirements that relate to complex script support. This is discussed further in section 2.3.) Similarly, applications that process text and any supporting components for analytical text processes must be implemented using Unicode. In addition, for as long as users need to work with text encoded in legacy encodings, they require tools for encoding conversion that can map between Unicode and other encodings.
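The conversion component can be sketched as follows. The first half uses Python's built-in codec for ISO 8859-1, a documented legacy standard. The second half stands in for a custom-font encoding of the kind linguists have created; the table HYPOTHETICAL_FONT_MAP and the function convert_custom are invented for this illustration and do not describe any actual font.

```python
# "café" as stored in ISO 8859-1, where byte 0xE9 is e with acute.
legacy_bytes = b"caf\xe9"

# Decoding maps legacy code values to Unicode characters ...
text = legacy_bytes.decode("iso-8859-1")
# ... and encoding serializes those characters in a Unicode encoding form.
utf8_bytes = text.encode("utf-8")

# HYPOTHETICAL custom-font encoding: byte 0x80 displays "c with tilde",
# byte 0x81 displays "eng". Real fonts need their own documented tables.
HYPOTHETICAL_FONT_MAP = {0x80: "c\u0303", 0x81: "\u014b"}


def convert_custom(data: bytes) -> str:
    """Map each byte through the table; other bytes are assumed to be ASCII."""
    return "".join(HYPOTHETICAL_FONT_MAP.get(b, chr(b)) for b in data)


print(utf8_bytes)                          # b'caf\xc3\xa9'
print(len(convert_custom(b"a\x80\x81")))   # 4 Unicode characters from 3 bytes
```

Note that one legacy code value may map to a sequence of Unicode characters, so a conversion table cannot in general be a simple one-to-one substitution.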

It is important to see Unicode in the broader context of a complete text processing model. Practical implementation using Unicode requires all of these other pieces of the overall puzzle to be in place.


2.2. Unicode's Multi-tiered Character Encoding Model

It is often assumed that Unicode is a uniform-width 16-bit encoding. While this was part of the early design goals, the practical requirements of implementation necessitated a multi-tiered character encoding model. This character model is described in detail in Unicode Technical Report #17 [UTR 17]. I will describe here only the relevant details.

At the most basic level, Unicode defines a set of characters, for example LATIN SMALL LETTER O WITH HORN. This set is referred to as an abstract character repertoire. At the next level, a coded character set, each character is assigned a unique integer value. For Unicode characters, these integers are referred to as Unicode scalar values, and are usually cited using the hexadecimal notation U+xxxx, for example U+01A1 LATIN SMALL LETTER O WITH HORN. Unicode scalar values come from a possible range of U+0000 to U+10FFFF, with space for over 1 million characters.
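The relationship between a character name, its scalar value, and the U+xxxx notation can be illustrated with a short sketch, assuming Python's standard unicodedata module:

```python
import sys
import unicodedata

# Look the character up by its name in the abstract character repertoire.
ch = unicodedata.lookup("LATIN SMALL LETTER O WITH HORN")

# ord() gives the Unicode scalar value as an integer.
scalar = ord(ch)
notation = f"U+{scalar:04X}"
print(notation, unicodedata.name(ch))  # U+01A1 LATIN SMALL LETTER O WITH HORN

# The upper end of the range of scalar values.
largest = f"U+{sys.maxunicode:04X}"
print(largest)  # U+10FFFF
```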

At this point, it is important to understand that the encoded representation of a Unicode character is not necessarily the same as its Unicode scalar value. The next level in the model involves mapping Unicode scalar values into computer data types of a fixed size, and three such mappings have been defined: one using 8-bit (byte) values, one using 16-bit values, and one using 32-bit values. This level in the model is referred to as character encoding form, and the three encoding forms defined by Unicode are known as UTF-8, UTF-16, and UTF-32.

In UTF-32, Unicode scalar values are mapped by identity to 32-bit integer code units of the same value. 8-bit and 16-bit code units cannot directly represent over a million characters, however. Therefore, the mappings into UTF-8 and UTF-16 are more complex, using sequences of code units. The UTF-16 encoding form uses sequences of one or two 16-bit code units to represent a Unicode scalar value, while UTF-8 uses sequences of one to four 8-bit code units to represent a Unicode scalar value.
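To make the difference between the three encoding forms concrete, the following sketch counts the code units each form needs for a given character. It relies on Python's built-in codecs; the helper name code_units is introduced only for this illustration.

```python
def code_units(ch):
    """Number of code units needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    }


print(code_units("a"))           # {'UTF-8': 1, 'UTF-16': 1, 'UTF-32': 1}
print(code_units("\u01a1"))      # {'UTF-8': 2, 'UTF-16': 1, 'UTF-32': 1}
print(code_units("\U00010000"))  # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
```

The last case, a scalar value above U+FFFF, is the one in which UTF-16 requires two code units.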

The reason for having multiple encoding forms is one of practicality in implementation. Originally, only a 16-bit encoding form was envisioned, but an 8-bit form was needed for integration with existing 8-bit implementations, such as Unix file systems. Each encoding form has different advantages over the others, and each is considered preferable in different contexts.

Because there are three different encoding forms to choose from, it will be necessary to make choices regarding which to use in the context of online data archives. I will return to this in section 3.4.


2.3. The Character-Glyph Model, and "Smart Font" Rendering

It is important to understand that Unicode distinguishes between characters and glyphs. This distinction is analogous to that between a morpheme and an allomorph. A character is a unit of textual information. It carries semantic information and properties, and also has a typical or nominal shape associated with it. A glyph, on the other hand, is a particular image that is used to represent a character. 

There is not always a one-to-one relationship between characters and glyphs. A single character may appear in different shapes in different contexts (analogous to phonemes and allophones). So, for example, most Arabic letters appear in one of four shapes according to their position within a word: initial, medial, final or isolate. 

[Illustration not preserved: an Arabic character and its contextual glyphs]

In some cases, different characters can also be displayed using the same glyph. For instance, GREEK CAPITAL LETTER ALPHA (U+0391) can be displayed using the same glyph as LATIN CAPITAL LETTER A (U+0041), within certain limitations on the typeface (for example, a Fraktur glyph would not be appropriate for Greek alpha).

Also, a sequence of characters may merge into a single glyph, known as a ligature or conjunct (analogous to portmanteau morphs). So, for example, the Devanagari syllable "ksha" is composed of a sequence of three characters, but is presented using a single glyph.

[Illustration not preserved: the Devanagari character sequence and the single "ksha" glyph]

There can also be cases in which a single character corresponds to multiple glyphs (analogous to discontinuous morphemes). So, for example, the character BENGALI VOWEL SIGN O (U+09CB) corresponds to a pair of glyphs, one before and one following the character for the syllable-initial consonant.

Character sequence: < U+0995 BENGALI LETTER KA, U+09CB BENGALI VOWEL SIGN O >
Glyph sequence: [illustration not preserved]

In general, the relationship between characters and glyphs is many-to-many.

As a rule, Unicode encodes characters but not glyphs. For reasons of backward compatibility with existing encoding standards, it was necessary to make some exceptions to this rule. So, for instance, the contextual forms required for Arabic presentation are all directly encoded in Unicode because they had been directly encoded in one or more legacy industry standards. These presentation forms are convenient for rendering, but they make many other processes more complicated and less efficient. Thus, even though these Arabic presentation forms are encoded in Unicode, their use is discouraged.

Linguists are typically very familiar with the notion of directly encoding various presentation forms of characters. This was necessary in the past because software did not provide any other mechanisms for rendering text where complex script behaviours were involved. So, for example, the SIL IPA93 fonts directly encode four positional variants of the acute accent (o- and i-width overstrikes for both lower and upper case base characters). But linguists also want to perform analytical processes on their phonetic transcription data, and these presentation-form distinctions make that more difficult.

Unicode assumes that distinctions of this sort that pertain to rendering do not belong in the character encoding. The standard was designed with the assumption that it would be implemented in software that would address the needs of rendering entirely within the rendering process. This requires what is often referred to as smart font rendering technologies. Such technologies are covered in other papers from this workshop, so I will not elaborate on them here.

The significance of the character-glyph model for purposes of linguistic and language data archives is this: it is recommended that data be encoded without directly encoding presentation forms. This assumes that those who contribute to and make use of the archives will have access to software that provides the necessary rendering support. This is a concern in the short term, but such technologies are beginning to become available, and are expected to be widely available within a few years.


2.4. Abstract Encoded Characters for Representing Orthographic Characters

In order to use Unicode as the character encoding for linguistic and language data archives, those creating the content that will go into the archive need to know how to encode the characters in their writing systems in Unicode. In this regard, it is important to understand that the Unicode character repertoire is not based on a direct representation of orthographies. Rather, it uses a slightly abstract notion of characters that is language- and writing-system neutral. Unicode uses the notion of abstract character, which is defined as a "unit of information used for the organization, control, or representation of textual data." [TUS 3.0, p. 40.] Thus, a grapheme that constitutes a functional unit within the writing system and orthography of some language will not necessarily correspond to a single, distinct Unicode character. In many cases, a grapheme may be represented by a combination of the representational units from Unicode's abstract character repertoire. 

This distinction is important to understand. New users have often concluded that Unicode does not support certain characters that they need without understanding that these characters are supported in some way that the user wasn't aware of.

First, in this regard, Unicode allows dynamic and productive use of combining marks. For example, suppose the orthography for some language includes a letter "c with tilde". Unicode contains many precomposed Latin letter-with-diacritic combinations, but not this particular combination. Nevertheless, this grapheme is supported in Unicode using a character sequence:

Grapheme: c̃ (c with tilde)
Unicode character sequence: < U+0063 LATIN SMALL LETTER C, U+0303 COMBINING TILDE >

As mentioned, use of combining marks is productive. It is possible to represent multiple diacritics, if needed:

Unicode character sequence for a hypothetical combination:

< U+0063 LATIN SMALL LETTER C, 
U+0324 COMBINING DIAERESIS BELOW, 
U+032A COMBINING BRIDGE BELOW, 
U+0303 COMBINING TILDE, 
U+0306 COMBINING BREVE, 
U+0301 COMBINING ACUTE ACCENT >

Secondly, many requests have been made to encode digraph characters on the basis that "this is a separate letter in our language and has its own place in the sort order." A familiar example would be Spanish "ch": both "c" and "h" occur as separate graphemes in Spanish orthography, but "ch" is traditionally treated as a separate letter, and sorts between "c" and "d" (i.e. "cha" would sort after "cu" rather than before "ci"). Sorting is a text process that relies on but is distinct from encoding, however. It has been possible for years to sort Spanish words in traditional order without needing to encode a distinct "ch" character. 
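The point that sorting is a process separate from encoding can be illustrated with a small sketch. The alphabet below is a simplification that treats only "ch" as a separate letter (other traditional Spanish letters such as "ll" and "ñ" are omitted), and the names ALPHABET, RANK and traditional_key are invented for this example. The text is encoded with ordinary "c" and "h" characters throughout.

```python
# Simplified traditional alphabet: "ch" is a letter between "c" and "d".
ALPHABET = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def traditional_key(word):
    """Build a sort key from letters, reading "ch" as one unit."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):
            unit, i = "ch", i + 2
        else:
            unit, i = word[i], i + 1
        # Characters outside the simplified alphabet sort last.
        key.append(RANK.get(unit, len(ALPHABET)))
    return key


words = ["cha", "cu", "ci", "da"]
print(sorted(words))                       # ['cha', 'ci', 'cu', 'da']
print(sorted(words, key=traditional_key))  # ['ci', 'cu', 'cha', 'da']
```

The second ordering places "cha" after "cu", as in the traditional order, without any distinct "ch" character in the data.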

It is important that these matters be understood not only by those creating the content that makes up an archive, but also by those developing software for use with the archived data. For example, developers need to understand that counting "characters" (in the orthographic sense) is not merely a matter of counting Unicode characters. In general, there is a many-to-many relationship between Unicode characters and graphemes, and the relationship is potentially language-dependent.
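A rough sketch of the counting problem follows, assuming Python's standard unicodedata module. The function count_base_units is a deliberately simplified approximation introduced here: it merely skips characters whose general category is a mark, and it knows nothing about language-specific units such as digraphs.

```python
import unicodedata


def count_base_units(text):
    """Count characters that are not combining marks (general category M*)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# "c with tilde" followed by "a": two graphemes, three Unicode characters.
word = "c\u0303a"
print(len(word))               # 3
print(count_base_units(word))  # 2

# The approximation is not language-aware: for Spanish "chico" it reports 5,
# although the traditional orthography would count "ch" as one letter.
print(count_base_units("chico"))  # 5
```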

A third potential problem in recognizing how characters are to be encoded results from not understanding the intention of the character charts provided in the Unicode standard: the glyphs shown for each character are intended as representative glyphs. In some cases, two different languages can require different shapes for graphemes that would be represented using a single Unicode character. For example, two different shapes for upper case "eng" are preferred by different language communities in Papua New Guinea:

[Illustration not preserved: upper and lower case "eng", variant 1 and variant 2]

The character charts published by Unicode show only the second variant for the upper case eng. Someone looking for the first shape might conclude that a new character needs to be added to Unicode. Shape variations from one language to another are not a sufficient basis for character distinctions in Unicode, however. In both situations, the upper case letter is encoded in Unicode as U+014A LATIN CAPITAL LETTER ENG. The difference in glyphs is handled in the rendering process, possibly using different fonts or using language-based glyph selection (if supported by the software being used).   

Finally, in a small number of instances, Unicode includes what might initially appear to be duplicate characters. In these cases, a user might easily choose the wrong character. For instance, the following characters have the same shape, but are different:

U+2019 RIGHT SINGLE QUOTATION MARK

U+02BC MODIFIER LETTER APOSTROPHE

These two characters function differently: the first is a punctuation character, and is not word-forming, while the second functions as a word-forming letter. In Unicode, these two characters are distinguished by character semantics.
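The difference in character semantics can be observed in the general category property and in its effect on a simple text process. The sketch below assumes Python's standard unicodedata and re modules; the test strings are invented, and is_single_word is a helper defined only for this illustration.

```python
import re
import unicodedata

QUOTE = "\u2019"       # RIGHT SINGLE QUOTATION MARK
APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE

# General category: Pf is final-quote punctuation, Lm is a modifier letter.
categories = {unicodedata.name(ch): unicodedata.category(ch)
              for ch in (QUOTE, APOSTROPHE)}
print(categories)


def is_single_word(text):
    """True if the whole string consists of word-forming characters."""
    return re.fullmatch(r"\w+", text) is not None


print(is_single_word("k" + APOSTROPHE + "a"))  # True
print(is_single_word("k" + QUOTE + "a"))       # False
```

A word-boundary process therefore keeps a string containing U+02BC together but splits one containing U+2019, even though the two look the same on the page.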

Unicode character semantics refer not to how characters are interpreted linguistically in terms of the phonology of a given language, but to how characters relate to other characters and to how they behave in relation to text processing. Unicode defines a variety of properties for every character, many of which are provided to control how characters behave in relation to processes such as line breaking. These properties are an intrinsic part of Unicode characters, and to a large extent, it is the semantic properties that define a Unicode character.

It is important to understand in this that Unicode specifies character properties of two types: normative and informative. Normative properties are a formal part of the definition of the standard, and are mandatory for implementations that follow the standard. Informative properties, on the other hand, are helpful guides, but are not required to be followed, in some cases because they are not appropriate for every situation. For example, U+0069 LATIN SMALL LETTER I has the normative property of being a lower case letter. Another property for this character is an upper-case mapping to U+0049 LATIN CAPITAL LETTER I, but that property is an informative property. 

Character: U+0069 LATIN SMALL LETTER I
Category (normative): lowercase letter
Upper case mapping (informative): U+0049 LATIN CAPITAL LETTER I

The reason is that different case mappings may be required for particular languages. Such is true in the instance of Turkish, for which the lower-case dotted "i" maps to an upper-case dotted "i", U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.

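Typical case mapping: U+0069 LATIN SMALL LETTER I maps to U+0049 LATIN CAPITAL LETTER I
Turkish case mapping: U+0069 LATIN SMALL LETTER I maps to U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE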
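A language-specific case mapping of this kind can be sketched as a tailoring applied ahead of the default mapping. Python's built-in str.upper() is not language-sensitive, so the function upper_turkish below is a hypothetical helper written for this illustration. It also handles the dotless lower-case "i" (U+0131), which Turkish pairs with the plain capital "I".

```python
def upper_turkish(text):
    """Upper-case text using Turkish mappings for dotted and dotless i."""
    tailored = text.replace("i", "\u0130")    # dotted i -> capital I with dot above
    tailored = tailored.replace("\u0131", "I")  # dotless i -> plain capital I
    return tailored.upper()                   # default mappings for everything else


print("i".upper() == "I")                  # True: the typical mapping
print(upper_turkish("i") == "\u0130")      # True: the Turkish mapping
print(upper_turkish("kitap") == "K\u0130TAP")  # True
```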

Informative properties have secondary importance in defining a character. Thus, if a user is evaluating Unicode characters to determine how to represent particular graphemes, informative properties can help them to understand the typical use of a Unicode character, but if these properties do not match the behaviours of a given grapheme, that does not mean that this character cannot be used to represent the grapheme. The user should look particularly closely at the normative properties, however: if a normative property does not match the behaviour of a given grapheme, then that indicates that a different Unicode character is needed. (For more information on Unicode character properties, see chapter 4 of [TUS 3.0].)

We saw earlier that shape was not a sufficient basis for distinguishing Unicode characters, and now we have seen that it is also not necessary for distinguishing characters. A character distinction can be justified if two candidates are used and distinguished within a given writing system, or if they differ in their normative character properties, whether or not they co-occur in any single writing system. 


2.5. Canonically Equivalent Representations and Normalization

We have seen that Unicode allows for dynamic and productive composition of characters using combining marks. It was also mentioned that Unicode aims not to encode presentation forms. Taking these points together, one may find it surprising that Unicode includes a number of precomposed base character-plus-diacritic combinations. In principle, if a rendering system can deal with dynamic composition, then it is not necessary for any precomposed combinations to be directly encoded. For reasons of backward compatibility with existing industry standards, however, it was necessary to include a number of precomposed combinations in Unicode.

In section 2.3, it was mentioned that several Arabic presentation forms were also encoded in Unicode for backward compatibility with existing standards. There is a significant difference, however, between an Arabic presentation form, such as U+FEEA ARABIC LETTER HEH FINAL FORM, and a precomposed base-diacritic combination, such as U+00E1 LATIN SMALL LETTER A WITH ACUTE. Note that the Arabic presentation form is not unconditionally interchangeable with its counterpart, U+0647 ARABIC LETTER HEH. For example, ARABIC LETTER HEH can occur in any word position, but not ARABIC LETTER HEH FINAL FORM. In contrast, the precomposed combination LATIN SMALL LETTER A WITH ACUTE is unconditionally interchangeable with its decomposed counterpart, the combination <LATIN SMALL LETTER A, COMBINING ACUTE ACCENT>. In cases in which two character sequences are unconditionally interchangeable in this manner, Unicode asserts that these sequences are fully synonymous and equivalent character representations.

In Unicode, this relationship of synonymy is referred to as canonical equivalence. This relationship is specified as a normative property of characters. For example, the character U+00E1 LATIN SMALL LETTER A WITH ACUTE has as one of its normative properties the canonical decomposition mapping to the sequence <U+0061 LATIN SMALL LETTER A, U+0301 COMBINING ACUTE ACCENT>. Because of this, U+00E1 is canonically equivalent to the sequence < U+0061, U+0301 >, and the two representations are considered to be fully synonymous and equivalent.
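The canonical decomposition mapping can be read directly from the character properties. The following sketch assumes Python's standard unicodedata module, whose decomposition() function returns the mapping as a string of hexadecimal code values.

```python
import unicodedata

precomposed = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# The decomposition mapping property, as space-separated hex values.
mapping = unicodedata.decomposition(precomposed)
print(mapping)  # 0061 0301

# Rebuilding a string from the mapping yields the decomposed sequence.
rebuilt = "".join(chr(int(cp, 16)) for cp in mapping.split())
print(rebuilt == decomposed)      # True

# As raw code sequences, however, the two representations differ.
print(precomposed == decomposed)  # False
```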

In a small number of instances, a Unicode character has a canonical decomposition of another apparently identical character. For example, U+037E GREEK QUESTION MARK has a canonical decomposition of U+003B SEMICOLON. Apart from the name and decomposition, these characters are identical. As with the previous cases, these have been included in Unicode to maintain backward compatibility with source industry standards in which these character distinctions existed.

Given that Unicode allows for a text to have alternate, equivalent representations, there is a potential problem for processes that do any type of string comparison. For example, a user may be searching for LATIN SMALL LETTER A WITH ACUTE whereas the document may contain instances of the combination < LATIN SMALL LETTER A, COMBINING ACUTE ACCENT>. In order to deal with this problem, Unicode defines certain normalization forms that remove these artificial distinctions. 

As described in Unicode Standard Annex #15 [UAX 15], artificial distinctions related to canonical equivalence are handled using either of two normalization forms: normalization form D (NFD) and normalization form C (NFC). The normalization for NFD specifies that the text be transformed into its maximally decomposed representation, so that no remaining character has a canonical decomposition. NFC is conceptually the opposite: text in normalization form C is represented using precomposed characters as much as possible. (See [UAX 15] for details regarding how each normalization form is generated.) Both normalization forms provide unique representations for a text—for any string, there is only one decomposed normal form and one precomposed normal form for that string. Thus, either can serve as a reference point for comparison with other strings.
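The search problem and its resolution can be sketched as follows, assuming Python's standard unicodedata.normalize function. The sample document string and the helper canonical_contains are invented for this illustration.

```python
import unicodedata


def canonical_contains(document, query, form="NFC"):
    """Substring test performed on normalized copies of both strings."""
    return (unicodedata.normalize(form, query)
            in unicodedata.normalize(form, document))


document = "ma\u0301s"  # contains a + COMBINING ACUTE ACCENT
query = "\u00e1"        # LATIN SMALL LETTER A WITH ACUTE

print(query in document)                           # False
print(canonical_contains(document, query, "NFC"))  # True
print(canonical_contains(document, query, "NFD"))  # True
```

Either normalization form works as the reference point, provided that both strings are brought into the same form before comparison.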

When multiple combining marks co-occur on a base character, the combining marks may or may not interact typographically with one another. For example, if two marks occur above the base character, they must be positioned in some order relative to one another, and these differences in relative position can affect the meaning of the text. In Unicode, the relative order would be encoded in terms of the order of the character codes in the file:

< U+0061 LATIN SMALL LETTER A, 
U+0302 COMBINING CIRCUMFLEX ACCENT, 
U+0303 COMBINING TILDE >

< U+0061 LATIN SMALL LETTER A, 
U+0303 COMBINING TILDE, 
U+0302 COMBINING CIRCUMFLEX ACCENT >

In contrast, if two combining marks do not interact typographically, there is no possibility of changing the meaning of the text by changing relative positioning. Nevertheless, the two combining characters must come in some relative order in the data stream. Thus, we have another situation in which different character representations can mean the same thing.

Processes need to know whether or not the relative ordering of combining characters within a data stream has any semantic significance. This is handled by assigning every combining mark to a particular combining class, which is a normative property of characters. In situations where the relative ordering of combining characters is not semantically significant, the artificial distinction reflected in the ordering difference is eliminated as part of the normalization process. This is true for each of the normalization forms defined by Unicode.
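The role of combining classes can be illustrated with a short sketch, again assuming Python's standard unicodedata module. Marks with the same combining class interact, so their order is preserved by normalization; marks with different classes do not, so normalization puts them into a single fixed order.

```python
import unicodedata

marks = {"CIRCUMFLEX": "\u0302", "TILDE": "\u0303", "DOT BELOW": "\u0323"}
classes = {name: unicodedata.combining(ch) for name, ch in marks.items()}
print(classes)  # {'CIRCUMFLEX': 230, 'TILDE': 230, 'DOT BELOW': 220}


def nfd(text):
    return unicodedata.normalize("NFD", text)


# Same class (both above): the two orders remain distinct.
print(nfd("a\u0302\u0303") == nfd("a\u0303\u0302"))  # False

# Different classes (above vs. below): both orders normalize identically.
print(nfd("a\u0302\u0323") == nfd("a\u0323\u0302"))  # True
```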

Thus, a string may have a number of different equivalent representations that differ in terms of precomposed or decomposed representations and in ordering of combining characters. As mentioned above, however, there is a unique NFD representation, and a unique NFC representation. This is illustrated in the following example:

Visual appearance: Ặ

Encoded character sequence: < U+1EA0 LATIN CAPITAL LETTER A WITH DOT BELOW, U+0306 COMBINING BREVE >

Alternate, equivalent representation: < U+0102 LATIN CAPITAL LETTER A WITH BREVE, U+0323 COMBINING DOT BELOW >

Second alternate, equivalent representation: < U+0041 LATIN CAPITAL LETTER A, U+0306 COMBINING BREVE, U+0323 COMBINING DOT BELOW >

Equivalent representation in NFD: < U+0041 LATIN CAPITAL LETTER A, U+0323 COMBINING DOT BELOW, U+0306 COMBINING BREVE >

Equivalent representation in NFC: U+1EB6 LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
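The example above can be checked mechanically. The sketch below, which assumes Python's standard unicodedata module, normalizes each of the five representations and collects the distinct results.

```python
import unicodedata

representations = [
    "\u1ea0\u0306",   # A WITH DOT BELOW + COMBINING BREVE
    "\u0102\u0323",   # A WITH BREVE + COMBINING DOT BELOW
    "A\u0306\u0323",  # A + COMBINING BREVE + COMBINING DOT BELOW
    "A\u0323\u0306",  # A + COMBINING DOT BELOW + COMBINING BREVE
    "\u1eb6",         # A WITH BREVE AND DOT BELOW
]

# Five different code sequences ...
print(len(set(representations)))  # 5

# ... but only one result under each normalization form.
nfd_forms = {unicodedata.normalize("NFD", s) for s in representations}
nfc_forms = {unicodedata.normalize("NFC", s) for s in representations}
print(len(nfd_forms), len(nfc_forms))  # 1 1
```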

Since it is possible for text to be represented in several different but equivalent ways, this raises an issue for data held in online archives. This matter will be discussed further in section 3.3.


2.6. Compatibility Characters

While an Arabic presentation form such as U+FEEA ARABIC LETTER HEH FINAL FORM is not unconditionally interchangeable with the corresponding character U+0647 ARABIC LETTER HEH, there clearly is a close relationship. If the presentation form were to be replaced with the other character, there is some minor meaning lost that pertains to rendering, but there is also important meaning that is retained. Note that the information that relates to rendering would be redundant if adequate rendering support is available, but would be necessary otherwise. Here we have a case of near synonymy: in many situations, a user would not see any need to distinguish the two characters, but in some situations the distinction may be important.

There are 2,172 characters in Unicode of this sort, that is, near-synonyms of other characters. These characters were all included for purposes of backward compatibility with source industry standards and are known as compatibility characters. The relationship between a compatibility character and its near synonym is captured in Unicode in the same manner as canonical equivalence: a normative decomposition mapping is provided for compatibility characters. This type of decomposition is known as a compatibility decomposition, as opposed to a canonical decomposition. Compatibility decompositions are distinguished from canonical decompositions in that the mapping includes some kind of non-character information, as shown in the following examples:

Compatibility character                              Decomposition
U+FEEA ARABIC LETTER HEH FINAL FORM                  <final> 0647
U+FF41 FULLWIDTH LATIN SMALL LETTER A                <wide> 0061
U+00B9 SUPERSCRIPT ONE                               <super> 0031
U+02B0 MODIFIER LETTER SMALL H                       <super> 0068
U+210E PLANCK CONSTANT                               <font> 0068
U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING     <compat> 0061 02BE
U+0E33 THAI CHARACTER SARA AM                        <compat> 0E4D 0E32

In the case of Arabic presentation forms, the non-character information has to do with contextual rendering behaviour. This is not always the case, however, as can be seen in the examples above. As with the Arabic presentation forms, the non-character information will not be important in many situations, but in certain situations it may be considered important.
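
The decomposition mappings and their non-character tags can be inspected directly from the character database. The sketch below assumes Python's standard unicodedata module; the function name decomposition_tag is hypothetical and chosen only for illustration.

```python
import unicodedata

def decomposition_tag(ch):
    """Return (tag, mapping) from the character's decomposition property.

    tag is None for a canonical decomposition (or no decomposition) and a
    string such as '<super>' for a compatibility decomposition; mapping is
    a list of hexadecimal code points.
    """
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return fields[0], fields[1:]
    return None, fields

# The first six are compatibility characters; U+1EB6 has a canonical mapping.
for ch in ["\uFEEA", "\uFF41", "\u00B9", "\u02B0", "\u210E", "\u0E33", "\u1EB6"]:
    tag, mapping = decomposition_tag(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {tag} {' '.join(mapping)}")
```

The presence or absence of a tag is what separates the two kinds of decomposition in the data.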

When text is normalized into normalization forms D and C, the distinction between compatibility characters and their near synonyms is retained. Unicode provides two other normalization forms, however, in which these distinctions are eliminated: normalization form KD (NFKD) is a decomposed form without compatibility characters, and normalization form KC (NFKC) is a precomposed form without compatibility characters. As with forms NFD and NFC, any string has a unique NFKD or NFKC representation which can be used as a reference for comparison with other strings.

Since forms NFKD and NFKC do not contain any compatibility characters, normalizing a string into one of these forms will result in some loss of information unless that information is retained in some other way, such as markup or out-of-band information. These normalization forms must therefore be used with care so as not to lose information that may be important in a given situation. This has some relevance for archived linguistic data, and will be considered further in section 3.1.
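
The loss of information can be seen in a short sketch, again assuming Python's unicodedata module. The sample string and the helper lost_under_nfkc are hypothetical, introduced only to illustrate the point.

```python
import unicodedata

def lost_under_nfkc(text):
    """Return the characters that NFC keeps but NFKC would replace."""
    lost = []
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.normalize("NFKC", ch) != ch:
            lost.append(ch)
    return lost

# p + MODIFIER LETTER SMALL H + a + SUPERSCRIPT ONE
sample = "p\u02B0a\u00B9"

nfc = unicodedata.normalize("NFC", sample)    # compatibility characters retained
nfkc = unicodedata.normalize("NFKC", sample)  # superscripting is gone

print(nfc == sample)
print(nfkc)
print([f"U+{ord(c):04X}" for c in lost_under_nfkc(sample)])
```

A check of this kind could be run before any NFKD or NFKC step, so that the affected characters are recorded in some other way first.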


2.7. Private-Use Characters

In some situations, users may find that the characters they need are not supported in the current version of Unicode. This would be true in the case of less common scripts that have not yet been added to the standard, such as Vai, Silheti Nagri or Lanna, or in the case of uncommon characters from scripts that are already well covered in Unicode. There can also be situations in which a character is not a candidate for addition to Unicode, at least at the present time. This would be the case, for example, with novel Ethiopic characters that are being considered for use in a particular language community but that are not yet stabilized in usage. Whenever characters are candidates for addition to Unicode, the Unicode Consortium welcomes having input into the further development of the standard in the form of proposals for new characters. Processing such requests does take time, however, especially if sufficient information about the proposed characters is slow in being provided. In all of these situations, users may have a need to create custom character assignments. Unicode reserves space in its encoding known as Private Use Area (PUA) ranges specifically for this purpose.

There are two PUA ranges. The first is in the so-called basic multilingual plane (BMP), from U+E000 to U+F8FF. The second is in the so-called supplementary planes, from U+F0000 to U+10FFFF (planes 15 and 16, excluding the last two code points of each plane, which are noncharacters). It is guaranteed that no characters will ever be assigned in these ranges as part of the standard, and individual users or corporations are free to use any of the codepoints in these ranges for proprietary purposes.

An important limitation of PUA characters is that their meaning is unknown apart from any specific, and separate, documentation. This is exactly the same problem that exists with custom fonts that use custom character sets. One particular concern is that applications may not have any way of knowing what PUA characters mean and so will not provide the behaviours and results that the user expects. These issues are discussed further in section 3.5.
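
Because PUA characters depend on separate documentation, an archive may want to list any that occur in a deposited text. The following is a minimal sketch in Python; the function names are hypothetical, and the ranges are the Private Use Area ranges of the standard (U+E000 to U+F8FF, and planes 15 and 16 without their final two noncharacter code points).

```python
# Private Use Area ranges (inclusive).
PUA_RANGES = [
    (0xE000, 0xF8FF),      # basic multilingual plane
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
]

def is_private_use(ch):
    """Return True if ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

def private_use_code_points(text):
    """Return the distinct PUA code points in text, sorted, as 'U+XXXX' strings."""
    found = sorted({ord(ch) for ch in text if is_private_use(ch)})
    return [f"U+{cp:04X}" for cp in found]

print(private_use_code_points("a\uE001b\uE001\U0010FFFD"))
```

The resulting list identifies exactly which code points the accompanying metadata would have to document.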

 

3. OUTSTANDING ISSUES

3.1. Some Specific Character Encoding Issues

There are some particular encoding issues that are relevant for establishing an online archive of language and linguistic data that require further consideration. These have to do with the representation of tone in phonetic and phonemic transcription, with superscript letters for phonetic and phonemic transcription, and with the use of compatibility characters in general. I'll consider each of these in turn.

In phonetic and phonemic transcription, IPA [IPA, pp. 13ff] allows for the use of either tone letters or tone diacritics. All of the tone letters are currently supported in Unicode. Note that tone contours are represented by sequences of these characters, with the assumption that a "smart font" rendering system will display these using ligature glyphs corresponding to the various contours. 

Tone letter Unicode representation
U+02E5 MODIFIER LETTER EXTRA-HIGH TONE BAR
U+02E6 MODIFIER LETTER HIGH TONE BAR
U+02E7 MODIFIER LETTER MID TONE BAR
U+02E8 MODIFIER LETTER LOW TONE BAR
U+02E9 MODIFIER LETTER EXTRA-LOW TONE BAR

< U+02E7, U+02E8, U+02E5 >

(Some prefer tone letters with vertical bars on the left. This is simply a font variation and does not require distinct encoding.) Some tone diacritics are currently supported in Unicode, but not all:

Tone diacritic Description Unicode representation
Extra high level U+030B COMBINING DOUBLE ACUTE ACCENT
High level U+0301 COMBINING ACUTE ACCENT
Mid level U+0304 COMBINING MACRON
Low level U+0300 COMBINING GRAVE ACCENT
Extra low level U+030F COMBINING DOUBLE GRAVE ACCENT
High rising contour (not supported) 
Low rising contour (not supported)
Rising-falling contour (not supported)

In addition, some like to use superscript numbers to indicate pitch levels. Unicode already supports a complete set of superscript digits, 0 to 9:

Superscript digit Unicode representation
0 U+2070 SUPERSCRIPT ZERO
1 U+00B9 SUPERSCRIPT ONE
2 U+00B2 SUPERSCRIPT TWO
3 U+00B3 SUPERSCRIPT THREE
4 U+2074 SUPERSCRIPT FOUR
5 U+2075 SUPERSCRIPT FIVE
6 U+2076 SUPERSCRIPT SIX
7 U+2077 SUPERSCRIPT SEVEN
8 U+2078 SUPERSCRIPT EIGHT
9 U+2079 SUPERSCRIPT NINE

Tones could also be represented using these characters. One concern with the superscript digits, however, is that these are compatibility characters, and the fact that they are superscripted would be lost when data is normalized into forms NFKD and NFKC. I will return to the problem of compatibility characters later in this section.

One possible concern with having different representations for tones is that they would not be equated in string comparisons. If searching for certain tones in a text, a user would have to search for multiple representations. To avoid this problem, it could be decided to adopt one representation as a best common practice (BCP) for encoding of tones. The other alternative would be to define a character folding mapping that removes this distinction and persuade developers to incorporate support for this folding into their software.
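
To make the folding alternative concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the table TONE_FOLD, the function tone_fold_key, and in particular the digit scale (5 for extra high down to 1 for extra low), which varies between traditions. The diacritic pairings follow the level-tone table above.

```python
import unicodedata

# Hypothetical folding table: level-tone diacritics and superscript digits
# are both mapped to the tone letters U+02E5..U+02E9.
# Assumption: digits run from 5 (extra high) down to 1 (extra low).
TONE_FOLD = {
    "\u030B": "\u02E5", "\u2075": "\u02E5",  # extra high
    "\u0301": "\u02E6", "\u2074": "\u02E6",  # high
    "\u0304": "\u02E7", "\u00B3": "\u02E7",  # mid
    "\u0300": "\u02E8", "\u00B2": "\u02E8",  # low
    "\u030F": "\u02E9", "\u00B9": "\u02E9",  # extra low
}
_FOLD_TABLE = str.maketrans(TONE_FOLD)

def tone_fold_key(text):
    """Return a comparison key in which the three tone notations coincide."""
    # Decompose first so that precomposed letters expose their diacritics.
    return unicodedata.normalize("NFD", text).translate(_FOLD_TABLE)

# "má", "ma" + SUPERSCRIPT FOUR and "ma" + HIGH TONE BAR give the same key.
print(tone_fold_key("m\u00E1") == tone_fold_key("ma\u2074") == tone_fold_key("ma\u02E6"))
```

This simple version replaces each mark in place, so it only equates forms in which the tone mark immediately follows the vowel; it also treats every acute or grave accent as a tone mark, which will not hold for all data.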

If a single representation for tones is to be adopted as a BCP recommendation, the best choice for this would be the tone letters, U+02E5 to U+02E9. These are preferable for several reasons: they are already supported in Unicode and are sufficient for all tones, they are not compatibility characters, and they are specifically intended for representing tones and so are unambiguous. If it were considered important to be able to display data using one of the other representations, it would be possible to handle this by transforming to one of the other representations as data is served up from the archive and delivered to the user. It could also be handled in the rendering process, for example by using a font feature to alter the rendering behaviour of the font to show superscript numbers or tone diacritics even though the underlying characters are the tone letters.

In creating an infrastructure for a cooperative archive for linguistic data, it should be decided whether or not to make a BCP recommendation as to how tone should be represented. For the reasons described above, I would suggest that a consistent representation be adopted using the tone letters.

If such a BCP recommendation is made, consideration will have to be given as to whether it is necessary to present data using one of the other conventions for representing tone, and, if so, what mechanisms would be used to implement this. (If this were to be implemented using font features, it may be necessary to request that new features be defined for the various "smart font" formats—OpenType, AAT and Graphite—in order to handle this rendering transformation.)

If multiple representations for tone are to be allowed in data, it would probably be important to request that the tone diacritics for contour tones be added to the Unicode standard. (The linguistics community may want to do this even if it is decided not to use the tone diacritics in the online archive.) The matter of character folding for comparison purposes would also deserve further consideration.

The 1996 revision of IPA as specified in the IPA Handbook [IPA] is an important convention for phonetic and phonemic transcription, and most of the symbols are already supported in Unicode. There are a few symbols that are not currently supported, however, in addition to the contour tone diacritics mentioned above. Also, there are a number of other traditions for phonetic and phonemic transcription, and some of the symbols used in those various traditions are not currently in Unicode. Proposals can be submitted to add some or all of these symbols. I want to focus briefly on the use of superscript symbols in particular.

IPA allows for the use of various superscript letters, to indicate aspiration, for example. The following are already supported in Unicode:

Superscript letter Description Unicode representation
aspirated U+02B0 MODIFIER LETTER SMALL H
palatalized U+02B2 MODIFIER LETTER SMALL J
labialized U+02B7 MODIFIER LETTER SMALL W
velarized U+02E0 MODIFIER LETTER SMALL GAMMA
pharyngealized U+02E4 MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
nasal release U+207F SUPERSCRIPT LATIN SMALL LETTER N
lateral release U+02E1 MODIFIER LETTER SMALL L
voiceless velar fricative release U+02E3 MODIFIER LETTER SMALL X

Some superscript letters used in earlier revisions of IPA or in other traditions are also supported. A few examples follow:

Superscript letter Description Unicode representation
breathy voiced, murmured U+02B1 MODIFIER LETTER SMALL H WITH HOOK
palatalized U+02B8 MODIFIER LETTER SMALL Y
ejective U+02C0 MODIFIER LETTER GLOTTAL STOP

Unicode does not currently support two of the superscript letters in IPA. Various superscript letters used in other traditions are also missing. Superscript nasals used to show prenasalization are a familiar example:

Superscript letter Description
voiceless dental fricative release
mid central vowel release
prenasalization

(Note that some of these may also be used in transliteration traditions for various scripts. For example, the superscript schewa is used in transliterations of Hebrew to indicate voiceless (elided) Hebrew schewa pointing.) In some traditions, superscripts can also be used to indicate other secondary articulations, such as an oral stop onset to a nasal stop:

Symbols Description
voiced bilabial nasal with oral onset

Some traditions also use superscripts to reflect phonemic interpretations, for example to indicate that a vowel quality is an off-glide of a diphthong phoneme:

Symbols Description
open-mid front unrounded vowel with near-close near-back rounded off-glide

In principle, within such traditions, superscript versions of a large number of consonant and vowel symbols could potentially be used for purposes such as these. There is potential need, therefore, for a significant number of additional superscript letters to be supported in Unicode.

I highlight these potential character needs precisely because the existing superscript characters in Unicode are compatibility characters, so there is potential for the information encoded by superscripting to be lost under normalization. The linguistics community in general, and any group defining encoding policies for a cooperative archive in particular, need to decide which superscript phonetic symbols they need to be able to represent in Unicode-encoded documents, and what the best representation would be.

The immediately obvious means of encoding these superscript symbols is to add each of them to Unicode as a distinct character. This has the potential disadvantage that the new characters would be compatibility characters, but the benefit of allowing for simple implementations, and of using a mechanism that is consistent with the existing recommendations in the IPA Handbook for encoding other superscript symbols.

One alternative would be to use a single control character that combines with another phonetic character to indicate that that character is to be superscripted. For instance, something along the lines of < "b", U+xxxx ZERO-WIDTH SUPERSCRIPT MODIFIER, "m" > could potentially be used to represent a nasal with oral onset. This solution has significant drawbacks, however: it requires a non-trivial and non-transparent implementation involving a new and ad hoc mechanism. There is also potential for implementations to abuse the mechanism, applying it to situations for which it was not intended. (Such a proposal would not likely be given serious consideration by the Unicode Consortium.)

Another possibility is simply to use formatting controls within an application. This should not be done, however. Formatting controls generally require the use of proprietary file formats, which are not good choices for archived data. Furthermore, such out-of-band information is not readily accessible to analytical processes, such as searches, and it is lost when text is exported to plain text. 

Yet another possibility involves markup: given that "superscript" can be construed as non-character information, it might not be unreasonable to represent these semantics using markup. So, for example, something like the following might be a possible representation for a nasal with oral onset:

... oro<doublearticulation><secondary>b</secondary>m</doublearticulation>a...

This would be more appealing if similar markup representation were used for other elements of phonetic transcription, such as prosodic elements of an utterance. This might seem an awkward way to represent phonetic transcriptions to some, though perhaps not to anyone familiar with [MathML] markup. 

There are disadvantages, though. The most obvious is that this is a somewhat awkward representation for something that is typically perceived as text that doesn't have particularly complex layout issues and has not generally been viewed as representing much in the way of internal structure. (Mathematical formulas are markedly different in both of these regards, which is why a rich markup language makes sense for that domain.) If linguists decided that they wanted to record phonetic transcriptions that included significant structural information of some sort, a markup language might be a good way to do this. For conventional phonetic transcription, however, there is no particular need for markup except, perhaps, as a means to handle the particular issue of superscript symbols. 

Furthermore, even if markup were devised for phonetic and phonemic transcription, it is not clear how well this could be adapted for other uses, such as with Hebrew transliteration. Those uses will generally require very few superscript characters, which makes the use of markup even less appealing. 

Additional disadvantages are that it would make it more difficult for users to enter data, and it would require a non-trivial and novel implementation that uses a different mechanism than is used for the existing recommendations in the IPA Handbook for encoding other superscript symbols (unless those were to be abandoned in favour of a markup mechanism).

In view of the pros and cons, the best solution for representing additional superscript symbols appears to be to request that they be added to Unicode as distinct characters, in spite of the fact that they would be compatibility characters with "superscript" treated in the decompositions as non-character information. This leaves us in a confusing situation, though, since compatibility characters are generally to be avoided (see below).

The question remains, though, as to just what additional superscript symbols are required for use in archived linguistic data. This issue needs to be resolved by the linguistic community that will be participating in the distributed online archives we are attempting to develop.

I have considered the need for superscript letters, which are a specific case in the more general class of compatibility characters. There are many other sub-classes of compatibility characters. Some of these might be acceptable for use in archived data, given the same caveats expressed for superscripts. In general, however, compatibility characters should not be used. I will return to this in section 3.3.


3.2. Character versus Markup Representation

As mentioned in section 2.1, character encoding and document encoding are distinct levels of information representation, but there can be situations in which semantic content might reasonably be represented in either level, either in terms of markup or in terms of encoded characters. For example, paragraphs can be delimited using characters such as carriage return (U+000D) or U+2029 PARAGRAPH SEPARATOR, but this is typically done in markup languages using markup, such as <p>... </p>.

These issues are discussed in general in Draft Unicode Technical Report #20 [DUTR 20]. That document also makes recommendations for some specific situations in which markup-based control is to be preferred over character-based control. I recommend that these be adopted for online archives.

I cannot anticipate all of the possible points in the development of document standards for archived linguistic data at which we may need to choose between character- or markup-based representation. I will discuss two points very briefly.

The first has to do with superscript characters. As mentioned in the previous section, one possible solution to concerns over the use of compatibility characters would be to devise markup to represent semantics that would otherwise be represented in terms of compatibility characters. Various disadvantages to such a solution were pointed out. I would add here some observations drawn from [DUTR 20, section 2]. Encoding text as a sequence of characters results in a linear structure. In contrast, markup provides a hierarchical structure. Markup is more appropriate, therefore, for representing information that is hierarchical in nature, whereas characters are more appropriate for information that is linear in nature. In addition, markup is suited to situations where control over spans of information is required, but is not very efficient where very local control is needed. Indeed, markup that is used in local contexts can be intrusive and get in the way of processes that operate on the character data. Where local control is needed, character codes are, in general, preferable. Phonetic and phonemic transcription is essentially linear as opposed to being hierarchically structured. (This is true, at least, in the ways in which it has been done in the past and in which IPA has been designed.) Also, superscripting is applied in very local contexts. In view of the preceding observations, both of these factors argue in favour of character-based representation for semantic information embodied in superscript phonetic symbols, and against the use of markup.
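
The way local markup gets in the way of character-level processes can be shown with a small sketch, here in Python using the standard xml.etree.ElementTree module. The element names come from the illustrative markup in section 3.1; the wrapping word element is an assumption added only so that the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# Illustrative markup for a nasal with oral onset inside a word.
marked_up = ("<word>oro<doublearticulation><secondary>b</secondary>m"
             "</doublearticulation>a</word>")

# A search over the serialized text misses the sequence "bm", because
# tags intervene between the two letters.
found_in_markup = "bm" in marked_up

# The sequence is only found once the character data has been extracted.
plain = "".join(ET.fromstring(marked_up).itertext())
found_in_plain = "bm" in plain

print(found_in_markup, plain, found_in_plain)
```

Every process that operates on the transcription would need an extraction step of this kind, and the extracted text no longer records which letter was superscripted.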

The second has to do with annotated, interlinear text. Some may have observed that Unicode has three characters that are specifically intended for interlinear text annotations:

U+FFF9 INTERLINEAR ANNOTATION ANCHOR
U+FFFA INTERLINEAR ANNOTATION SEPARATOR
U+FFFB INTERLINEAR ANNOTATION TERMINATOR

Given linguists' interest in annotated, interlinear text, some might wonder if these characters are somehow intended for use with the kind of interlinear text that linguists are accustomed to dealing with. I simply wish to point out that these characters are not intended for this purpose. They are intended for a specific type of annotation text typically used for Japanese that is known as furigana. (These are short annotations beside kanji ideographs to indicate pronunciation.) Furthermore, these characters are specifically intended only for internal use by software processes, and not for storage or transmission of text. These characters should not, therefore, be used in archived data. (Note that this is one of the specific recommendations made in [DUTR 20].)
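
A check for these characters is simple to automate when data is deposited. The following Python sketch is only one possible form; the function name find_interlinear_annotation_chars is hypothetical.

```python
# The three interlinear annotation characters and their names.
INTERLINEAR_ANNOTATION = {
    "\uFFF9": "INTERLINEAR ANNOTATION ANCHOR",
    "\uFFFA": "INTERLINEAR ANNOTATION SEPARATOR",
    "\uFFFB": "INTERLINEAR ANNOTATION TERMINATOR",
}

def find_interlinear_annotation_chars(text):
    """Return (offset, name) pairs for each annotation character in text."""
    return [(i, INTERLINEAR_ANNOTATION[ch])
            for i, ch in enumerate(text)
            if ch in INTERLINEAR_ANNOTATION]

# An empty result means the text is free of these characters.
print(find_interlinear_annotation_chars("\uFFF9kanji\uFFFAkana\uFFFB"))
```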


3.3. Normalized Representation for Online Archival Data

Given that some character sequences have alternate, synonymous representations, we need to consider what the implications are for archived data, and whether a single, normalized representation should be used. 

The decomposed and composed representations each have benefits in different situations. The decomposed representation is probably preferable for many types of analytical processing. For example, writing a search process that ignores diacritics is easier if the data can be assumed to be decomposed. Decomposed data also has an internal consistency in how it represents characters, which has a certain appeal. On the other hand, it is not difficult for an application developer to include a normalization step that precedes analytical processes that require decomposed data. In addition, many existing implementations work with the composed representation for many Latin base-diacritic combinations. For example, the composed forms for European languages are used in Microsoft Windows.
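
Both points, the convenience of decomposed data and the ease of adding a normalization step, can be seen in a short Python sketch. The function names are hypothetical, and treating every nonspacing mark (general category Mn) as a diacritic to be ignored is a simplifying assumption that will not suit all scripts.

```python
import unicodedata

def strip_marks(text):
    """Decompose text and drop all nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

def contains_ignoring_diacritics(haystack, needle):
    """Return True if needle occurs in haystack once diacritics are ignored."""
    return strip_marks(needle) in strip_marks(haystack)

# Composed input is handled by the normalization step inside strip_marks.
print(contains_ignoring_diacritics("\u1EB6 b\u00E9b\u00E9", "bebe"))
```

If the stored data were already in NFD, the call to normalize could be dropped and the filter applied directly.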

There is an important argument in favour of using NFC: it is the recommendation made in a W3C working draft, Character Model for the World Wide Web [W3C CharMod, sections 4, 4.1]:

Character data interchange using W3C protocols and formats is based on the principle of early normalization, which defines the exact form to which text data has to be normalized, and the cases in which normalization must be applied... Text data is in normalized form according to this specification if all of the following apply:

(Note: the characters that are discouraged from use are enumerated in [DUTR 20]. That document is still in draft status, so it is possible that the list of characters to be avoided may change. I think it is unlikely that any changes will be made in relation to characters already in version 3.0 of Unicode, however.) The intent regarding early normalization is that data should be normalized into NFC at the earliest possible opportunity. The current thinking is that any recipient should be able to assume that data is in NFC.
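
In practice, early normalization amounts to normalizing once, at the point where data enters the archive, so that later processes can rely on it. A minimal Python sketch follows; the function names normalize_on_deposit and is_nfc are hypothetical, and only the NFC requirement is covered, not the other conditions of the W3C draft.

```python
import unicodedata

def normalize_on_deposit(text):
    """Normalize text into NFC at the earliest point, when it is deposited."""
    return unicodedata.normalize("NFC", text)

def is_nfc(text):
    """Return True if text is already in NFC, as a recipient would assume."""
    return unicodedata.normalize("NFC", text) == text

incoming = "A\u0306\u0323"   # A + COMBINING BREVE + COMBINING DOT BELOW
stored = normalize_on_deposit(incoming)

print(is_nfc(incoming), is_nfc(stored))
```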

This recommendation is intended to apply to all protocols and file formats for the World Wide Web, including HTML and XML. It is addressed to the authors of those specifications, rather than to users and developers; that is, the expectation is that the authors of the specification for, say, XML, would add these requirements to that specification. This is still a working draft, and other specifications such as [XML] do not currently reference it. Thus, we could potentially disregard it. We should perhaps anticipate, however, the possibility that other standards will require it in the future.

We have considered the choice between decomposed and composed normalizations. The issue of compatibility characters remains. One option would be to recommend not merely that NFC be used, but that the more restrictive NFKC be used. (Note that data in NFKC is, by definition, also in NFC.) In this regard, it should be noted that [TUS 3.0] generally discourages the use of compatibility characters. For example, "As with other compatibility characters, the preferred Unicode encoding is to use the nominal counterparts of these characters and use rich text font or style bindings to select the appropriate glyph size and width" (p. 274). Its statements are not entirely consistent, however:

Compatibility characters are those that would not have been encoded (except for compatibility) because they are in some sense variants of characters that have already been coded... 

Identifying one character as a compatibility variant of another character implies that generally the first can be remapped to the other without the loss of any information other than formatting. Such remapping cannot always take place because many of the compatibility characters are in place just to allow systems to maintain one-to-one mappings to existing code sets. In such cases, a remapping would lose information that is felt to be important in the original set... Because replacing a character by its compatibility equivalent character or character sequence may change the information in the text, implementation must proceed with caution. A good use of these mappings may not be in transcoding, but rather in providing the correct equivalence for searching an