Introduction to Chinese Information Processing

This page introduces some basic concepts in Chinese information processing. For an in-depth understanding, you are encouraged to read the book "CJKV Information Processing" by Ken Lunde, published by O'Reilly and available at Amazon.com. Much of the information on this page is based on that book.

Chinese Character Set Standards

With tens of thousands of Chinese characters in existence, character set standards are much needed: they ensure that only a minimum number of characters must be learned for effective communication. The English alphabet, which specifies 52 upper- and lower-case letters, is perhaps the simplest character set standard.

Computer systems make use of coded (electronic) character set standards, which are usually larger than non-coded character sets. The ASCII character set standard, for example, has 42 printable characters in addition to the 52 upper- and lower-case letters.

Row-Cell is an often encountered term when dealing with character sets of Chinese and other East Asian languages. A Row-Cell provides an index for a character and usually consists of four decimal digits - both the Row and Cell portions consist of a two-digit number ranging from 1 to 94. It is a very useful notation when you compare different character set standards and different encoding methods.

Character set standards are usually established by a government or a government-sanctioned organization in a given country or region. Several character set standards have been established in different regions of greater China. GB (and its variations) and BIG 5 (and its variations) are the two best-known standards in Chinese-speaking communities.

GB is abbreviated from GuoBiao, which in turn is abbreviated from GuoJia BiaoZhun, meaning "national standard." Most references to GB mean the GB 2312-80 character set standard, established by mainland China in 1981 to represent simplified Chinese characters. Subsequent corrections and extensions include GB 6345.1-86, GB 8565.2-88, and ISO-IR-165:1992.

GB 2312-80 enumerates 7,445 characters, including 6,763 Hanzi and 682 non-Hanzi characters. Hanzi characters are divided into two levels - characters in level 1 (3,755) are ordered by Pinyin whereas those in level 2 (3,008) are ordered by radical, then strokes.

GB/T 12345-90, where T stands for the initial letter of Tuijian ("recommended"), is a 'traditional analog' of GB 2312-80 in that simplified characters in GB 2312-80 are replaced with their traditional counterparts.

GBK, where K stands for the initial letter of Kuozhan ("expanded"), is a character set aligned to ISO 10646-1:1993 (Unicode Version 1.1). It is backwards compatible with GB 2312-80, as every character in GB 2312-80 is at the same Row-Cell point in GBK. The Simplified Chinese versions of Microsoft Windows 95 and Windows 98 use GBK internally.

In addition to mainland China, Singapore also uses the GB 2312-80 character set standard.

BIG 5, whose name refers to the five companies that collaborated in its development, was established in 1984 and has become the de facto character set standard in Taiwan. BIG 5 has a much larger character set than GB 2312-80; it specifies 13,494 characters, including 13,053 Hanzi and 441 non-Hanzi. Like GB 2312-80, it has two levels of Hanzi, but both levels are ordered by increasing total number of strokes, then by radical.

BIG5+ (BIG 5 Plus), developed by a number of companies in Taiwan, extends the character set of BIG 5 to a total of 21,585 characters, including simplified characters. It is not widely implemented yet.

The national character set standard of Taiwan is actually CNS 11643-1992 (where CNS stands for Chinese National Standard). This is the largest Chinese character set standard, enumerating a total of 48,027 characters.

Hong Kong extended BIG 5 to include locale-specific characters; the resulting character set standard is known as GCCS ("Government Chinese Character Set").

The International Organization for Standardization and the Unicode Consortium have been working together to combine most of the world's writing systems and character standards into a larger repertoire of characters. Unicode, a subset of ISO 10646-1:1993 (where characters have two- and four-byte representations), uses a variable-length 16-bit representation.

Unicode, as an attempt to unify all Chinese characters independent of any country or region that uses them, merges major Chinese character set standards into one larger repertoire and enumerates a total of 20,902 unique Chinese characters - including Chinese Hanzi, Japanese Kanji, Korean Hanja, and Vietnamese Chu Nom (CJKV) - from the earliest to the latest version (1.0 to 2.1). Chinese characters that differ only in glyphs are often combined to occupy a single point. The ordering of Chinese characters is largely culture neutral.

Unicode is under constant development and is gaining support from industry. It has been used to implement many software products, including Microsoft Windows platforms since Windows 95, Microsoft Office 97, Internet Explorer, Sun's Solaris, Netscape Communicator, etc.

Encoding Methods

Encoding maps each character to a numeric value so that the character can be identified by that value. Computer systems process data as bits, the most basic units of information. A bit is either 1 (on) or 0 (off), and bits are grouped into units called bytes. Bytes can be composed of 7 or 8 bits: a 7-bit byte allows up to 128 unique combinations and an 8-bit byte up to 256. These numbers are enough to encode most writing systems of Western languages (the ASCII character set, for example), but they are far from enough for Chinese, or for other East Asian languages such as Japanese and Korean, which have tens of thousands of distinct characters. The solution is to use multiple bytes to represent a single character. For example, an 8-bit 2-byte system can encode up to 65,536 (256 x 256) characters.
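The arithmetic above can be checked with a short Python sketch. The helper name bytes_needed is purely illustrative, and the repertoire sizes passed to it are the figures quoted on this page.

```python
# Number of distinct values available with one or two bytes.
seven_bit = 2 ** 7           # 7-bit byte
eight_bit = 2 ** 8           # 8-bit byte
two_byte = eight_bit ** 2    # two 8-bit bytes


def bytes_needed(repertoire_size, bits_per_byte=8):
    """Smallest number of bytes whose combinations cover the repertoire."""
    n = 1
    while (2 ** bits_per_byte) ** n < repertoire_size:
        n += 1
    return n


print(seven_bit, eight_bit, two_byte)
print(bytes_needed(94))      # printable ASCII characters fit in one byte
print(bytes_needed(7445))    # GB 2312-80 needs two bytes
```

Real encodings reserve some byte values for control codes, so the usable space is smaller than this upper bound.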

Below is a list of popular encoding methods.

Locale-Independent Encoding Methods

ISO-2022-CN is an implementation of ISO-2022 encoding, used primarily as an information interchange code for moving text between computer systems. ISO-2022 encoding in general is not efficient for internal storage or processing on computer systems.

ISO-2022-CN supports the following character sets: ASCII, GB 2312-80, CNS 11643-1992 Planes 1 and 2. Special characters including designator, single shift and shifting sequences are used to invoke two-byte modes or to switch between one- and two-byte modes. ISO-2022-CN-EXT provides support for additional character sets.

HZ encoding is a predecessor to ISO-2022-CN. It uses the special characters ~, { and } to do code-switching. HZ encoding is popular for exchanging e-mail messages and posting news in Chinese newsgroups (particularly alt.chinese.text).
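The code-switching can be illustrated with a minimal Python sketch. It assumes the usual HZ conventions: "~{" enters two-byte mode, "~}" returns to ASCII, "~~" stands for a literal tilde, and each two-byte character is its GB 2312-80 (EUC-CN) byte pair with the high bit cleared. The function name to_hz is hypothetical, and line-continuation handling is left out. Python's built-in "gb2312" and "hz" codecs are used for the byte values and for the round-trip check.

```python
def to_hz(text):
    """Encode text as 7-bit HZ, switching modes with ~{ and ~}."""
    out = bytearray()
    in_gb = False                      # True while in two-byte mode
    for ch in text:
        raw = ch.encode("gb2312")      # EUC-CN bytes; raises if not in GB 2312-80
        if len(raw) == 2:
            if not in_gb:
                out += b"~{"
                in_gb = True
            out += bytes(b & 0x7F for b in raw)   # clear the high bit
        else:
            if in_gb:
                out += b"~}"
                in_gb = False
            out += b"~~" if ch == "~" else raw
    if in_gb:
        out += b"~}"                   # always finish in ASCII mode
    return bytes(out)


encoded = to_hz("GB 汉字 ok")
print(encoded)
print(encoded.decode("hz"))            # decode with Python's own HZ codec
```

Because every output byte is below 0x80, the result survives mail and news transports that are not 8-bit clean.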

GB 2312-80 is a locale-specific version of ISO-2022. It is still widely used.

EUC-CN is an instance of Extended Unix Code (EUC, implemented as the internal code for most Unix software). It is sometimes referred to as eight-bit GB or simply GB encoding outside the context of EUC encoding.

EUC encoding in general consists of 4 code sets, code set 0 to 3. In EUC-CN, code set 0 covers the ASCII or GB-Roman character set and code set 1 encodes GB 2312-80. It does not use code sets 2 and 3.
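As a rough illustration of how code set 1 relates to Row-Cell notation, the sketch below assumes the standard EUC-CN layout, in which each of the two bytes is the Row or Cell number plus 0xA0. The function names are hypothetical; Python's "gb2312" codec is used as the EUC-CN implementation.

```python
def to_row_cell(ch):
    """Return (row, cell) for a GB 2312-80 character from its EUC-CN bytes."""
    raw = ch.encode("gb2312")
    if len(raw) != 2:
        raise ValueError("not a two-byte (code set 1) character")
    return raw[0] - 0xA0, raw[1] - 0xA0


def from_row_cell(row, cell):
    """Return the character at a Row-Cell position (both 1 to 94)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + 0xA0, cell + 0xA0]).decode("gb2312")


print(to_row_cell("啊"))        # first Hanzi of level 1
print(from_row_cell(26, 26))
```

Unassigned positions raise UnicodeDecodeError in from_row_cell, since not every Row-Cell slot holds a character.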

The crucial difference between ISO-2022 and EUC encodings is that the former uses seven-bit bytes and the latter eight-bit bytes.

EUC-TW, the most complex instance of EUC encoding, is used for the Taiwan locale and encodes about 50,000 characters. Code set 2 is heavily loaded but code set 3 is not used.

Locale-Specific Encoding Methods

GBK encoding is virtually an extension of GB 2312-80, where K is the initial letter of Kuozhan ("extended"). It encodes the Chinese subset of ISO 10646-1:1993 with a total number of 21,886 characters. GBK is implemented as the internal code of simplified Chinese versions of Microsoft Windows 9x and IBM's OS/2.

GBK is a superset of GB 2312-80 and hence provides compatibility with the latter encoding.

BIG5 encoding mixes one- and two-byte encoding and the second byte values also extend into the 7-bit region. That is, both 7- and 8-bit bytes are used for the second byte of two-byte encoded characters. Some parts of BIG5 are equivalent to EUC-TW code sets 0 and 1.
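The second-byte behavior can be observed with Python's built-in "big5" codec. This is only a sketch; the helper names are illustrative and the sample characters were chosen for the demonstration.

```python
def byte_pairs(text, codec):
    """Yield (char, first_byte, second_byte) for each two-byte character."""
    for ch in text:
        raw = ch.encode(codec)
        if len(raw) == 2:
            yield ch, raw[0], raw[1]


def has_7bit_second_byte(ch, codec="big5"):
    """True if the character's second byte falls in the 7-bit range."""
    raw = ch.encode(codec)
    return len(raw) == 2 and raw[1] < 0x80


for ch, first, second in byte_pairs("中功", "big5"):
    kind = "7-bit" if second < 0x80 else "8-bit"
    print(ch, hex(first), hex(second), kind)
```

A 7-bit second byte can coincide with an ASCII character, which is why software that scans BIG5 text byte by byte, without tracking character boundaries, can misread it. In EUC-CN, by contrast, both bytes of a two-byte character always have the high bit set.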

BIG5+ extends BIG5 to include additional characters due to influences from Unicode and CNS 11643-1992. Hong Kong's version of BIG5 also extends BIG5 to accommodate specific characters used there.

The crucial difference between BIG5 and GBK encodings lies in the second byte, both in terms of character allocation and the kind of byte used.

International Encoding Methods - Unicode Encodings

The international character standards established by the ISO and the Unicode Consortium and their encoding methods are best summarized in the following table.

Character Set       Encoding Methods
Unicode             UCS-2, UTF-7, UTF-8, and UTF-16
ISO 10646-1:1993    UCS-2, UCS-4, UTF-7, UTF-8, and UTF-16

Please visit the Unicode Consortium's Web Site for specifications of these encoding methods.
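To get a feel for how these encoding methods differ, the following Python sketch encodes one character in several of them. It relies on Python's built-in codecs; Python has no codec named UCS-2 or UCS-4, so "utf-16-be" and "utf-32-be" are used as stand-ins, which is an assumption that holds for characters in the basic 16-bit range. The function name encoded_forms is illustrative.

```python
CODECS = {
    "UTF-7": "utf-7",
    "UTF-8": "utf-8",
    "UTF-16 (big-endian)": "utf-16-be",
    "UTF-32 (big-endian)": "utf-32-be",
}


def encoded_forms(ch):
    """Return a mapping from encoding name to the encoded bytes of ch."""
    return {name: ch.encode(codec) for name, codec in CODECS.items()}


for name, raw in encoded_forms("汉").items():    # U+6C49
    print(f"{name:22} {len(raw)} bytes  {raw.hex(' ')}")
```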

Chinese Code Conversion

Conversion tables are needed to convert one character set to another. The most frequent conversion, between GB 2312-80 and BIG5, is also the most frustrating. There are three reasons:

  1. BIG5 has twice as many characters as GB 2312-80;
  2. Simplified characters (about one third) in GB 2312 are not available in BIG5; and
  3. Many of the simplified characters in GB 2312 were 'simplified' from two or more traditional forms - context (as in words or phrases) is needed for the right mapping.
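The second and third reasons can be sketched in Python. The two tiny tables below are hypothetical stand-ins for real conversion tables, and the function names are illustrative. The approach shown (try a word table first, then fall back to a character table) is one simple way to bring in context, not a description of any particular converter.

```python
# Hypothetical miniature tables; real ones hold thousands of entries.
WORD_TABLE = {"头发": "頭髮", "发展": "發展"}   # context decides 髮 vs 發
CHAR_TABLE = {"发": "發", "头": "頭"}           # default when no word matches


def in_charset(ch, codec):
    """True if the character can be encoded in the given character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def to_traditional(text):
    """Convert simplified text, preferring word-level matches."""
    out = []
    i = 0
    max_len = max(len(word) for word in WORD_TABLE)
    while i < len(text):
        # Try the longest word first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += size
                break
        else:
            # No word matched: fall back to the single-character table.
            out.append(CHAR_TABLE.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(in_charset("发", "big5"), in_charset("發", "big5"))
print(to_traditional("头发"), to_traditional("发展"))
```

The same simplified character comes out as two different traditional characters depending on the word it appears in, which a character-by-character table cannot achieve.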

Input Methods

Typing Chinese characters into computers requires an input method and a conversion dictionary. Under any input method, the user types a string of letters; the computer looks the string up in the conversion dictionary specified by the input method and presents the user with a list of candidate characters to which the string is mapped. The user then types the number of the desired character to make the choice.
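The lookup-and-choose cycle can be sketched in a few lines of Python. The conversion dictionary here is a hypothetical five-entry sample, far smaller than any real one, and the function names are illustrative.

```python
# Hypothetical sample of a conversion dictionary: typed string -> candidates.
CONVERSION_DICT = {
    "han": ["汉", "喊", "含", "寒", "旱"],
    "zi": ["字", "子", "自", "紫"],
}


def candidates(keys):
    """Return numbered candidates for the typed string, starting at 1."""
    return list(enumerate(CONVERSION_DICT.get(keys, []), start=1))


def choose(keys, number):
    """Return the candidate the user selected by number."""
    return CONVERSION_DICT[keys][number - 1]


print(candidates("han"))
print(choose("han", 1) + choose("zi", 1))
```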

There are more than a dozen input methods available for typing Chinese characters. They fall into two categories: transliteration-based and structure-based.

Transliteration input methods

A transliteration input method is based on Chinese phonetics. The most basic implementation of such input methods is the Pinyin method, which simply uses the Pinyin table as its conversion dictionary. For example, to input 汉 as in 汉字 ("Chinese character"), simply type its Pinyin letters, han, on the standard keyboard. You will get a list of 30 or so candidates. Obviously, this is not an efficient input method since each Pinyin string can be mapped to many characters - the Pinyin string for the second character in this example offers about 40 candidates. Furthermore, the Pinyin representation of some characters can have up to 6 Roman letters.

There are several approaches to making a Pinyin-based input method more efficient. The first is the addition of tones. Since Chinese (Mandarin) has 4 tones, you can on average reduce the number of candidates by three quarters. Secondly, under some transliteration input methods, certain combinations of Pinyin letters are replaced with single letters and the number of key strokes is significantly reduced. One input method that takes the second approach is known as Shuangpin ("Double Pinyin"), which simply uses two key strokes for any given character. Another way of reducing the number of candidates is to input more than one character at a time, for example, typing in words, phrases or even larger units.
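The effect of adding tones can be sketched as follows. The table is a hypothetical seven-character sample for the syllable han, grouped by tone number; a real table is much larger, and the function name lookup is illustrative.

```python
# Hypothetical sample: Pinyin syllable -> tone number -> candidates.
TONED = {
    "han": {
        1: ["憨"],
        2: ["含", "寒"],
        3: ["喊"],
        4: ["汉", "汗", "旱"],
    },
}


def lookup(pinyin, tone=None):
    """Return candidates for a syllable, optionally narrowed by tone."""
    by_tone = TONED.get(pinyin, {})
    if tone is not None:
        return list(by_tone.get(tone, []))
    # Without a tone, every character for the syllable is a candidate.
    return [ch for t in sorted(by_tone) for ch in by_tone[t]]


print(lookup("han"))       # all candidates
print(lookup("han", 4))    # only fourth-tone candidates
```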

The Taiwan counterpart of Pinyin input is known as Zhuyin (a.k.a. "bopomofo") since Taiwan uses Zhuyin symbols to transliterate Chinese characters. Keyboards used for Zhuyin input are often imprinted with "bopomofo" symbols (in addition to the standard Roman letters). As many Cantonese speakers, especially those who reside in Hong Kong, do not speak Mandarin, there are also transliteration input methods based on Cantonese pronunciations.

Structure-based input methods

As mentioned in the Introduction, Chinese characters are structured in terms of radicals and strokes (of various shapes). By studying the structures of Chinese characters, some structure-based input methods have been devised, and they usually use far fewer keystrokes than transliteration input methods. The best designed and perhaps the fastest input method of this kind is Wubi, devised by Wang Yongmin from mainland China. With this input method, common characters can be input with just 2 key strokes and the maximum number of key strokes for any character is four. What is really amazing about this input method is that there is almost no need to choose, because most of the key combinations are unique. Furthermore, the mechanism extends to multiple-character words and phrases and assigns a maximum of 4 key strokes to them. A well-trained and experienced user can type over 100 characters per minute, faster than you can speak the language! Notice that the name Wubi, which literally means "five strokes", has nothing to do with stroke number. It simply divides the keyboard into five major regions, hence the name.

However, structure-based input methods are usually difficult to learn and easy to forget. In addition, you have to know how to write a character in order to type it. By contrast, transliteration input methods are more intuitive and easy to grasp.

In addition to word and phrase input, there are two other 'intelligent' mechanisms that are employed in many input methods. One is dynamic re-ordering - the list of choices is constantly adjusted according to your frequency of word usage. This is particularly useful for input methods that incorporate words and phrases into their conversion dictionaries, because the choice menu can be long and you have to 'scroll down' the menu to select the right candidate (usually with certain keys but sometimes with the mouse). The other mechanism is called 'association'. With association invoked, when a character or a word is typed, a list of other characters that often follow it as a word or phrase will appear, possibly making subsequent typing unnecessary.
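Dynamic re-ordering can be sketched with a usage counter. The class below is a hypothetical illustration, not the design of any particular input method: it counts each selection and sorts candidates by that count, keeping the dictionary order for ties.

```python
from collections import Counter


class Reorderer:
    """Candidate list that adapts to how often each choice is picked."""

    def __init__(self, conversion_dict):
        self.conversion_dict = conversion_dict
        self.usage = Counter()          # character -> times selected

    def candidates(self, keys):
        base = self.conversion_dict.get(keys, [])
        # sorted() is stable, so equally used candidates keep their order.
        return sorted(base, key=lambda ch: -self.usage[ch])

    def pick(self, keys, number):
        chosen = self.candidates(keys)[number - 1]
        self.usage[chosen] += 1
        return chosen


ime = Reorderer({"han": ["汉", "喊", "含"]})
print(ime.candidates("han"))
ime.pick("han", 3)                      # the user selects the third candidate
print(ime.candidates("han"))            # it now appears first
```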

Other input methods

There are other kinds of input methods. For example, Quwei (meaning "regional position") inputs Chinese characters using the Row-Cell numbers in GB 2312-80. Dianbaoma ("telex code") simply uses China's telex code developed in 1911. While these input methods are unambiguous, they are of little practical use unless you have some special talent and can remember all the codes.
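A Quwei lookup amounts to decoding a four-digit Row-Cell code. The sketch below assumes the standard EUC-CN relationship (each byte is the Row or Cell number plus 0xA0) and uses Python's "gb2312" codec; the function name quwei_to_char is hypothetical.

```python
def quwei_to_char(code):
    """Return the GB 2312-80 character for a four-digit Row-Cell code."""
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected four decimal digits")
    row, cell = int(code[:2]), int(code[2:])
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 01..94")
    # EUC-CN bytes are the row and cell numbers offset by 0xA0.
    return bytes([0xA0 + row, 0xA0 + cell]).decode("gb2312")


print(quwei_to_char("2626") + quwei_to_char("5554"))
```

Every valid code maps to at most one character, which is why no candidate list is needed; codes for unassigned positions raise UnicodeDecodeError here.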

While the above-mentioned input methods all use the keyboard, non-keyboard input techniques, including pen and voice inputs, are now being developed and some products are already on the market.

Word Segmentation

Because Chinese is written without any space between words, word segmentation is a particularly important issue for Chinese language processing. We will provide you with more information later on. For now, you can read the segmentation guideline (PDF format) for the Chinese Treebank Project. If you are interested, you can also read the project's parts of speech guideline (PDF format).