This file contains documentation for Chinese <-> English Name Entity Lists, v.1.0, Linguistic
Data Consortium (LDC) catalog number LDC2005T34 and ISBN 1-58563-368-2.
These Chinese-English bi-directional name entity lists
are compiled from Xinhua News Agency newswire texts. Not every irregularity
in the original source has been detected and normalized. Some Chinese
characters are not encoded in the source and brackets are used to describe
their composition. Except for the person name lists, most instances were left untouched in the created lists. An effort was made to replace
GB-encoded characters (such as Roman numerals) in the English translation
with ASCII characters. However, no attempt has been made to do the opposite for
Chinese names. The use of slashes as delimiters presents another problem. Some names
may have internal slashes. Initially, double quotes ("") were used to enclose a name with
an internal slash to avoid confusion, without realizing that there is just
one " character in ASCII (as opposed to a pair of enclosing " in GB). Later it was decided
to use &slash; instead. In future releases, some lists will be changed for greater
consistency. Finally, most of the English names in the source are in lowercase
throughout. An effort was made to capitalize the initial letter (and
possibly some internal ones) for person names, but not for any other kind of
name, as most other names consist of multiple words, some of which may be
articles and prepositions.
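
The slash convention described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the exact field layout of each list line is not specified in this documentation, so the sketch assumes that fields are separated by "/" and that a slash inside a name is written as &slash;. The function names and the sample entry are hypothetical.

```python
SLASH_ENTITY = "&slash;"  # escape used for a slash inside a name, per the text above


def split_entry(line: str) -> list[str]:
    """Split one line on '/' delimiters, then restore escaped slashes.

    Assumes (not confirmed by this documentation) that every field on the
    line is separated by a single '/'.
    """
    fields = line.rstrip("\n").split("/")
    return [field.replace(SLASH_ENTITY, "/") for field in fields]


def join_entry(fields: list[str]) -> str:
    """Inverse of split_entry: escape internal slashes, then join on '/'."""
    return "/".join(field.replace("/", SLASH_ENTITY) for field in fields)


# Hypothetical entry, not taken from the lists themselves.
sample = "A&slash;B Trading Co./某贸易公司"
fields = split_entry(sample)
```

Because the delimiters are split before the escapes are restored, a name containing a slash survives a round trip through join_entry and split_entry. Entries that still use the earlier double-quote convention are not handled by this sketch.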
The word "English" is somewhat misleading here. Although
most of the foreign words are English or can appear in English texts, there are
also many non-English words written in the Roman alphabet, some of which may
have English equivalents while others do not. No effort has been made to
eliminate non-English names for which English equivalents are available.
The entire set consists of nine pairs of lists. The English->Chinese version
of each pair was created by reversing the Chinese->English version; both were
sorted with the Unix sort utility.
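
As a rough illustration of the reversal step, the following Python sketch swaps each pair and sorts the result. It assumes the entries have already been parsed into (Chinese, English) tuples; the function name and sample pairs are hypothetical. Python's sorted() orders strings by code point, which may differ from the ordering the Unix sort utility produced for the released lists, depending on locale.

```python
def reverse_pairs(pairs):
    """Swap each (chinese, english) pair and return the swapped pairs sorted.

    Sorting uses Python's default string ordering (by code point); this is
    an assumption and need not match the order in the released lists.
    """
    return sorted((english, chinese) for chinese, english in pairs)


# Hypothetical sample pairs, not taken from the lists themselves.
zh_to_en = [("北京", "Beijing"), ("上海", "Shanghai")]
en_to_zh = reverse_pairs(zh_to_en)
```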
The contents are as follows:
- Place Names, Chinese to English: 276,382
- Place Names, English to Chinese: 298,993
- Organization Names, Chinese to English: 30,800
- Organization Names, English to Chinese: 37,145
- Corporate Names, Chinese to English: 54,747
- Corporate Names, English to Chinese: 58,468
- Press Organization Names, Chinese to English: 29,757
- Press Organization Names, English to Chinese: 32,922
- Intl. Organization Names, Chinese to English: 7,040
- Intl. Organization Names, English to Chinese: 7,040
For an example of the data in this publication, please view this screen capture of the corporate names list.
Portions © 2001 Xinhua News Agency, © 2002, 2005 Trustees of the University of Pennsylvania
The Reduced Licensing Fee for this corpus is US$100.