Introduction
This file contains documentation on the 2002 Chinese-English Translation
Lexicon Version 3.0, Linguistic Data Consortium (LDC) catalog number
LDC2002L27 and ISBN 1-58563-238-4.
In 1999, responding to urgent demand for a Chinese-English bilingual
wordlist to support various projects, LDC quickly solicited entries
from both in-house and Internet resources and compiled two versions
of Chinese-English wordlists, "ldc_ce_dict.1.0.gb" (henceforth Version 1)
and "ldc_ce_dict.2.0.txt" (henceforth Version 2), available for free to the
general public at http://www.ldc.upenn.edu/Projects/Chinese/.
The hastily compiled Version 1, with its 24,298 entries, was relatively small and
had unbalanced coverage. Research sites reported that some definitions were
not suitable for machine translation, among other problems, and showed great interest in
an updated version. Version 2, created as an experiment, has proven impractical
for translingual information processing. Many of its entries were created
by applying simple tricks such as reversing source and target language fields
in various English-to-Chinese wordlists; as a result many entries are not
really words. The increasing demand for richer lexical resources led to
the birth of the present release, "ldc_cedict.gb.Version 3" (henceforth Version 3).
Data
What's New in Version 3
The total number of Chinese headwords in this release is 54,170.
In terms of coverage, Version 3 is a superset of Version 1 and the LDC's Mandarin
pronunciation lexicon (Version 3/Version 4). The pronunciation lexicon has a total of
44,404 entries, or 43,968 unique Chinese character strings (i.e. with
pronunciation removed). There are still 553 entries from the
pronunciation lexicon not found in Version 3. We were unable to provide accurate
translations for these headwords for various reasons: they may be very
technical; they don't make sense unless their source is re-examined; they
may have segmentation errors; or they may be rare words for which
appropriate translations could not be found due to limited time and
resources.
Version 3 also leaves out fewer than 40 entries from Version 1. Most of these are rare
single-character words whose translations cannot be verified for accuracy.
Format
There is one data file, the lexicon itself. Within the lexicon, each
entry is in this format:
head_word_in_Chinese_characters /gloss 1/gloss 2/.../gloss n/
For example:
汉语 /Chinese language/Chinese/
英文 /English language/English/
(A Chinese-capable browser is needed to see this properly. You may need to change your browser's character set to see Simplified Chinese characters.)
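The entry format described above can be read with a few lines of code. The following Python sketch is illustrative only and is not part of the LDC release. The function names parse_entry and load_lexicon are hypothetical, and the code assumes that the file is GB-encoded (the "gb18030" codec is used here because it also decodes GB2312 text), that the headword is separated from the glosses by whitespace, and that neither headwords nor glosses contain a "/" character.

```python
def parse_entry(line):
    """Split one lexicon line into (headword, [gloss, ...]).

    Assumes the layout 'headword /gloss 1/gloss 2/.../gloss n/'.
    """
    line = line.strip()
    # Everything before the first slash is the headword.
    headword, sep, rest = line.partition("/")
    headword = headword.strip()
    if not sep or not headword or not rest.endswith("/"):
        raise ValueError(f"not a lexicon entry: {line!r}")
    # Drop the trailing slash, then split the remaining glosses.
    glosses = [g for g in rest[:-1].split("/") if g]
    return headword, glosses


def load_lexicon(path, encoding="gb18030"):
    """Read the lexicon file into a dict mapping headword -> list of glosses.

    The encoding is an assumption; adjust it if the file differs.
    """
    lexicon = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            headword, glosses = parse_entry(line)
            # Merge glosses if a headword appears on more than one line.
            lexicon.setdefault(headword, []).extend(glosses)
    return lexicon
```

For instance, parse_entry("汉语 /Chinese language/Chinese/") splits the first example entry into the headword and its two glosses. Keeping the glosses as a list preserves their order in the file.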
Updates
There are no updates at this time.
Content Copyright
Portions © 2002 Trustees of the University of Pennsylvania.