


Chinese-English Translation Lexicon Version 3.0

Item Name: Chinese-English Translation Lexicon Version 3.0
Authors: Shudong Huang and David Graff
LDC Catalog No.: LDC2002L27
ISBN: 1-58563-238-4
Release Date: Jun 17, 2002
Data Type: lexicon
Data Source(s): dictionaries, web collection
Project(s): GALE, TIDES
Language(s): English, Mandarin Chinese
Language ID(s): ENG
Distribution: Web Download
Member fee: $0 for 2002 members
Non-member Fee: US $500.00
Reduced-License Fee: US $250.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Shudong Huang and David Graff. 2002. Chinese-English Translation Lexicon Version 3.0. Linguistic Data Consortium, Philadelphia.

Introduction

This file contains documentation on the 2002 Chinese-English Translation Lexicon Version 3.0, Linguistic Data Consortium (LDC) catalog number LDC2002L27 and ISBN 1-58563-238-4.

In 1999, responding to urgent demand for a Chinese-English bilingual wordlist to support various projects, LDC quickly solicited entries from both in-house and Internet resources and compiled two versions of Chinese-English wordlists, "ldc_ce_dict.1.0.gb" (henceforth Version 1) and "ldc_ce_dict.2.0.txt" (henceforth Version 2), available for free to the general public at http://www.ldc.upenn.edu/Projects/Chinese/.

The hastily compiled LDC Version 1, with its 24,298 entries, was relatively small and had unbalanced coverage. Research sites reported problems such as definitions that were not suitable for machine translation, and showed great interest in an updated version. Version 2, created as an experiment, has proven impractical for translingual information processing. Many of its entries were created by applying simple tricks such as reversing the source and target language fields in various English-to-Chinese wordlists; as a result, many entries are not really words. The increasing demand for richer lexical resources led to the present release, "ldc_cedict.gb.Version 3" (henceforth Version 3).

Data

What's New in Version 3

The total number of Chinese headwords in this release is 54,170.

In terms of coverage, Version 3 is nearly a superset of Version 1 and the LDC's Mandarin pronunciation lexicon (Version 3/Version 4). The pronunciation lexicon has a total of 44,404 entries, or 43,968 unique Chinese character strings (i.e., with pronunciation removed). There are still 553 entries from the pronunciation lexicon that are not found in Version 3. We were unable to provide accurate translations for these headwords for various reasons: they may be very technical; they may not make sense unless their source is re-examined; they may have segmentation errors; or they may be rare words for which appropriate translations could not be found due to limited time and resources.

Version 3 also omits fewer than 40 entries from Version 1. Most of these are rare single-character words whose translations could not be verified for accuracy.
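
The coverage relationships described above amount to set differences over headwords. The following is a minimal sketch of such a check, not the procedure LDC used: the function name missing_headwords is hypothetical, and the headword sets are toy data made up for illustration, not entries or counts taken from the release.

```python
from typing import Iterable, Set


def missing_headwords(older: Iterable[str], newer: Iterable[str]) -> Set[str]:
    """Return the headwords that appear in `older` but not in `newer`."""
    return set(older) - set(newer)


# Toy data for illustration only; not taken from the actual lexicons.
version_1 = {"汉语", "英文", "词典"}
version_3 = {"汉语", "英文", "翻译", "语言"}

omitted = missing_headwords(version_1, version_3)
print(sorted(omitted))         # ['词典']
print(version_1 <= version_3)  # False: the newer set is not a strict superset
```

Run against the real headword lists, the same comparison would be expected to yield the 553 pronunciation-lexicon entries and the fewer than 40 Version 1 entries reported above.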

Format

There is one data file, the lexicon itself. Within the lexicon, each entry is in this format:

head_word_in_Chinese_characters /gloss 1/gloss 2/.../gloss n/
For example:

汉语    /Chinese language/Chinese/
英文    /English language/English/
(A Chinese-capable browser is needed to see this properly. You may need to change your browser's character set to see Simplified Chinese characters.)
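
An entry in this format can be split into a headword and its glosses with a few lines of Python. The following is a minimal sketch, not part of the LDC release. The names parse_entry and load_lexicon are hypothetical, and two things are assumptions: that the headword is separated from the gloss list by whitespace, and that the file is encoded in GB2312 (suggested by the ".gb" in the file name, but not confirmed here). The sample line is the first example entry, whose headword is 汉语 ("Chinese language").

```python
from typing import Dict, List, Tuple


def parse_entry(line: str) -> Tuple[str, List[str]]:
    """Split one lexicon line into (headword, glosses).

    Assumes the layout 'headword<whitespace>/gloss 1/.../gloss n/'.
    """
    line = line.strip()
    # The first '/' marks the start of the gloss list.
    slash = line.find("/")
    if slash == -1:
        raise ValueError(f"no gloss list found in: {line!r}")
    headword = line[:slash].strip()
    # Drop the empty pieces produced by the leading and trailing '/'.
    glosses = [g.strip() for g in line[slash:].split("/") if g.strip()]
    return headword, glosses


def load_lexicon(path: str, encoding: str = "gb2312") -> Dict[str, List[str]]:
    """Read a lexicon file into a dict mapping headword -> list of glosses.

    The default encoding is an assumption; pass another one if needed.
    """
    lexicon: Dict[str, List[str]] = {}
    with open(path, encoding=encoding, errors="replace") as handle:
        for raw in handle:
            if not raw.strip():
                continue  # skip blank lines
            headword, glosses = parse_entry(raw)
            lexicon.setdefault(headword, []).extend(glosses)
    return lexicon


example = "汉语    /Chinese language/Chinese/"
print(parse_entry(example))  # ('汉语', ['Chinese language', 'Chinese'])
```

If the headwords show up as strings of accented Latin characters, the file was most likely decoded as Latin-1 instead of a Chinese encoding; reopening it with the correct encoding argument should restore the characters.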

Updates

There are no updates at this time.

Content Copyright

Portions © 2002 Trustees of the University of Pennsylvania.



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.