

Lexicon: Corpus specification/structure of entries

In creating an electronic lexicon, one must be clear about two closely related points: what the lexicon will be used for, and what kinds of words it should cover.
If the lexicon is intended for one particular corpus only, one can start the project with a list of entries obtained by cutting the corpus into words and sorting them. However, some languages present segmentation problems of varying degrees. Languages such as Chinese and Japanese, for example, do not separate words with spaces, so a set of segmentation principles is necessary. For these languages it is easier to start with an existing electronic wordlist, to the extent that copyright restrictions allow, than to create a lexicon from scratch. Additional items found in the corpus can then be checked against the wordlist for wordhood and added to the lexicon if they are determined to be words. (Even if the language uses spacing, rich inflection or agglutination may multiply the number of entries.)
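The Python sketch below illustrates this workflow for a spaced language: derive candidate word types from the corpus and keep only those not already in an existing wordlist. The file names, the whitespace tokenization, and the manual wordhood check are assumptions made purely for illustration.

    # A minimal sketch of deriving candidate entries from a corpus and
    # checking them against an existing wordlist. File names and the
    # whitespace tokenization are illustrative assumptions; languages
    # like Chinese or Japanese would need a real segmenter instead.
    import re
    from pathlib import Path

    def corpus_types(path):
        """Collect the set of word types occurring in a plain-text corpus."""
        text = Path(path).read_text(encoding="utf-8")
        return set(re.findall(r"\S+", text.lower()))

    def new_candidates(corpus_path, wordlist_path):
        """Return corpus types not already covered by the existing wordlist."""
        known = set(Path(wordlist_path).read_text(encoding="utf-8").split())
        return sorted(corpus_types(corpus_path) - known)

    # Example usage (hypothetical files):
    # for w in new_candidates("corpus.txt", "wordlist.txt"):
    #     print(w)   # each candidate is then checked by hand for wordhood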
If coverage is not delimited in advance, the time factor often comes to weigh heavily on the cost of the project. It should be noted that the vast majority of word tokens occurring in texts are covered by a few thousand of the language's most common words, and the cost per additional token covered rises roughly geometrically as one tries to cover the rest of the vocabulary.
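A quick way to check this for a given corpus is to measure the cumulative token coverage of the most frequent word types, as in the sketch below; the corpus file and the cutoff of 5,000 types are illustrative assumptions.

    # A small sketch of measuring how many tokens the n most frequent
    # word types cover; "tokens" is assumed to be a list of corpus tokens.
    from collections import Counter

    def coverage(tokens, n=5000):
        """Fraction of all tokens accounted for by the n most frequent types."""
        counts = Counter(tokens)
        total = sum(counts.values())
        top = sum(c for _, c in counts.most_common(n))
        return top / total if total else 0.0

    # Example usage (hypothetical corpus file):
    # tokens = open("corpus.txt", encoding="utf-8").read().split()
    # print(f"Top 5000 types cover {coverage(tokens):.1%} of tokens")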
Related to the purpose of the lexicon is the question of what information should be provided for each entry. If the lexicon is intended for machine translation, a structured gloss is more useful than a detailed definition. A transducer or parser may need specific information on part of speech or frequency, depending on its algorithm. A pronunciation dictionary needs phonetic transcription and information on variation, but a gloss is unnecessary for most words.
We have to remember that the aim of an electronic lexicon is different from that of a printed dictionary. The entries of an electronic lexicon, for example, should contain all orthographic or inflectional variants unless these can be derived automatically. Since programs handle simple one- or two-dimensional data structures most easily, embedded entries of the kind found in printed dictionaries should be avoided in an electronic lexicon. Information for generating all derivative forms is necessary, whereas a gloss or definition is not necessarily a top priority. Transliteration is also necessary if it cannot be derived uniquely from the headword.
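One way to realize this is a flat, one-record-per-form layout, as in the Python sketch below. The field names and the Japanese examples are assumptions chosen for illustration, not a prescribed format; the point is that each variant is its own record rather than a sub-entry nested under a headword.

    # A sketch of a flat entry layout: each orthographic or inflectional
    # variant gets its own record, with transliteration supplied when it
    # cannot be derived from the form. Field names are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Entry:
        form: str             # the surface form as it appears in text
        lemma: str            # headword the form belongs to
        pos: str              # part of speech
        transliteration: str  # given here because it is not derivable from the form
        gloss: str = ""       # optional; not always a top priority

    entries = [
        Entry("走った", "走る", "verb", "hashitta", "ran"),
        Entry("走って", "走る", "verb", "hashitte", "running"),
    ]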
A good lexicon should be machine independent so that it can easily be adapted to any operating system and/or application. Tab-delimited text files are usually preferred. Emacs with the multilingual extension (Mule) is one of the best editors for this purpose.
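Such a file can be read and written with ordinary text tools in any language; the Python sketch below shows one way, with the column layout and file handling assumed for illustration.

    # A minimal sketch of reading and writing a tab-delimited lexicon as
    # plain text; the column order is an assumption for illustration.
    import csv

    FIELDS = ["form", "lemma", "pos", "transliteration", "gloss"]

    def read_lexicon(path):
        """Load tab-delimited rows into a list of dicts, one per entry."""
        with open(path, encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f, fieldnames=FIELDS, delimiter="\t"))

    def write_lexicon(path, entries):
        """Write entries back out as tab-delimited text."""
        with open(path, "w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
            writer.writerows(entries)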


