Lexicon: Corpus specification/structure
In creating an electronic lexicon, one must be clear about two closely related
points: what the lexicon will be used for, and what kind of words it should cover.
If the lexicon is intended for one particular corpus and nowhere
else, one can start the project with a list of entries obtained by
cutting the corpus into words and sorting them. However, some languages
pose segmentation problems of varying degrees. For example, languages
such as Chinese and Japanese do not put spaces between words, and therefore a set
of segmentation principles is necessary. For these languages, it is easier to start
with an existing electronic wordlist, to the extent that copyright restrictions
allow, than to create a lexicon from scratch. Additional entries from the
corpus can be checked against the wordlist for wordhood and added to the
lexicon if they are determined to be words. (Even if the language uses spacing,
rich inflection or agglutination may multiply the number of entries.)
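The procedure described above can be sketched in Python. This is only a minimal illustration under stated assumptions: it applies to languages that separate words with spaces, the sample corpus and the existing wordlist are invented for the example, and the function names corpus_word_list and split_by_wordlist are hypothetical. For Chinese or Japanese, a segmenter that follows explicit segmentation principles would have to replace the simple pattern used here.

```python
import re
from collections import Counter

def corpus_word_list(text):
    """Cut a space-delimited corpus into word tokens and count them."""
    # \w+ matches runs of letters, digits, and underscores (Unicode-aware).
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

def split_by_wordlist(candidates, known_words):
    """Separate candidates found in an existing wordlist from those
    that still need a manual wordhood check."""
    known = sorted(w for w in candidates if w in known_words)
    unknown = sorted(w for w in candidates if w not in known_words)
    return known, unknown

# Invented sample data, for illustration only.
corpus = "The cat sat on the mat. The cats sat on mats."
existing = {"the", "cat", "sat", "on", "mat"}

counts = corpus_word_list(corpus)
entries = sorted(counts)  # alphabetical list of candidate entries
known, to_review = split_by_wordlist(entries, existing)
```

Here the inflected forms "cats" and "mats" end up in to_review, which shows on a small scale how inflection multiplies the candidate entries even when the language uses spacing.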
If the coverage is not delimited in advance, the time required often comes
to weigh more and more heavily on the cost of the project. It should
be noted that the vast majority of the word tokens occurring in texts
are covered by the few thousand most common words of the language,
and that the cost rises geometrically when one tries to cover the
rest of the vocabulary.
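One way to decide where to delimit coverage is to measure how many of the most frequent word types are needed to account for a given share of the tokens in a sample. The following is a minimal sketch; the function names coverage_curve and types_needed and the toy token list are assumptions made for the example, not part of any established tool.

```python
from collections import Counter

def coverage_curve(tokens):
    """Return (word, cumulative coverage) pairs, most frequent word first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    curve = []
    for word, freq in counts.most_common():
        covered += freq
        curve.append((word, covered / total))
    return curve

def types_needed(tokens, target):
    """Smallest number of most-frequent word types whose tokens
    reach the target share (0.0 to 1.0) of all tokens."""
    for rank, (_, coverage) in enumerate(coverage_curve(tokens), start=1):
        if coverage >= target:
            return rank
    return len(set(tokens))

# Toy sample: 10 tokens, 4 word types.
tokens = "a a a a b b b c c d".split()
half = types_needed(tokens, 0.5)
nearly_all = types_needed(tokens, 0.95)
```

In this toy sample two word types cover half of the tokens, while all four are needed to reach 95 percent. On a real corpus the same calculation shows how quickly the number of entries grows for each additional point of coverage.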

Related to the purpose of the lexicon is the question
of what information should be provided for each entry. If the lexicon is
intended for machine translation, structured glosses would be more useful
than detailed definitions. A transducer or parser may need specific information
on parts of speech or frequency, depending on its algorithm. A pronunciation
dictionary needs phonetic transcriptions and information on variation, but
glosses are not necessary for most words.

We have to remember that the aim
of an electronic lexicon is different from that of a printed dictionary.
The entries of an electronic lexicon, for example, should contain all orthographical
or inflectional variants, unless they can be derived by other means. Since programming
languages handle simple one- or two-dimensional data structures more easily,
embedded entries of the kind found in printed dictionaries should be avoided in an
electronic lexicon. Information for generating all the derived forms is necessary,
whereas glosses or definitions may not be a top priority. Transliteration
is also necessary if it cannot be uniquely derived from the headword.
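The difference between embedded entries and a flat structure can be illustrated with a short sketch. The field names form, pos, and variants, the function flatten, and the sample entries are all hypothetical; the point is only that every variant becomes a row of its own in a two-dimensional table.

```python
def flatten(nested):
    """Turn headword entries with embedded variants into flat rows
    of the shape (form, headword, part of speech)."""
    rows = []
    for entry in nested:
        # The headword itself is also a form that can occur in text.
        rows.append((entry["form"], entry["form"], entry["pos"]))
        for variant in entry.get("variants", []):
            rows.append((variant, entry["form"], entry["pos"]))
    return rows

# Invented entries in the embedded style of a printed dictionary.
nested = [
    {"form": "go", "pos": "verb",
     "variants": ["goes", "went", "gone", "going"]},
    {"form": "colour", "pos": "noun",
     "variants": ["color", "colours", "colors"]},
]

rows = flatten(nested)
# A flat table can be indexed directly by surface form.
lookup = {form: (headword, pos) for form, headword, pos in rows}
```

With the flat rows, a program can look up "went" or "color" directly without first knowing under which headword the form is embedded.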

A good lexicon should be machine-independent so that it can easily be adapted to any
operating system and/or application. Tab-delimited text files are usually
preferred. Emacs with a multilingual extension is one of the best editors
for this purpose.
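As an illustration of the tab-delimited format, the sketch below writes a few entries and reads them back with the csv module of the Python standard library. The column layout in FIELDS and the sample rows are assumptions made for the example; an in-memory buffer stands in for a file.

```python
import csv
import io

# Hypothetical column layout; a real lexicon would choose its own fields.
FIELDS = ["form", "lemma", "pos", "gloss"]

def write_lexicon(rows, stream):
    """Write entries as tab-delimited text with a header line."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

def read_lexicon(stream):
    """Read tab-delimited entries back into a list of dictionaries."""
    return list(csv.DictReader(stream, delimiter="\t"))

rows = [
    {"form": "went", "lemma": "go", "pos": "verb", "gloss": "move"},
    {"form": "猫", "lemma": "猫", "pos": "noun", "gloss": "cat"},
]

buffer = io.StringIO()  # stands in for a file on disk
write_lexicon(rows, buffer)
buffer.seek(0)
loaded = read_lexicon(buffer)
```

When writing to a real file, opening it with encoding="utf-8" and newline="" keeps the output identical across operating systems, which is in line with the aim of machine independence.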