This corpus contains ASCII versions of the CELEX lexical databases of
English (Version 2.5), Dutch (Version 3.1) and German (Version 2.0).
CELEX was developed as a joint enterprise of the University of
Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck
Institute for Psycholinguistics in Nijmegen, and the Institute for
Perception Research in Eindhoven. Pre-mastering and CD-ROM production
were done by the LDC.
For each language, this CD-ROM contains detailed information on:
- orthography (variations in spelling, hyphenation)
- phonology (phonetic transcriptions, variations in pronunciation,
syllable structure, primary stress)
- morphology (derivational and compositional structure,
inflectional paradigms)
- syntax (word class, word class-specific subcategorizations,
argument structures)
- word frequency (summed word and lemma counts, based on recent and
representative text corpora)
The databases have not been tailored to fit any particular
database management program. Instead, the information is in ASCII
files in a UNIX directory tree that can be queried with tools such as
AWK or ICON. Unique identity numbers allow the linking of information
from different files. Some kinds of information have to be computed
online; wherever necessary, AWK functions have been provided to
recover this information. README files specify the details of their
use.
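
As a rough illustration of how identity numbers can link information from
different files, the sketch below joins two small tables on their first
field. The sample records, the field layout, the backslash separator and the
helper names (parse_table, link_tables) are all assumptions made for this
example; the README files on the CD-ROM remain the authority on the actual
file formats.

```python
# Hypothetical sample data: the field layout and the backslash separator
# are assumptions for illustration, not copied from the CELEX files.
ORTHOGRAPHY = """1\\abandon\\12
2\\abbey\\3
3\\able\\40"""

SYNTAX = """1\\V
2\\N
3\\A"""


def parse_table(text, sep="\\"):
    """Map each record's identity number (first field) to its other fields."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        table[int(fields[0])] = fields[1:]
    return table


def link_tables(left, right):
    """Join two tables on the identity numbers they share (inner join)."""
    return {idnum: left[idnum] + right[idnum]
            for idnum in left if idnum in right}


orth = parse_table(ORTHOGRAPHY)
syn = parse_table(SYNTAX)
linked = link_tables(orth, syn)

for idnum in sorted(linked):
    print(idnum, linked[idnum])
```

Records whose identity number appears in only one of the two tables are
dropped by this join; a different policy may suit other queries.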
A detailed User Guide describing the various kinds of lexical
information available is supplied. All sections of this guide are
PostScript files, except for some additional notes on the German
lexicon, which are in plain ASCII.
CELEX-2
The second release of CELEX contains an enhanced, expanded version of the German lexical
database (2.5), featuring approximately 1,000 new lemma entries, revised morphological
parses, verb argument structures, inflectional paradigm codes and a corpus type lexicon.
A complete PostScript version of the Germanic Linguistic Guide is also included, in both
European A4 format and American Letter format. For German, the total number of lemmas
included is now 51,728, and their inflected forms number 365,530.
Moreover, phonetic syllable frequencies have been added for (British) English and Dutch.
Apart from this, and the provision of frequency information alongside every lexical
feature, no changes have been made to the Dutch and English lexicons.
Complete AWK scripts are now provided to compute representations not found in the (plain
ASCII) lexical data files, corresponding to the features described in the CELEX User
Guide, which is included on the CD as well.
For each language, i.e. English, German and Dutch, the CD-ROM contains detailed
information on the orthography (variations in spelling, hyphenation), the phonology
(phonetic transcriptions, variations in pronunciation, syllable structure, primary
stress), the morphology (derivational and compositional structure, inflectional
paradigms), the syntax (word class, word-class specific subcategorisation, argument
structures) and word frequency (summed word and lemma counts, based on recent and
representative text corpora) of both wordforms and lemmas. Unique identity numbers allow
the linking of information from different files with the aid of an efficient, index-based
C-program.
Like its predecessor, the CD-ROM is mastered using the ISO 9660 data format, with the Rock
Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh and UNIX environments.
As the new release does not omit any data from the first edition, the current release will
replace the old one.