"Using Relational Databases to Create Unlimited and User-Defined
Annotation on Large Corpora: A 100 Million Word Corpus of
Historical and Modern Spanish"
Mark Davies, Illinois State University
As part of a project that is funded by the National Endowment for the
Humanities, I am developing an online, searchable, 100 million word
corpus of historical and modern Spanish texts
(http://mdavies.for.ilstu.edu.corpus).
In addition to basic searches by word forms, users will also be able to
carry out complex searches by part of speech, lemma, synonyms, and
frequency information. Examples of these might be:
- the frequency and usage of all synonyms of triunfar (“to win/triumph”)
  in each of the centuries from the 1200s-1900s
- all strings involving < [cl] PODER/DEBER [inf] > (a clitic followed by
  a form of poder or deber, followed by an infinitive) that occur more
  than three times in the 1500s or 1600s
- examples of all nouns deriving from Arabic that occur in the context
  < [DET] * [lemma=rico] > at least two times in the 1200s or 1300s
To permit such a wide range of searches, the corpus is
composed of two distinct levels. One level consists of
the actual 100 million words of text, whose only annotation is a code
identifying the source text, and which is indexed via SQL Server
“full-text” indexing. The output from this corpus is in KWIC
format, and allows traditional sorting and limiting by left and right
contextual elements.
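
The KWIC (keyword in context) output described above can be illustrated with a minimal sketch. This is not the SQL Server full-text implementation used in the project. It is a toy in-memory version, and the function names `kwic` and `sort_by_right` as well as the sample sentence are assumptions made purely for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, node, right) tuples for every occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]       # up to `width` words before the node
            right = tokens[i + 1:i + 1 + width]      # up to `width` words after the node
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


def sort_by_right(hits):
    """Sort concordance lines alphabetically by their right-hand context."""
    return sorted(hits, key=lambda h: h[2].lower())


# Invented sample text, not taken from the corpus.
text = "no lo puede hacer y no lo debe decir porque lo puede perder"
tokens = text.split()
lines = sort_by_right(kwic(tokens, "lo", width=2))

for left, node, right in lines:
    print(f"{left:>15} | {node} | {right}")
```

Sorting by the left context instead would only require changing the sort key to the first element of each tuple.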
The more important level of “annotation”, however, consists of very large
databases that contain every distinct one-, two-, and three-word cluster in
the corpus. Information from other databases has been merged with
these databases to provide information on the part of speech and lemma
for each of these distinct words and strings. This main database is
also linked to other auxiliary databases that contain synonyms and
dictionary entries containing etymologies and other word-level
information. Any number of other supplementary databases can easily
be integrated as well, including user-generated databases of specific
words and phrases.
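
As a rough sketch of how such linked databases might answer a query like < [cl] PODER/DEBER [inf] >, the following uses Python's built-in sqlite3 module in place of SQL Server. The table names (`lexicon`, `trigrams`), column names, part-of-speech codes, and sample rows are all hypothetical, and the sketch assumes for simplicity that each word form has exactly one lemma and one part of speech.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table of distinct three-word clusters and one
# word-level table carrying the lemma and part of speech of each form.
conn.executescript("""
CREATE TABLE lexicon (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT);
""")

conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
    ("lo", "lo", "cl"),
    ("puede", "poder", "v"),
    ("debe", "deber", "v"),
    ("quiere", "querer", "v"),
    ("hacer", "hacer", "inf"),
    ("decir", "decir", "inf"),
])
conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?)", [
    ("lo", "puede", "hacer"),
    ("lo", "debe", "decir"),
    ("lo", "quiere", "hacer"),
])

# Each slot of the cluster is joined to the lexicon, so the pattern can mix
# part-of-speech constraints (slots 1 and 3) with a lemma constraint (slot 2).
rows = conn.execute("""
SELECT t.w1, t.w2, t.w3
FROM trigrams AS t
JOIN lexicon AS a ON a.word = t.w1
JOIN lexicon AS b ON b.word = t.w2
JOIN lexicon AS c ON c.word = t.w3
WHERE a.pos = 'cl'
  AND b.lemma IN ('poder', 'deber')
  AND c.pos = 'inf'
ORDER BY t.w2
""").fetchall()

print(rows)
```

A synonym or etymology table could be joined in the same way, keyed on the word or lemma column.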
The crucial point is that all of this information (including frequencies
for each distinct word and phrase in each of the centuries, 1200s-1900s)
is already stored in the underlying databases, which are then linked to
the corpus itself. It would have been extremely difficult to annotate
the 100 million word corpus itself a priori
with all of this information, especially the frequency counts. In
addition, it probably would not have been possible to search the entire
100 million word corpus in two or three seconds, which is the speed
achieved with my approach.
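
The benefit of storing frequencies in advance can be sketched as follows, again with Python's sqlite3 module standing in for SQL Server. The table `trigram_freq`, its columns, and every count in it are invented for illustration and are not figures from the corpus. The point is only that a condition such as "more than three times in the 1500s or 1600s" becomes a lookup on precomputed rows rather than a scan of the running text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per distinct cluster per century, with the
# frequency counted once when the database is built.
conn.execute("""
CREATE TABLE trigram_freq (
    trigram TEXT,
    century INTEGER,   -- 1500 stands for the 1500s, and so on
    freq    INTEGER,
    PRIMARY KEY (trigram, century)
)
""")

# Toy counts, invented for this example.
conn.executemany("INSERT INTO trigram_freq VALUES (?, ?, ?)", [
    ("lo puede hacer", 1500, 5),
    ("lo puede hacer", 1600, 2),
    ("lo debe decir", 1500, 1),
    ("lo debe decir", 1600, 3),
    ("lo debe decir", 1800, 9),
])

# Clusters occurring more than three times in the 1500s or the 1600s.
rows = conn.execute("""
SELECT trigram, century, freq
FROM trigram_freq
WHERE century IN (1500, 1600) AND freq > 3
ORDER BY trigram, century
""").fetchall()

print(rows)
```

Only the clusters that pass this filter would then need to be looked up in the full-text index to retrieve their contexts.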