"Using Relational Databases to Create Unlimited and User-Defined Annotation on Large Corpora: A 100  Million Word Corpus of Historical and Modern Spanish"

Mark Davies, Illinois State University

As part of a project funded by the National Endowment for the Humanities, I am developing an online, searchable, 100 million word corpus of historical and modern Spanish texts (http://mdavies.for.ilstu.edu.corpus).  In addition to basic searches by word form, users will also be able to carry out complex searches by part of speech, lemma, synonym, and frequency.  Examples of such searches include:
·       the frequency and usage of all synonyms of triunfar (“to win/triumph”) in each of the centuries from the 1200s-1900s
·       all strings involving < [cl] PODER/DEBER [inf] > (a clitic followed by a form of poder or deber, followed by an infinitive) that occur more than three times in the 1500s or 1600s
·       examples of all nouns deriving from Arabic that occur in the context < [DET] * [lemma=rico] > at least two times in the 1200s or 1300s

In order to permit such a wide range of searches, the corpus is organized into two distinct levels.  One level consists of the actual 100 million words of text, whose only annotation is a code identifying the source text, and which is indexed via SQL Server “full-text” indexing.  The output from this level is in KWIC (keyword in context) format, and allows traditional sorting and limiting by left and right contextual elements.
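
As a rough illustration of KWIC output with sorting by right-hand context, the following is a minimal sketch in Python.  It is not the SQL Server full-text implementation described above; the function names kwic and sort_by_right, the width parameter, and the sample sentence are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, hit, right) tuples for every occurrence of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `width` tokens of context on each side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def sort_by_right(rows):
    """Sort concordance lines by their right-hand context."""
    return sorted(rows, key=lambda r: r[2].lower())


# Hypothetical sample text, tokenized on whitespace.
tokens = "no lo puedo hacer hoy pero lo puedo intentar".split()
rows = sort_by_right(kwic(tokens, "puedo", width=2))
for left, hit, right in rows:
    print(f"{left:>12} [{hit}] {right}")
```

Sorting by the left context instead would only require changing the sort key to the first element of each tuple.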

The more important level of “annotation”, however, consists of very large databases that contain every distinct one-, two-, and three-word cluster in the corpus.  Information from other databases has been merged with these databases to provide the part of speech and lemma for each of these distinct words and strings.  This main database is also linked to auxiliary databases that contain synonyms and dictionary entries with etymologies and other word-level information.  Any number of other supplementary databases can easily be integrated as well, including user-generated databases of specific words and phrases.
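
The idea of joining a table of distinct three-word clusters to word-level information can be sketched as follows, using Python's built-in sqlite3 module rather than SQL Server.  The table names (trigrams, lexicon), the column names, the tag values ('cl', 'inf'), and the sample rows are hypothetical, and the sketch assumes one part of speech per word form, which a real lexicon would not guarantee.  The query corresponds to the pattern < [cl] PODER/DEBER [inf] >.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lexicon (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT);
""")

# Hypothetical word-level annotation: form, lemma, part of speech.
con.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
    ("lo", "lo", "cl"),
    ("se", "se", "cl"),
    ("puedo", "poder", "v"),
    ("debe", "deber", "v"),
    ("quiero", "querer", "v"),
    ("hacer", "hacer", "inf"),
    ("decir", "decir", "inf"),
    ("casa", "casa", "n"),
])

# Hypothetical distinct three-word clusters from the corpus.
con.executemany("INSERT INTO trigrams VALUES (?, ?, ?)", [
    ("lo", "puedo", "hacer"),
    ("se", "debe", "decir"),
    ("lo", "quiero", "hacer"),
    ("la", "casa", "grande"),
])

# Clitic + form of poder/deber + infinitive: one lexicon join per slot.
query = """
SELECT t.w1, t.w2, t.w3
FROM trigrams AS t
JOIN lexicon AS a ON a.word = t.w1
JOIN lexicon AS b ON b.word = t.w2
JOIN lexicon AS c ON c.word = t.w3
WHERE a.pos = 'cl'
  AND b.lemma IN ('poder', 'deber')
  AND c.pos = 'inf'
ORDER BY t.w1, t.w2, t.w3
"""
matches = con.execute(query).fetchall()
print(matches)
```

Because the constraints are applied to the list of distinct clusters rather than to the running text, the query never has to scan the corpus itself.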

The crucial point is that all of this information (including frequencies for each distinct word and phrase in each of the centuries, 1200s-1900s) is already stored in the underlying databases, which are then linked to the corpus itself.  It would have been extremely difficult to annotate the 100 million word corpus itself (a priori) with all of this information, especially the frequency counts.  In addition, it probably would not have been possible to search the entire 100 million word corpus in two or three seconds, which is the speed achieved with my approach.
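
To illustrate why precomputed frequencies make such searches fast, here is a minimal sketch, again using Python's sqlite3 module as a stand-in.  The ngram_freq table, its columns, the helper frequent_in, and the sample counts are assumptions for the example, not the actual schema or data of the corpus.  It answers a question of the form "which strings occur more than three times in the 1500s or 1600s" with a single indexed lookup.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE ngram_freq (
    ngram   TEXT,
    century INTEGER,   -- 1500 stands for the 1500s, and so on
    freq    INTEGER,   -- count computed once, when the database is built
    PRIMARY KEY (ngram, century)
)
""")

# Hypothetical precomputed counts.
con.executemany("INSERT INTO ngram_freq VALUES (?, ?, ?)", [
    ("lo puedo hacer", 1500, 5),
    ("lo puedo hacer", 1600, 2),
    ("se debe decir", 1500, 1),
    ("se debe decir", 1600, 4),
    ("lo debe fazer", 1300, 7),
])


def frequent_in(con, centuries, min_freq):
    """Return (ngram, century, freq) rows with freq > min_freq in the given centuries."""
    placeholders = ",".join("?" for _ in centuries)
    sql = (
        "SELECT ngram, century, freq FROM ngram_freq "
        f"WHERE century IN ({placeholders}) AND freq > ? "
        "ORDER BY ngram, century"
    )
    return con.execute(sql, [*centuries, min_freq]).fetchall()


hits = frequent_in(con, [1500, 1600], 3)
print(hits)
```

The counting work is done when the database is built, so at query time the cost depends on the number of distinct strings, not on the 100 million running words.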