One view holds that a corpus, with its range of language occurrences, is proportional to the full range of occurrences in the language itself. This point of view, and the problems connected with it, is gaining in importance with the growing scientific focus on the quantitative evaluation of corpora. The difficulties were already discussed in the late sixties, when the question was how to keep corpora reasonably economical in the face of limited computing power. In the late seventies the concept of "representative corpora" was criticized (cf. B. Rieger: Repräsentativität. Von der Unangemessenheit eines Begriffs zur Kennzeichnung eines Problems linguistischer Korpusbildung, in: H. Bergenholz / B. Schäder: Empirische Textwissenschaft, Königstein 1979, pp. 52ff.), and later the concept of the "balance of the corpus" was developed (cf. e.g. J. Sinclair: Corpus, Concordance, Collocation, Oxford 1991, pp. 13ff.).
We follow our own approach. Every corpus has a property that we call, for now, the "degree of saturation". It is determined as follows: using your corpus, calculate a statistic for a randomly chosen language occurrence; then add another text to the corpus and repeat the calculation. The degree of saturation increases as the variation in the statistic decreases. Once the statistical results no longer change when the corpus is expanded, there is no further need to expand it.
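The procedure can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of statistic (relative frequency of one word), the whitespace tokenization, and the names `saturation_trace`, `is_saturated`, `window` and `tolerance` are all assumptions made for illustration only.

```python
def relative_frequency(tokens, word):
    """Share of `word` among all tokens (0.0 for an empty corpus)."""
    return tokens.count(word) / len(tokens) if tokens else 0.0


def saturation_trace(texts, word):
    """Recompute the statistic after each added text.

    Returns the list of successive values, one per corpus size.
    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    corpus = []
    trace = []
    for text in texts:
        corpus.extend(text.lower().split())
        trace.append(relative_frequency(corpus, word))
    return trace


def is_saturated(trace, window=3, tolerance=0.01):
    """True if each of the last `window` changes is below `tolerance`.

    `window` and `tolerance` are hypothetical thresholds; the source
    does not specify when a result counts as invariant.
    """
    if len(trace) <= window:
        return False
    recent = trace[-(window + 1):]
    return all(abs(b - a) < tolerance for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    texts = [
        "the cat sat",
        "the dog ran",
        "the bird flew",
        "the fish swam",
        "the horse ate",
    ]
    trace = saturation_trace(texts, "the")
    print(trace)
    print(is_saturated(trace))
```

In practice one would probably track several statistics at once and pick the thresholds empirically; a single word frequency is used here only to keep the sketch short.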
This property of corpora has also been described by Shannon's entropy theorem (cf. C. Shannon, W. Weaver: The Mathematical Theory of Communication, Urbana 1949; F. Bauer, G. Goos: Informatik, Eine einführende Übersicht, Berlin, New York, Heidelberg 1971) and is widely applicable. For example, it can replace the classical term "representative", which is hard to define and to work with. It also defines the minimum size a corpus must have before one can expect reasonable statistical results.
On this basis we regard the "virtual corpus", as defined here, as a new method that can be of good use for the statistical analysis of language.