Discovering and Testing Linguistic Generalizations
Using Interactive Concordances

Larry Hayashi, SIL International

SIL International
7500 W. Camp Wisdom Rd, Dallas, TX USA 75137
larry_hayashi@sil.org | www.sil.org


While studying empirical methods in linguistics, I was taught: "If it happens once, you don't know anything. If it happens twice, it suggests further investigation. If it happens three or more times, then you have something to write about!" Finding multiple data occurrences that substantiate your claims is therefore an essential part of analytical rigor. Concordance tools provide the means to do just that. The term concordance usually refers to an alphabetical index of all the words in a text or corpus of texts, showing every contextual occurrence of each word. Here, I extend the term to include indexes of more than just words: you can "concord" on any potentially recurring object in the corpus or its analyses (examples: part-of-speech, gloss, syntactic construction, lemma, morpheme, string of characters, case role, phone, etc.). Concordances are essentially filters or queries. What is unique to them is that they are applied to a corpus of texts rather than to some higher-level analysis done by the linguist (for example, the lexicon and all of its entries). This characteristic is what makes the concordance invaluable for empirical linguistics.
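
As a rough illustration of a concordance as a filter over a corpus, the following minimal Python sketch concords on any annotation layer, not just the word form. The `Token` fields, the `concord` function, and the two-sentence corpus are all hypothetical, invented for this example; they do not reflect the data model of any SIL tool.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface word form
    pos: str    # part-of-speech tag (hypothetical annotation layer)
    gloss: str  # gloss (hypothetical annotation layer)


def concord(corpus, layer, value, width=2):
    """List every occurrence whose `layer` equals `value`, with context.

    Each hit is (text id, position, left context, matched form, right context).
    """
    hits = []
    for text_id, tokens in corpus.items():
        for i, tok in enumerate(tokens):
            if getattr(tok, layer) == value:
                left = " ".join(t.form for t in tokens[max(0, i - width):i])
                right = " ".join(t.form for t in tokens[i + 1:i + 1 + width])
                hits.append((text_id, i, left, tok.form, right))
    return hits


# Toy corpus, assumed for illustration only.
corpus = {
    "text1": [Token("I", "PRON", "1SG"), Token("run", "VERB", "run"),
              Token("daily", "ADV", "daily")],
    "text2": [Token("a", "DET", "INDEF"), Token("long", "ADJ", "long"),
              Token("run", "NOUN", "run")],
}

by_form = concord(corpus, "form", "run")   # concord on a word form
by_pos = concord(corpus, "pos", "NOUN")    # concord on part-of-speech
```

The same query function serves both cases because the filter is applied to the corpus tokens themselves; only the layer being matched changes.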

Historically, concordances have been printed indexes of well-known literature such as the Bible or the works of Shakespeare. Similarly, most traditional computational concordance tools search a given corpus of texts for a particular phenomenon and then generate a separate, static file listing the occurrences and any relevant associated data (reference, interlinear annotations). In contrast, using current relational and object-oriented databases, a concordance can be a view of the corpus data instances themselves rather than copies collected in a separate file. This has a number of significant advantages for empirical linguistic analysis, including:

  1. the ability to easily jump to the broader context of each data instance.
  2. the ability to edit the corpus and its analyses and have those changes immediately reflected in any relevant concordances, so that the linguist gets more immediate feedback when testing hypotheses.
  3. the possibility to tag data as a collection. Because the concordance consists of a collection of the data instances themselves, the linguist has direct access to the data and can apply any tags to the data in the concordance. For example, if I posit that there is a noun sense and a verb sense of the English word 'run', I should be able to bring up a list of all text occurrences of the word 'run' in order to see the variety of contexts of this word and to "cognitively" verify whether both these senses exist. After bringing up the list of occurrences, I should then be able to interact with those occurrences. In particular, I should be able to easily tag those occurrences that instantiate the noun sense and those that instantiate the verb sense. Traditionally, tagging tools and concordance tools have been separate. Using current technologies, this separation need no longer exist.
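
The three advantages listed above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the `Occurrence` and `Concordance` classes are hypothetical names, an in-memory list stands in for the database, and the `where` selector stands in for the linguist's manual choice of which occurrences to tag.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrence:
    form: str
    text_id: str    # together with position, lets a tool jump to the context
    position: int
    tags: set = field(default_factory=set)


class Concordance:
    """A live query over the corpus: hits are the corpus objects, not copies."""

    def __init__(self, corpus, predicate):
        self.corpus = corpus
        self.predicate = predicate

    def hits(self):
        # Re-evaluated on every call, so corpus edits show up immediately.
        return [occ for occ in self.corpus if self.predicate(occ)]

    def tag(self, label, where=lambda occ: True):
        # Tagging a hit tags the corpus instance itself.
        for occ in self.hits():
            if where(occ):
                occ.tags.add(label)


# Toy corpus, assumed for illustration only.
corpus = [
    Occurrence("run", "text1", 1),
    Occurrence("daily", "text1", 2),
    Occurrence("run", "text2", 2),
]

runs = Concordance(corpus, lambda occ: occ.form == "run")
runs.tag("sense:verb", where=lambda occ: occ.text_id == "text1")
runs.tag("sense:noun", where=lambda occ: occ.text_id == "text2")

# Editing the corpus is reflected in existing and new concordances alike.
corpus.append(Occurrence("run", "text3", 0))
untagged = Concordance(corpus, lambda occ: occ.form == "run" and not occ.tags)
```

Because `hits()` returns the corpus objects rather than copies, the tags applied through `runs` are visible to `untagged`, which then lists only the newly added occurrence that still awaits analysis.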

My presentation will focus on how a number of SIL tools employ the power of interactive concordances. LinguaLinks (http://www.sil.org/lingualinks) uses a robust object-oriented data model to provide easy, interactive concordance creation on a number of different linguistic objects. Speech Manager (http://www.sil.org/computing/speechtools/speechmanager.htm) uses a relational data model to provide interactive concordances for phonological analysis. We are currently working on a successor to LinguaLinks called FieldWorks that will provide better performance and a more complete data model for morphology, syntax, and discourse analysis.

References

Barlow, Michael. Web site: Corpus Linguistics. http://www.ruf.rice.edu/~barlow/corpus.html. Includes a list of various text corpora available for research as well as a list of concordance tools.

Simons, Gary F. 1994. Conceptual modeling versus visual modeling: a technological key to building consensus. SIL. http://www.sil.org/cellar/ach94/ach94.html.


Linguistic Exploration Workshop