What's needed for linguistic databases? Experiences with Kirrkirr

Christopher Manning and Kristen Parton

In recent years, there has been much work on semi-structured data and databases for such data (inter alia, Abiteboul et al. 1999). However, the term 'semi-structured data' spans a continuum from completely structured data, which people have simply chosen to encode in XML, through moderately structured data, to quite unstructured, often textual, data. Linguistic databases, for both good and bad reasons, tend to be at this unstructured end. Unfortunately for us, most of the research on semi-structured databases has focused on the quite structured end (McHugh et al. 1997, Florescu and Kossman 1999), with only limited work aimed at text databases (Rizzolo and Mendelzon 2001). The crucial, insufficiently addressed observation is that in quite unstructured databases, the content of fields is also likely to be quite free-form, and conventional indices are of limited use.

In this paper, we relate the general issues to our experiences with Kirrkirr, a dictionary interface for indigenous languages aimed at unsophisticated users. Although the project has experimented with XML databases and query languages, particularly the GMD IPSI XQL engine (Jansz et al. 2000), the options we are aware of do not offer convincing advantages across speed, functionality, and memory footprint, and the current version does not use a database with a query language. The project extensively uses XML, DOM, XPath, and XSLT, but most ad hoc querying is implemented by regular expressions over raw text data. In-memory ad hoc indices over two fields (words and glosses) are used for speed.

Why do we do this? The Warlpiri dictionary with which we mainly work is larger than most indigenous language dictionaries, and at 10 MB is also larger than some of the standardly used databases for benchmarking semi-structured data (DBLP, IMDB).
Even so, although indexing would give faster performance (Rizzolo and Mendelzon 2001), people can easily wait the 2 seconds it takes us to grep on a modern computer. Most of our queries are primarily aimed at textual content, delimited by XML elements, with simple intersection or alternation, rather than complex join conditions or extensive use of path expressions. More importantly, most of the queries we ask cannot be answered by simple text indices: we extensively use morphological analysis, regular expressions, and substring matching to do fuzzy spelling, online morphological parsing, and intuitive substring matching, and to make a Warlpiri-English dictionary look as if it is an English-Warlpiri dictionary on the fly. These are essential features for making the system usable when the data is quite unstructured textual data, but they fall outside what XQL, XQuery, or other XML query languages provide for us. This situation is regrettable: we would prefer to use a database with an appropriate query language, and we note that some databases, such as MySQL if not standard SQL, do support regular expressions, but at present query languages do not seem to adequately provide the functionality of text processing techniques.
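The style of querying described above can be illustrated with a minimal Python sketch. It is not Kirrkirr's implementation: the element names (ENTRY, HW, GL), the toy entries, and the fuzzy-spelling rules (r/rr and u/oo treated as interchangeable) are all assumptions invented for illustration. The sketch shows a regular-expression query over raw XML text, in-memory indices over headwords and gloss words (the latter giving a reversed, gloss-to-headword view), and a fuzzy spelling pattern.

```python
import re
from collections import defaultdict

# Toy data: element names and entries are invented, not real dictionary content.
RAW = """<DICTIONARY>
<ENTRY><HW>kata</HW><GL>small stone</GL></ENTRY>
<ENTRY><HW>katarra</HW><GL>stone axe</GL></ENTRY>
<ENTRY><HW>miru</HW><GL>river</GL><GL>creek</GL></ENTRY>
</DICTIONARY>
"""

ENTRY_RE = re.compile(r"<ENTRY>(.*?)</ENTRY>", re.DOTALL)
HW_RE = re.compile(r"<HW>(.*?)</HW>")
GL_RE = re.compile(r"<GL>(.*?)</GL>")


def entries(raw):
    """Yield (headword, glosses, entry_text) for each entry in the raw text."""
    for match in ENTRY_RE.finditer(raw):
        text = match.group(1)
        headword = HW_RE.search(text).group(1)
        yield headword, GL_RE.findall(text), text


def grep_headwords(raw, pattern):
    """Ad hoc query: headwords of entries whose raw text matches a regex."""
    rx = re.compile(pattern)
    return [hw for hw, _, text in entries(raw) if rx.search(text)]


def build_indices(raw):
    """Build in-memory indices over two fields: headwords and gloss words."""
    by_word = {}
    by_gloss = defaultdict(set)
    for hw, glosses, _ in entries(raw):
        by_word[hw] = glosses
        for gloss in glosses:
            for token in re.findall(r"[a-z]+", gloss.lower()):
                by_gloss[token].add(hw)  # reversed view: gloss word -> headwords
    return by_word, by_gloss


def fuzzy_pattern(spelling):
    """Regex for a user's spelling, tolerating r/rr and u/oo (assumed rules)."""
    s = spelling.lower().replace("rr", "r").replace("oo", "u")
    parts = []
    for ch in s:
        if ch == "r":
            parts.append("rr?")
        elif ch == "u":
            parts.append("(?:u|oo)")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


by_word, by_gloss = build_indices(RAW)
stone_hits = grep_headwords(RAW, r"<GL>[^<]*stone")
fuzzy_hits = grep_headwords(RAW, "<HW>" + fuzzy_pattern("katara") + "</HW>")
```

Here `stone_hits` holds the headwords whose glosses mention "stone", `by_gloss["creek"]` answers a reversed lookup without a scan, and `fuzzy_hits` finds `katarra` from the misspelling `katara`. The last query is the kind that a plain exact-match index cannot answer directly, which is the point made in the text.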