Interoperable Extensible Linguistic Databases

Angelo Dalli (adal002@um.edu.mt)
Maltilex Project, Department of CS & AI, University of Malta (http://mlex.cs.um.edu.mt)

Linguistic databases currently available for research and development form a heterogeneous collection of proprietary databases with minimal means, if any, of interoperating with other linguistic databases, which makes it hard to extend their usefulness beyond the life of their originating projects [1]. This paper discusses an interoperable, extensible linguistic database system developed for the Maltilex Project at the University of Malta [2].

Relational database technology is used to create a core set of tables that define a lexicon together with basic information on the words in the lexicon. Database population is performed by a weakly supervised machine learning technique that automatically groups word forms under one or more lemmas. The main advantage of using lemmas, rather than individual word forms, as the basic unit of reference is enhanced flexibility. Every lemma can be assigned different semantic relationships, and can optionally store some individual word forms explicitly while generating the remaining word forms that conform to a known rule. The core can be extended indefinitely through extension tables, each related in some manner to the lexicon core. A special extension API is used to register and create new additions to the linguistic database.

Interoperability is achieved through two means: flexible XML export methods and a Simple Object Access Protocol (SOAP) based server that uses the Web Services Description Language (WSDL) [3, 4]. The core table fields can be mapped almost directly to the XCES data representation standard defined by the EAGLES-ISLE projects [5]. Transformation tables are used to convert the linguistic data, stored efficiently in the form of relational database records, to an XML file.
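
The lemma-centred core and the filtered XML export described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `lemma` and `word_form` tables, their columns, the sample rows, and the `lexicon`/`entry`/`form` element names are hypothetical and are neither the Maltilex schema nor XCES markup. The sketch uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical core schema: table and column names are illustrative
# assumptions, not the actual Maltilex tables.
SCHEMA = """
CREATE TABLE lemma (
    id       INTEGER PRIMARY KEY,
    headword TEXT NOT NULL,
    pos      TEXT NOT NULL
);
CREATE TABLE word_form (
    id       INTEGER PRIMARY KEY,
    lemma_id INTEGER NOT NULL REFERENCES lemma(id),
    form     TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO lemma (id, headword, pos) VALUES (?, ?, ?)",
        [(1, "kiteb", "verb"), (2, "dar", "noun")],
    )
    conn.executemany(
        "INSERT INTO word_form (lemma_id, form) VALUES (?, ?)",
        [(1, "kiteb"), (1, "kitbet"), (2, "dar"), (2, "djar")],
    )
    return conn


def export_lexicon(conn, pos):
    """Select lemmas by part of speech in SQL, then serialise them as XML.

    The element and attribute names are placeholders, not XCES.
    """
    root = ET.Element("lexicon")
    lemmas = conn.execute(
        "SELECT id, headword, pos FROM lemma WHERE pos = ? ORDER BY id",
        (pos,),
    ).fetchall()
    for lemma_id, headword, lemma_pos in lemmas:
        entry = ET.SubElement(root, "entry", {"lemma": headword, "pos": lemma_pos})
        forms = conn.execute(
            "SELECT form FROM word_form WHERE lemma_id = ? ORDER BY id",
            (lemma_id,),
        )
        for (form,) in forms:
            ET.SubElement(entry, "form").text = form
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(export_lexicon(build_demo_db(), "noun"))
```

In this sketch the filtering happens in the SQL `WHERE` clause, so only the selected records are ever turned into XML elements.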
Standard SQL selection queries can be used to filter the export data efficiently prior to conversion to XML. Alternatives such as XSL can transform XCES data into other XML formats as needed, but generally incur large performance penalties on very large amounts of data.

The SOAP server provides XML-based interactions between different linguistic databases and systems over HTTP. Data records can be imported and exported in XML format and converted into efficient relational records transparently. Server-side processing can also be used to reduce the load on the client. WSDL is used to describe the services provided by the linguistic database system in a standard manner, significantly reducing the development time for new clients and analysis programs. A sample standard WSDL definition for linguistic database systems is presented. Such a standard would enable interoperability between different projects to be realised with minimal effort. Recent developments like UDDI will facilitate the development of flexible and secure but easily accessible linguistic databases and processing resources.

References

[1] Cunningham, Hamish. A Definition and Short History of Language Engineering. Journal of Natural Language Engineering, vol. 5, pp. 1-16. Cambridge University Press, 1999.
[2] Rosner, Michael et al. Linguistic and Computational Aspects of Maltilex. Proceedings of the ATLAS Symposium, Tunis, May 1999.
[3] Box, Don et al. Simple Object Access Protocol (SOAP) 1.1. W3C Note, May 2000. http://www.w3.org/TR/SOAP.
[4] Christensen, Erik et al. Web Services Description Language (WSDL) 1.1. W3C Note, March 2001. http://www.w3.org/TR/wsdl.
[5] Expert Advisory Group on Language Engineering Standards (EAGLES). http://www.ilc.pi.cnr.it/EAGLES/home.html. International Standards for Language Engineering (ISLE). http://lingue.ilc.pi.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm.