Combining UML, XML and relational database technologies - the best of all worlds for robust linguistic databases

Larry Hayashi and John Hatton
SIL International

This paper describes aspects of the data modeling, data storage, and retrieval techniques we are using as we develop the FieldWorks suite of applications for linguistic and anthropological research. While the data is stored in an off-the-shelf relational database, the data modeling is done in the Unified Modeling Language (UML), and data can be input and output using XML. Thus FieldWorks benefits from both the maturity of its database engine and the productivity of these formats.

The hierarchical nature of linguistic data lends itself to object-oriented analysis (OOA) techniques. FieldWorks uses object-oriented data models represented in UML. Over the last two decades, some of the major OOA notation methods have converged into UML. As a standard notation, it allows for greater communication between groups that want to share research. The XML Metadata Interchange standard (XMI), a document type definition for storing UML models in XML format, allows groups not only to communicate using a common standard but also to share data models with minimal re-implementation effort.

We create our linguistic data models in a UML tool which stores the model as XMI. From the XMI file, we use standard XML transformations to generate the following products:

* HTML documentation of the models,
* a database schema which is run through a C++ program to create a SQL Server relational database,
* an XML schema definition for validating XML data files to be imported into the generated database, and
* an XDR schema for SQL Server that enables it to output relational data as hierarchical XML.

This XDR schema has annotations describing the mapping between the XML elements and the relational tables. After submitting this schema to SQL Server, we can query the database using either SQL or XPath and get XML documents.
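The idea of reading relational rows back as hierarchical XML and then selecting from them with a path expression can be illustrated with a minimal sketch. This is not the SQL Server XDR mechanism described above: it uses Python's standard sqlite3 and xml.etree.ElementTree modules as stand-ins, and the table names, column names, sample rows, and the build_lexicon helper are all hypothetical, chosen only for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical two-table layout with made-up sample rows;
# the actual FieldWorks schema is not given in the paper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LexEntry (id INTEGER PRIMARY KEY, form TEXT);
    CREATE TABLE LexSense (
        id INTEGER PRIMARY KEY,
        entry_id INTEGER REFERENCES LexEntry(id),
        gloss TEXT
    );
    INSERT INTO LexEntry VALUES (1, 'run'), (2, 'bank');
    INSERT INTO LexSense VALUES
        (10, 1, 'move fast'), (11, 1, 'operate'), (12, 2, 'river edge');
""")

def build_lexicon(conn):
    """Nest each LexSense row under the LexEntry row that owns it."""
    root = ET.Element("Lexicon")
    entries = conn.execute("SELECT id, form FROM LexEntry ORDER BY id")
    for entry_id, form in entries:
        entry = ET.SubElement(root, "LexEntry", id=str(entry_id), form=form)
        senses = conn.execute(
            "SELECT id, gloss FROM LexSense WHERE entry_id = ? ORDER BY id",
            (entry_id,),
        )
        for sense_id, gloss in senses:
            ET.SubElement(entry, "LexSense", id=str(sense_id), gloss=gloss)
    return root

lexicon = build_lexicon(conn)

# Path-style query over the hierarchy instead of a SQL join.
# ElementTree supports only a limited subset of XPath.
senses = lexicon.findall("./LexEntry[@form='run']/LexSense")
glosses = [s.get("gloss") for s in senses]
print(glosses)  # ['move fast', 'operate']
```

In this sketch the nesting logic is hand-written in build_lexicon; in the approach described above, that mapping would instead be declared once in the annotated schema and applied by the database server.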
We then use XSLT style sheets to produce other data formats, including HTML for display and Standard Format for input into our legacy parsing engine.

Our use of the mapping schema is still in experimental stages. Without it, we have found it difficult to rapidly construct applications on top of the FieldWorks system, in part because the developer must think and work at both the relational and object-oriented levels simultaneously. Using the mapping schema, we can use Microsoft's SQL Server XML functionality to query the database using XPath (rather than SQL) and generate hierarchical XML documents. The advantage is that the developer does less translating between object-oriented data and its relational representation. We hope this will lead to faster development.

A further implication is the potential for increased interoperability between tools from different developers. A mapping schema could be made for the FieldWorks linguistic tools that creates XML data usable by other tools. Ideally, XML standards will be developed for lexicons and annotated texts. Data could then be shared among different tools, much in the same way that XMI allows UML data to be used in different modeling tools.

References:

Feldman, B. Unknown. UML FAQS. http://www.uml-zone.com/umlfaq.asp
Simons, Gary F. 1994. Conceptual modeling versus visual modeling: a technological key to building consensus. SIL International. http://www.sil.org/cellar/ach94/ach94.html
Simons, Gary F. 1995. Multilingual data processing in the CELLAR environment. SIL International. http://www.sil.org/cellar/mlingdp/mlingdp.html

Web Resources:

Choosing a UML Tool. http://www.objectsbydesign.com/tools/modeling_tools.html
SQL Server and XML Support. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/architec/8_ar_cs_9oj8.asp
UML Specification. http://www.omg.org/technology/documents/formal/omg_modeling_specifications.htm
XMI Specification. http://www.omg.org/technology/documents/formal/omg_modeling_specifications.htm