XML and the Corpus-Based Dictionary

Development and Implementation of the Pennsylvania Sumerian Dictionary

Steve Tinney, University of Pennsylvania

Table Cells: Lexicon, all cells. Also feasible to derive Word List from this form of lexicon.

Abstract

This paper will discuss problems and solutions of designing an XML document structure for a comprehensive corpus-based dictionary of Sumerian. Issues which must be accommodated include linguistic uncertainties, the broad generic range of materials, a three-thousand-year span of usage, ``disciplinary standards'', and the need for incremental and cooperative development of corpora and dictionary.

The NEH-funded Pennsylvania Sumerian Dictionary Project is in the final stages of defining a document structure which integrates text-corpora (including notation of the manuscripts that comprise them) and lexical descriptions, as well as other necessary data such as bibliography, abbreviation lists, collections of prosopographical data and other realia needed for computational linguistics work on the corpora.

The presentation is weighted towards discussion of the definition of a DTD for a corpus-based dictionary, a snapshot of which is included.


Introduction

The present paper addresses in a preliminary fashion several aspects of interest to Explorations 2000, without meaning to prejudice discussion which is already taking place both in other projects and in other papers announced for this meeting. This paper represents an account of a snapshot in an ongoing development process. In some cases individual DTD files contain further comments on issues of concern.

Three specific data-structures are defined here: a registry for target terms used in corpus tagging; a lexical article type; and a structure for composing a corpus-based dictionary from various building blocks. In some cases this work overlaps with other contributions to Explorations 2000; the material offered here is provided merely as an advance basis for discussion.

The Pennsylvania Sumerian Dictionary Project (PSD)

The PSD was begun in the mid-70's as a long-term project based on traditional lexicographical approaches and with the assumption that it was reasonable to take as long as was necessary to produce a complete, exhaustive multi-volume dictionary of Sumerian of all periods, places and genres. In the last decade new technologies and revised perspectives have overtaken the project, and we have adapted to meet the new demands.

Data and Design

The following factors are among the most important influences of the nature of the data-set on the design of the project:

Additionally, part of the reconception of the project is to frame it as a process rather than a product. Thus, we expect to add continually to the corpora as more texts appear, to revise the dictionary periodically in accordance with improved understanding, and to carry out ongoing semantic research to ensure that the dictionary remains as relevant and useful as possible.

Deliverables and Design

The present immediate goal is to produce within the next 30 months a deliverable consisting of a complete registry of primary and secondary terms in Sumerian with basic equivalences in English and Akkadian, together with extensive corpora and the necessary software to use the lexicon and corpora interactively to explore usage dynamically. This phase of the project is currently supported by the NEH. Following the production of the initial lexicon and tools, we plan to prepare systematically structured lexical articles which serve as a more complete key to the usages evidenced in the corpora.

Several design decisions impact on the choices made for automating the Pennsylvania Sumerian Dictionary:

An extended outline of the user-agent features which in part drive the CBD design is given here.

The Corpus-Based Dictionary

The design constraints outlined briefly above led to the decision to remodel the Pennsylvania Sumerian Dictionary from a series of books to a corpus-based dictionary (CBD).

Such a dictionary is both a presentation form and a research tool. As a macro-structure, it encompasses several other components which can be defined independently. The goal of this structure is not to replace all of the functionality of the components, but rather to organize relevant aspects of them in a tightly defined abstraction which contains only the data necessary for the implementation. Definitions (probably XSLT scripts) of how to generate CBD subsetted data from standardized components (e.g., a list of morphemes subsetted from a grammatical description) should be part of the repository but can only be defined after the adoption of the component datasets.

The CBD is a view of possibly much more complex datasets which is not intended to fulfill the same functions as the original data. It is intended to fix (perhaps only temporarily or intermittently) moving-target corpora and expose a restricted set of tags to application-writers who wish to manipulate a dictionary without having to handle more generalized or idiosyncratically targeted tagsets.

By design, escapes should be provided from various places in the CBD back to the component information realms from which the CBD's data subsets are extracted. Thus, the CBD may also serve as a kind of portal to datasets and research tools developed for specific research domains (e.g., writing-system descriptions, phonology, grammar, stylistics).

Given properly defined components, definition of a corpus-based dictionary is relatively simple, as illustrated in CBD.dtd (in general, the DTD's included with this paper are informal rather than final; in particular, use of XML namespaces requires discussion and agreement at the meeting).

The CBD contains material required for tagging constraints (terminological registries); material required for editorial constraints (abbreviations lists and bibliographies); the corpora themselves; and lexical articles.

These components, which have lives and uses of their own, need to be designed in such a way as to facilitate inclusion in the overarching formally defined CBD. The formal definition of a CBD structure, which goes beyond the scope of other component projects, is of course a necessary requirement for the production of user-agents/interfaces designed to work with an integrated group of components.

ID Management

The design of the CBD assumes that ID's will be managed by the system rather than by users. ID's will be assigned by the system when importing material or creating it, and the user-interface will provide facilities to select items from menus composed via keys and supply the appropriate IDREF to the selected item.
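
The division of labour described above can be sketched as follows. This is only an illustration under stated assumptions: the class `IdAllocator`, the function `build_menu`, the ID format and the second menu key are all hypothetical and are not taken from the PSD implementation.

```python
import itertools


class IdAllocator:
    """Hands out unique IDs so that users never type them by hand.

    The prefix and zero-padded counter format are assumptions made
    for this sketch only.
    """

    def __init__(self, prefix="cbd"):
        self._prefix = prefix
        self._counter = itertools.count(1)

    def new_id(self):
        return f"{self._prefix}{next(self._counter):06d}"


def build_menu(items):
    """Map human-readable keys to system IDs for use in a selection menu."""
    return {key: item_id for item_id, key in items}


allocator = IdAllocator()

# IDs are assigned at import/creation time, not by the user.
items = [
    (allocator.new_id(), key)
    for key in ("du3 build", "hypothetical second entry")
]

# The user picks a key from the menu; the interface supplies the IDREF.
menu = build_menu(items)
idref = menu["du3 build"]
```

The user only ever sees the keys of `menu`; the value stored in the document is the system-assigned `idref`.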

Navigating Outwards from the CBD

Many of the components of the CBD should link the extracted or subsetted information to the original data-source or description. This would permit navigation from an instance of the tagging of an item to the dataset or description which more fully describes it or permits exploration of it.

Terminological Registries

For an XML-based CBD, very tight validation of the corpus and all its analytical tagging should be a primary concern, as the quality of the dictionary is directly dependent on the quality of the corpus-tagging. The consequence of this is that corpus-validation processes (i.e., validation of the [possibly subsetted] corpus which is integrated into the CBD) should have easy access to the full set of acceptable values for items tagged in the corpus. This can be accomplished by defining a set of registries where the values are specified, and which can then be referenced in an XML-Schema to specify constraints on the values of element-content and attributes utilized in corpus-tagging.

The REGISTRY element contains a TYPE attribute giving the type of items; expected values might be grapheme, morpheme, etc. The actual content of the attribute could be restricted by an XML-Schema specific to the dictionary instance.

Each ITEM has a NM attribute giving a possibly non-unique name. To handle ambiguity, an ITEM may have a sequence of SUBITEMs, which inherit the ITEM's name, but use different keys and/or values to provide disambiguation.

The registry definition is intended to serve several purposes. Firstly, within the CBD, validation of key/value pairs used in the lexical data analysis can be performed automatically.

Secondly, it is intended to facilitate corpus-tagging by providing a formal framework for defining the legal range of values for keys of various types.

Corpus-tagging might refer to registry items either by the ID's they are assigned by the system, or by a sequence consisting of a name and values, e.g., <word k="du3 build"/> (where /du3/ is the Sumerian verb for ``to build''). This permits initial machine-based tagging with subsequent interactive disambiguation, as well as providing a means for specifying key values which have not yet been entered into a registry.
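
A minimal sketch of how a registry could drive validation and disambiguation of corpus tags is given here. It is not the PSD implementation. The lowercase element names, the `id` and `v` attributes, and the second subitem are assumptions made for illustration; only the REGISTRY/ITEM/SUBITEM structure, the TYPE and NM attributes, and the `<word k="du3 build"/>` form come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry instance; "v" (value) and "id" are assumed names.
REGISTRY_XML = """
<registry type="lexeme">
  <item id="r1" nm="du3">
    <subitem id="r1a" v="build"/>
    <subitem id="r1b" v="placeholder"/>
  </item>
</registry>
"""


def load_keys(registry_xml):
    """Return a dict mapping 'name value' keys (or bare names) to IDs."""
    root = ET.fromstring(registry_xml)
    keys = {}
    for item in root.iter("item"):
        name = item.get("nm")
        subitems = item.findall("subitem")
        if not subitems:
            keys[name] = item.get("id")
        for sub in subitems:
            # A subitem inherits the item's name and adds its own value.
            keys[f"{name} {sub.get('v')}"] = sub.get("id")
    return keys


def resolve(k, keys):
    """Return candidate IDs for a tag key.

    One ID: unambiguous. Several: needs interactive disambiguation.
    None: the key is not (yet) in the registry.
    """
    if k in keys:
        return [keys[k]]
    return [i for key, i in keys.items() if key.split(" ")[0] == k]


keys = load_keys(REGISTRY_XML)
corpus = ET.fromstring(
    '<s><word k="du3 build"/><word k="du3"/><word k="xyz"/></s>'
)
report = {w.get("k"): resolve(w.get("k"), keys) for w in corpus.iter("word")}
```

In `report`, a single candidate means the tag validates, several candidates mark a word for interactive disambiguation after machine tagging, and an empty list marks a key that has not yet been entered into a registry.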

Corpora

The Sumerian dictionary project cooperates with several other projects which are preparing corpora of texts of certain genres. Additionally, it is desirable to anticipate the possibility that specialists in a particular subsection of the corpora might take an existing version of a group of texts and improve substantially on it. Conversely, work on the dictionary project often suggests improvements to text readings or interpretations which should find their way back into the corpora to which they are relevant.

The fact that the corpus-preparers and the dictionary-preparers may not be the same imposes certain requirements on the design of the corpus component of the CBD. Firstly, since research needs and interests vary widely, it is impracticable to require that the CBD use corpora developed independently of the Dictionary project as-is. Imported corpora must be transformed for use by the CBD, a transformation which is in any case desirable in order to permit the CBD definition of a corpus to be optimized for easy indexing, referencing and display. Secondly, it requires that some care be exercised in tracking changes in references to permit resynchronization of corpora and dictionary without breaking links from articles into the corpora.

The CBD corpus definition is structured according to groups, texts and references. A reference is the standard unit referred to by dictionary articles, and need not correspond systematically with units such as paragraphs and sentences as used in the source corpora (this implies that resynchronizing modified references with the source corpora will probably require case-by-case inspection by the corpus-maintainers).

At group, text and reference levels, keys are supported for the purpose of defining lexical data by genre, period, place and possibly other criteria.

At group, text and reference levels a label is supported. By definition, the label for a reference is the concatenation of the group, text and reference labels.
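
The label rule can be illustrated with a short sketch. The element and attribute names, the sample labels and the single-space separator are assumptions; the text above specifies only that the three labels are concatenated.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment; labels are placeholders.
CORPUS_XML = """
<corpus>
  <group label="GroupA">
    <text label="Text1">
      <refgrp><ref id="x1" label="12"/></refgrp>
      <refgrp><ref id="x2" label="13"/></refgrp>
    </text>
  </group>
</corpus>
"""


def full_labels(corpus_xml, sep=" "):
    """Map each reference ID to its group + text + reference label."""
    labels = {}
    root = ET.fromstring(corpus_xml)
    for group in root.iter("group"):
        for text in group.iter("text"):
            for ref in text.iter("ref"):
                parts = [group.get("label"), text.get("label"), ref.get("label")]
                labels[ref.get("id")] = sep.join(parts)
    return labels


labels = full_labels(CORPUS_XML)
```

Because the full label is derived, only the three component labels need to be stored and maintained.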

References are not stored directly under the cbd:text, but rather within a cbd:refgrp element. When a reference is first created or imported, it is the first child of this group, and receives a unique ID. To change a reference, the system must first clone it, and install the clone as the first child of the group; the clone receives a new unique ID and is modified as required. Dictionary articles need store only the unique ID of the reference they are utilizing: the system can notify the dictionary-preparer if the element accessed through the ID is not the first child in its reference group, i.e., if the article is no longer referring to the most recent update of the reference, and appropriate action can be taken (in Sumerian, that may mean, for example, that the reference no longer contains the word or usage which it was believed to contain when the original article was prepared). Because superseded versions of references are not deleted, works which refer to them continue to function as expected. One could anticipate checkpointing the reference groups at the time of publishing new editions of the dictionary (whether annually or on some other schedule), resynchronizing all articles to refer to the most recent reference, and purging versions of the references which lie between checkpoints. In other words, any given published edition of the dictionary need only contain the most recent versions of references plus the versions of references which appeared in prior published editions of the dictionary.
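
The clone-and-prepend scheme can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the namespace prefix is dropped, and the element name `ref`, the ID format and all function names are hypothetical.

```python
import copy
import itertools
import xml.etree.ElementTree as ET

_ids = itertools.count(1)


def new_id():
    """System-assigned unique ID (format assumed for this sketch)."""
    return f"ref{next(_ids)}"


def create_reference(text_el, transcription):
    """Create a reference as the first (and only) child of a new refgrp."""
    grp = ET.SubElement(text_el, "refgrp")
    ref = ET.SubElement(grp, "ref", {"id": new_id()})
    ref.text = transcription
    return ref


def revise_reference(grp, transcription):
    """Clone the current version, give the clone a new ID, put it first."""
    clone = copy.deepcopy(grp[0])
    clone.set("id", new_id())
    clone.text = transcription
    grp.insert(0, clone)
    return clone


def is_current(text_el, ref_id):
    """True if the referenced element is the first child of its group."""
    for grp in text_el.iter("refgrp"):
        for pos, ref in enumerate(grp):
            if ref.get("id") == ref_id:
                return pos == 0
    raise KeyError(ref_id)


text = ET.Element("text")
original = create_reference(text, "first reading")
article_idref = original.get("id")  # what a dictionary article stores

revise_reference(text[0], "improved reading")

# The old ID still resolves, but the system can flag it as out of date.
stale = not is_current(text, article_idref)
```

Purging at a checkpoint would then amount to removing those children of each group that are neither first nor cited by a published edition.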

A reference is stored in the CBD in several forms. A presentation form (cbd:pres) is essentially a cached XHTML version of the reference for display purposes. It is always derived automatically from the transcription given in cbd:data/t, and should be implemented as read-only by user-interfaces. A translation may be cached in the cbd:trans element. If it is imported from another source, the attribute XID should be the ID of the source element, and the URL attribute should be the URL by which the document containing the source element can be reached.

If an externally derived translation is modified, the EDITED flag should be set to 'y'; this permits subsetting of modified translations to be returned to the corpus-preparers.

If the translation has been checked by the dictionary preparer, the AUDITED flag should be set to 'y'; this permits tracking of in-house reference-checking. Because translations may often be provided by the dictionary preparer, the URL attribute is not required; if the XID attribute is set to 'LOCAL', the URL attribute is ignored.
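
As a sketch of how these flags might be used, the following selects edited external translations for return to the corpus-preparers and lists references still awaiting audit. The lowercase attribute spellings, the wrapper elements, the sample IDs and the example.org URL are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; only the XID/URL/EDITED/AUDITED semantics come
# from the description in the text.
REFS_XML = """
<refs>
  <ref id="a">
    <trans xid="src-17" url="http://example.org/corpus.xml"
           edited="y" audited="y">placeholder translation</trans>
  </ref>
  <ref id="b">
    <trans xid="src-18" url="http://example.org/corpus.xml"
           edited="n" audited="n">placeholder translation</trans>
  </ref>
  <ref id="c">
    <trans xid="LOCAL" edited="y" audited="n">placeholder translation</trans>
  </ref>
</refs>
"""


def edited_external(root):
    """(xid, url) pairs for imported translations modified in-house."""
    return [
        (tr.get("xid"), tr.get("url"))
        for tr in root.iter("trans")
        if tr.get("edited") == "y" and tr.get("xid") != "LOCAL"
    ]


def unaudited(root):
    """IDs of references whose translation has not yet been checked."""
    return [
        ref.get("id")
        for ref in root.iter("ref")
        if any(tr.get("audited") != "y" for tr in ref.iter("trans"))
    ]


root = ET.fromstring(REFS_XML)
to_return = edited_external(root)
to_check = unaudited(root)
```

Local translations are skipped in `edited_external` because they have no external source to be returned to, which is also why their URL attribute can be ignored.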

Finally, the reference contains a data section. This consists of the transcription drawn from the corpus (T), possibly transformed, and a lexical analysis of the sentence (or other unit) suitable for the needs of the dictionary (S). (See lexdata.dtd for further comments.) This analysis will be the basis for query operations and must support incremental development. The cbd:data element also supports the attributes EDITED and AUDITED as discussed for cbd:trans.

Dictionary Articles

A sample dictionary article DTD with some explanatory commentary is provided with this paper. This is a lightly modified version of the PSD article DTD, the first version of which was prepared some time ago. I have decided against discussing it in detail in the pre-conference version of this paper in view of anticipated contributions to the meeting.

Bibliography and Abbreviations

For the Sumerian dictionary it is necessary to situate the lexical analysis in its disciplinary context. This leads to the requirement of a self-contained bibliography which is indexed according to lexicographical contributions. Bibliographical items may also be referred to in the field in the dictionary articles.

The bibliography is simplified for easy presentation, but may be stored externally in another format (software is already available to work with the TEI biblStruct). XSL scripts down-translate the external form to the presentation form, and the down-translated form retains the ID of the original external form so that more sophisticated processing of the bibliography could be done by redirecting through the presentation form to the external form.

The use of standard abbreviations (ED for the period Early Dynastic, etc.) entails the need for a list of abbreviations.

The cbd:abblist structure is used for all kinds of abbreviations, including the specification of the bibliography. A list of cbd:abb elements is given, each with an abbreviation element and an expansion element.
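
A minimal sketch of reading such a list into a lookup table follows. The child element names `abbrev` and `expansion` are hypothetical, since the text above does not name them, and the namespace prefix is dropped for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance; ED = Early Dynastic is the example from the text.
ABBLIST_XML = """
<abblist>
  <abb><abbrev>ED</abbrev><expansion>Early Dynastic</expansion></abb>
</abblist>
"""


def load_abbreviations(abblist_xml):
    """Return {abbreviation: expansion}, rejecting duplicate entries."""
    table = {}
    for abb in ET.fromstring(abblist_xml).iter("abb"):
        short = abb.findtext("abbrev")
        if short in table:
            raise ValueError(f"duplicate abbreviation: {short}")
        table[short] = abb.findtext("expansion")
    return table


def expand(abbreviation, table):
    """Expand a known abbreviation; leave unknown strings untouched."""
    return table.get(abbreviation, abbreviation)


abbreviations = load_abbreviations(ABBLIST_XML)
period = expand("ED", abbreviations)
```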

Implementation

Implementation of the CBD need not take the form of an explicit, serialized XML document; it is perfectly feasible to generate a persistent DOM (PDOM) directly or feed the CBD components directly to XML-repository software via SAX events. What is important is simply that the relative location of all parts of the CBD is known to all other parts so that XPaths and other query/linking mechanisms can be used easily and robustly.
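
The point about known relative locations can be illustrated with a small in-memory tree. Everything structural here is assumed for the sketch: the element names `corpora`, `articles`, `article`, `usage` and `cite`, the `idref` attribute, and the sample content. Python's ElementTree stands in for whatever DOM or repository is actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical combined CBD tree: corpus and articles under one root.
CBD_XML = """
<cbd>
  <corpora>
    <group label="GroupA"><text label="Text1">
      <refgrp><ref id="x1">transcription placeholder</ref></refgrp>
    </text></group>
  </corpora>
  <articles>
    <article id="a1"><usage><cite idref="x1"/></usage></article>
  </articles>
</cbd>
"""


def resolve_citations(cbd_root, article_id):
    """Follow each citation in an article to the reference it points at."""
    article = cbd_root.find(f"./articles/article[@id='{article_id}']")
    if article is None:
        raise KeyError(article_id)
    resolved = []
    for cite in article.iter("cite"):
        idref = cite.get("idref")
        # Because the corpora sit at a known place, a fixed path suffices.
        target = cbd_root.find(f"./corpora//ref[@id='{idref}']")
        resolved.append((idref, None if target is None else target.text))
    return resolved


cbd = ET.fromstring(CBD_XML)
citations = resolve_citations(cbd, "a1")
```

The same paths would work whether the tree was parsed from a serialized document, as here, or built directly in a persistent store.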

A proof-of-concept implementation, using the PDOM and CMS from the Ozone project, is intended to be ready by mid-December. Its planned components are detailed below.

CBD-Spec

An XLink-based specification which gives a set of links to sources, together with styles to apply to the linked entities, from which the CBD can then be generated. It is designed to be simple to create with a combination of hand-editing and scripts, and is somewhat analogous to the Makefile in project-management using the programmer's utility `make'.

CBD-Maker

A tool which processes the CBD-Spec and initializes the CBD.

CBD-Cat

A facility for echoing the PDOM to stdout.

CBD-Export

A facility for creating an HTML version of the CBD, split into numerous small chunks to facilitate browsing the CBD over the web. Resolution of references in dictionary articles will be handled using a CGI script; it would also be feasible, though very bulky, to resolve the references statically in the HTML. Re-sorting references by time/place/genre/orthography etc., can be done using XSL in the browser.

A more sophisticated tool would be able to maintain the HTML version incrementally in concert with changes in the PDOM. An alternative would be to serve chunks of the PDOM upon request; this would involve additional server overhead and would not replace generation of static HTML snapshots, as the latter are useful for resource-poor and/or non-networked environments.

CBD-Browser

Primarily an article browser and dynamic constructor; the other parts of the CBD are available through a tree view. The tree view allows navigation in the superdocument, and clicking on an article brings it into the view pane. Other parts of the superdocument can be edited through simple form/textfield-based editing, and a configuration facility allows parts of the superdocument to be marked read-only.

CBD-Editor

Really an article editor/creator; a purpose-built outliner which can handle ID/IDREF issues behind the scenes and allow drag-and-drop sorting of reference sets.

Querying

At present we use a customized indexing and query engine on the PSD project which is optimized for use with Sumerian, where the grapheme/lexeme overlap can be exploited.

For a generic CBD it will be necessary to define generic query methods, possibly using an SQL backend. This issue is subject to further design, discussion and implementation.

Contribution

The entire suite of definitions and tools for the CBD will be offered to the repository.

One issue which deserves some attention is that of coding-standards and tools. Is it feasible and/or desirable to select a set of best-practice language tools (e.g., Java and Perl)? Is it possible to locate predefined coding-standards for the languages to be used in the repository to ensure some degree of uniformity in the code organization? General solutions are much easier to understand if they follow a consistent set of coding practices.

Another issue arises from lexical data tagging and its relationship with dictionary articles. We should specify a set of grammatical terms to cover morphology, syntactic roles and so on. Having done that as the basis of lexical tagging, can we define a typology of usage constraints for dictionary articles? That is, beyond the structure of dictionary articles, can we define the range of constraints that inform the construction of usage sections?