Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.
[Addresses the Interlinear Text: Data Models & Archives (Store) need]
Abstract: This paper surveys several software products and standards used for the production of interlinear texts, extracting from them their differences and points in common. The goal of this exercise is to produce a general set of requirements for a data model for interlinear texts that can be used for traditional kinds of interlinear texts and that can be extended for uses not ordinarily thought of as interlinearization, e.g. alignment at the level of the phonological segment. Of the standards surveyed, the PTEXT specification delivers the most complete content model, and the annotation graph model presents a flexible approach to alignment.
An important task in any linguistic field researcher's work is the production of interlinear texts, i.e. annotated documents of a variety of types -- field notes, structured and unstructured collections of utterances illustrating various linguistic phenomena, folk stories, songs, newspaper stories, etc. Annotation may take place on various levels, including phonological, phonetic, morphological, syntactic, and discourse levels, and the integration of these various levels is a non-trivial task. In this paper I survey several markup systems for interlinear text (IT) and derive from them a set of requirements for what a general IT data model should provide. I also suggest some kinds of markup that generalize over the approaches taken in each system.
The paper consists of two primary sections: (1) a brief overview of the kinds of information stored in various IT systems, and how this information is stored in each; (2) a set of requirements for IT that the models employed in these systems imply, along with some suggestions on how these requirements might be implemented.
I survey a variety of software packages and standards used for collecting and annotating IT corpora -- the Berkeley Interlinear Text Collector (BITC), Langues et Civilisations à Tradition Orale (LACITO) XML DTD, Shoebox from SIL, the PTEXT specification, and the annotation graph (AG) model.
I begin with the Berkeley Interlinear Text Collector (BITC) since its model is the least elaborate and thus is useful for illustrating some of the core concepts shared by all the models. BITC was developed as part of the Ingush Language project at UC Berkeley. It has also been used in Field Methods classes at UC Berkeley that want to take advantage of its web-enabled data collection capability. Its data model is too simple to be appropriate for truly large-scale data collection, however.
BITC files are separated into numbered records marked by <record>, which in the default case correspond to sentences, though the sentence interpretation is not enforced by the model. Ultimately it is up to the user to decide to what kind of descriptive object records correspond, e.g. sentences, paradigms, phrases. Each record is separated into words, marked with the <word> tag. Each <word> tag is aligned with an <interlin> element and an optional <note> element, which provide the interlinear gloss and notes at the word level. Alignment is indicated by shared values of the id attribute. Record-level tags provide a way to track various properties of the entire record, such as the free translation, source, speaker, the record's compiler, and the compiler's confidence in the transcription and analysis. Some additional special-purpose tags have been provided to allow markup of a translation of King Lear into Ingush, specifically, tags to mark a record as a scene description or stage direction. A simple example record demonstrates how the tags are used and one possible way this record might be displayed.
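The id-based alignment just described can be sketched in a few lines of Python. The element names <record>, <word>, <interlin> and <note> come from the description above; the exact nesting, the n attribute, and the placeholder items WORD2 and GLOSS2 are assumptions made for illustration and are not taken from an actual BITC file.

```python
import xml.etree.ElementTree as ET

# Hypothetical BITC-style record; nesting and attribute values are assumed.
RECORD = """
<record n="1">
  <word id="1">q'ameal</word>
  <word id="2">WORD2</word>
  <interlin id="1">translation</interlin>
  <interlin id="2">GLOSS2</interlin>
  <note id="2">uncertain transcription</note>
</record>
"""

def align_by_id(record_xml):
    """Return (word, gloss, note) triples joined on the shared id attribute."""
    record = ET.fromstring(record_xml)
    glosses = {e.get("id"): e.text for e in record.findall("interlin")}
    notes = {e.get("id"): e.text for e in record.findall("note")}
    return [
        (w.text, glosses.get(w.get("id")), notes.get(w.get("id")))
        for w in record.findall("word")
    ]

rows = align_by_id(RECORD)
```

Because the join is made on id values rather than on position, a word without a note simply yields None in the third slot.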
LACITO is a data archiving project whose goal is to disseminate recorded speech and its transcription. The LACITO XML DTD is slightly more elaborate than the BITC markup. The most important difference between LACITO and BITC markup is that the former provides a mechanism for aligning transcriptions with audio via the <AUDIO> element. The currently available DTD allows the alignment of audio with the <S> element only, though it could easily be modified to allow alignment with both bigger and smaller structural units. Alignment is accomplished through the start and end attributes of the <AUDIO> element.
The remainder of the DTD is broadly reminiscent of BITC's markup: the <S> element corresponds to <record>, and <S> is composed of a series of word/gloss pairs. One minor difference between the two markup styles is that these pairs are aligned by wrapping the word and gloss tags <FORM> and <GLS> in a grouping tag <W> rather than by shared id attribute values. The LACITO DTD also contains an <S>-level element for recording punctuation at the end of a sentence, <PONCT>. In BITC, punctuation is stored as part of the wordform itself.
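For comparison, a rough sketch of the wrapper-based alignment follows. The element names <S>, <W>, <FORM>, <GLS>, <PONCT> and <AUDIO> and the start and end attributes are those mentioned above; the placement of <AUDIO> inside <S>, the id attribute, the time values, and the placeholder forms are assumptions and may not match the published DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical LACITO-style sentence; layout and values are assumed.
S_XML = """
<S id="s1">
  <AUDIO start="0.00" end="2.35"/>
  <W><FORM>FORM1</FORM><GLS>GLOSS1</GLS></W>
  <W><FORM>FORM2</FORM><GLS>GLOSS2</GLS></W>
  <PONCT>.</PONCT>
</S>
"""

def read_sentence(xml_text):
    """Return the audio span, the word/gloss pairs, and the final punctuation."""
    s = ET.fromstring(xml_text)
    audio = s.find("AUDIO")
    span = (float(audio.get("start")), float(audio.get("end")))
    # Each <W> wrapper groups one form with its gloss, so no ids are needed.
    pairs = [(w.findtext("FORM"), w.findtext("GLS")) for w in s.findall("W")]
    return span, pairs, s.findtext("PONCT")

span, pairs, punct = read_sentence(S_XML)
```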
Shoebox is a data collection and management tool that provides an IT editor as well as many other features. Shoebox stores IT data in text files separated into records. Each record can store a number of fields, including text (transcription), morphemic analysis, glosses, part-of-speech, and a free translation of the entire record. Shoebox uses a different style of markup than LACITO and BITC. Instead of explicitly marking each word/gloss pair, Shoebox uses a line-oriented style in which the text is on a demarcated line, the morphemic representation is on another line, and the glosses and part-of-speech values are on a third and fourth line, respectively. Alignment of the various levels is inferred from the spacing of disparate elements. In this sample of Frisian text from the Shoebox tutorial the text (\t) and morphemic analysis (\m) lines are aligned so that the polymorphemic text item 'Berne' takes as much space on its line as its two analytic counterpart items, 'bern' and '-e', take on their line. The gloss and part-of-speech elements are lined up with the morphemic analysis in similar fashion.
Besides these differences in markup and alignment styles, the Shoebox format also allows for finer-grained content than LACITO and BITC. Neither LACITO nor BITC provides the morphemic analysis level that Shoebox has, and both conflate Shoebox's gloss (\g) and part-of-speech (\p) levels into a single level.
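How alignment might be recovered from spacing can be sketched as follows. This is not Shoebox's own algorithm, only one plausible reading of column-based alignment: each item on the lower line is attached to the upper-line item that starts at or before its column. The items 'Berne', 'bern' and '-e' are from the sample discussed above; WORD2 and morph2 are placeholders, and the field markers are assumed to have been stripped at equal width from both lines.

```python
import re

def tokens_with_columns(line):
    """Return (start_column, token) pairs for a whitespace-delimited line."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S+", line)]

def align_lines(upper, lower):
    """Group each token of `lower` under the `upper` token that starts
    at or before its column. Assumes no lower token precedes the first
    upper token."""
    heads = tokens_with_columns(upper)
    groups = [(tok, []) for _, tok in heads]
    for col, tok in tokens_with_columns(lower):
        # Index of the last upper token starting at or before this column.
        idx = max(i for i, (start, _) in enumerate(heads) if start <= col)
        groups[idx][1].append(tok)
    return groups

text_line = "Berne   WORD2"    # content of the \t line
morph_line = "bern -e morph2"  # content of the \m line
aligned = align_lines(text_line, morph_line)
```

The same function could be applied again to attach gloss and part-of-speech items to morphemes, since those lines are aligned to the \m line in the same way.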
The PTEXT specification is the most developed data model of all those surveyed, with a comprehensive DTD. An important feature of this DTD is that it includes the possibility of storing hierarchical phrase and morphological structures rather than flat structures only, which is the only kind of structure found in the preceding models. The <ws> (word structure) and <ps> (phrase structure) tags are used for this purpose. It also includes a way to encode recursive feature structures (AVMs), through the organization of <f> (feature) and <fs> (feature structure) nodes.
PTEXT is also notable for being a self-contained format, with a lexicon section defining all the wordforms used in a given text and their morphological analyses. In a text segment wordforms are identified by reference to their occurrence in the lexicon, and the string representing that wordform is not repeated in the text section. In addition, punctuation and capitalization are separated from the wordforms. Like wordforms, punctuation symbols are stored in a special section, and actual occurrences of them in the text are identified by reference to items in that section. Capitalization is stored as an attribute of <w> (word) elements rather than stored as part of the wordform's orthographic representation.
Bird & Liberman's annotation graph (AG) model doesn't have a content model for IT, but it does answer the question of how to align annotations in a different way than the other models surveyed in this paper. Where the other models align elements by compartmentalizing them, i.e. including smaller elements in a larger structure, the AG model is more general. Every element is aligned with respect to a begin and end node in a directed graph. Alignment of annotation elements with respect to one another is obtained by comparing their begin and/or end nodes. Most, and probably all, of the annotations used in the other IT models can be properly translated into an AG representation. This representation has the advantage of generalizing to kinds of alignment not found in traditional IT. The begin and end nodes are ideal for storing time indices of the sort needed to align texts with multimedia files, for example.
A thorough discussion of the AG model is beyond the scope of this paper. An AG representation of the King Lear sample will serve to illustrate the model. This sample uses the AG alignment model with content tags similar to those found in the original BITC markup. Note that individual words and their interlinear glosses are aligned by virtue of having shared <begin> and <end> id attributes. For example "q'ameal" and its gloss "translation" have id attributes that begin at "1" and end at "2". The free translation spans the arc from the begin node "1" to the end node "3".
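The node-based alignment can also be modelled outside of XML. The following Python sketch is not the AG toolkit's API; the Arc class, its field names, and the placeholder labels are assumptions. Only "q'ameal", "translation" and the node numbers 1 to 3 come from the sample described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    begin: int   # id of the begin node
    end: int     # id of the end node
    kind: str    # annotation type, e.g. "word", "gloss", "free"
    label: str

ARCS = [
    Arc(1, 2, "word", "q'ameal"),
    Arc(1, 2, "gloss", "translation"),
    Arc(2, 3, "word", "WORD2"),                # placeholder
    Arc(2, 3, "gloss", "GLOSS2"),              # placeholder
    Arc(1, 3, "free", "FREE TRANSLATION"),     # placeholder
]

def coextensive(arcs, target):
    """Arcs sharing both the begin and the end node of `target`."""
    return [a for a in arcs
            if a is not target and (a.begin, a.end) == (target.begin, target.end)]

def covered_by(arcs, target):
    """Arcs lying within the span of `target`.
    Assumes node ids increase along the graph, which a real AG need not guarantee."""
    return [a for a in arcs
            if a is not target and target.begin <= a.begin and a.end <= target.end]

gloss_of_first = coextensive(ARCS, ARCS[0])
under_free = covered_by(ARCS, ARCS[4])
```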
Based on this survey of IT software and data models, I have compiled a list of requirements for a general data model for IT. Some of these requirements are made explicit in the documentation of the various projects surveyed. Others are implicit in their data models or can be deduced from their user interfaces, and a few are drawn from personal experience with the Ingush project.
The model must align elements on a variety of linguistic levels.
At a minimum, the model will accommodate all those levels found in the existing IT applications: <w> (wordform, i.e. transcription, which may be orthographic and/or phonemic); <m> (morphemic analysis); <syn> (syntactic analysis); <gloss> (in one or more languages); <text> (a complete text). Obviously, this list cannot be comprehensive, and some applications will define new content tags, e.g. discourse tags or tags for phonetic and phonological structure, including prosody.
The base of alignment should be flexible.
All of the IT applications surveyed take the text, i.e. the transcription, as the basic element to which other elements, such as morphemes and/or glosses, align. A more general model would not require a traditional transcription as its base. A more flexible approach is to align elements to an abstract time stream rather than to any particular linguistic element or event. Elements aligned with the same point in abstract time will be aligned with respect to each other. Such an approach allows arbitrary linguistic units to be aligned with each other, so applications can identify relationships based on their unique needs. For example, an application might take sentences as a more basic unit than words, or it might be more interested in the level of phonological segments.
The AG model is especially suited to the idea of using abstract time streams as the base of alignment, as its directed graph structure is designed for it.
Alignment must be robust in the face of unknown, incomplete, or wrong information on one or more related levels.
Any analysis, and especially that undertaken as field work, will contain false starts, mis-transcriptions, and incorrect or tentative analyses. It may also run afoul of technical problems, for example, noise in a field recording that obscures a few morphemes but not the meaning of the sentence in which those morphemes appear. It is vital that the model be able to account for gaps in one level without disturbing other levels that are related to the first.
An illustration shows how gaps could be expressed using the AG formalism through the alignment of empty elements.
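One way such empty elements might be modelled is sketched below. The representation is an assumption: an arc whose label is None stands for material that is known to occupy a stretch of the text but could not be transcribed, while the free translation on the same nodes remains intact. All labels are placeholders.

```python
from typing import NamedTuple, Optional

class Arc(NamedTuple):
    begin: int
    end: int
    kind: str
    label: Optional[str]   # None marks a gap on this level

ARCS = [
    Arc(1, 2, "word", "WORD1"),
    Arc(2, 3, "word", None),                 # e.g. obscured by noise
    Arc(3, 4, "word", "WORD3"),
    Arc(1, 4, "free", "FREE TRANSLATION"),   # unaffected by the gap
]

def gaps(arcs, kind):
    """Return the (begin, end) node pairs of empty arcs on one level."""
    return [(a.begin, a.end) for a in arcs if a.kind == kind and a.label is None]

def render(arcs, kind, placeholder="___"):
    """Render one level in node order, showing gaps with a placeholder."""
    ordered = sorted((a for a in arcs if a.kind == kind), key=lambda a: a.begin)
    return " ".join(a.label if a.label is not None else placeholder for a in ordered)

word_gaps = gaps(ARCS, "word")
word_line = render(ARCS, "word")
```

The placeholder string exists only at display time; nothing is inserted into the stored text.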
In addition to these basic requirements, it is also important that the data model supply structures commonly used in linguistic analysis.
The model must provide hierarchical structures for morphological and syntactic analysis.
Tree structures are useful at many levels of linguistic description. In many cases, hierarchical structures may be inferred from placement in an abstract time stream since a parent-child relationship can be deduced from the presence of a larger structure that exactly bounds a group of smaller structures.
This hierarchical structure can be made more explicit as well, in order to distinguish accidental bounding of smaller structures from meaningful bounding. PTEXT's <ws> and <ps> labels, for example, explicitly identify certain structures as being in a hierarchical relationship. Since tree structures are so common, we can further generalize these labels as 'analytic trees', of which word structure and phrase structure are particular types targeting different levels of linguistic description. The <ws> and <ps> labels would be equivalent to the notation <at level={w|p}>, where at represents the analytic tree and w and p represent 'word' and 'phrase' level, respectively. Hierarchical structures for other levels could be added simply by defining new values for the level attribute.
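The inference of parent-child relations from bounding can be sketched as follows. The function name, the span labels and the node numbers are hypothetical. The sketch takes the parent of a span to be the smallest strictly larger span that contains it, which is exactly the kind of inference that cannot tell accidental bounding from meaningful bounding, and which cannot recover a relation between two spans of identical extent.

```python
def infer_parents(spans):
    """Map each span label to the label of the smallest span that
    properly contains it, or to None if no such span exists."""
    parents = {}
    for label, (b, e) in spans.items():
        containers = [
            (pe - pb, plabel)
            for plabel, (pb, pe) in spans.items()
            if plabel != label and pb <= b and e <= pe and (pe - pb) > (e - b)
        ]
        parents[label] = min(containers)[1] if containers else None
    return parents

# Hypothetical phrase-level spans over nodes 0..3 of an abstract time stream.
SPANS = {
    "S": (0, 3),
    "NP": (0, 1),
    "VP": (1, 3),
    "V": (1, 2),
    "NP2": (2, 3),
}

parents = infer_parents(SPANS)
```

An explicit analytic tree, with a level attribute as suggested above, would store these relations rather than recompute them.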
The model must provide for recursive feature structures (AVMs).
The PTEXT DTD provides for these through the <f> and <fs> elements.
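The recursion involved can be illustrated with a short sketch. Only the element names <f> and <fs> are taken from the PTEXT description; the name attribute, the nesting convention, and the feature names and values shown here are assumptions and should not be read as the actual PTEXT DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical feature structure; attribute names and values are assumed.
FS_XML = """
<fs>
  <f name="cat">N</f>
  <f name="agr">
    <fs>
      <f name="num">pl</f>
      <f name="case">nom</f>
    </fs>
  </f>
</fs>
"""

def fs_to_dict(fs):
    """Convert an <fs> element into a nested dict (an AVM).
    A feature's value is either atomic text or another feature structure."""
    result = {}
    for f in fs.findall("f"):
        nested = f.find("fs")
        if nested is not None:
            result[f.get("name")] = fs_to_dict(nested)
        else:
            result[f.get("name")] = (f.text or "").strip()
    return result

avm = fs_to_dict(ET.fromstring(FS_XML))
```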
The model must provide a way to express tentative hypotheses and ambiguous analytic structures.
The PTEXT DTD also provides for these through the <*Alt> elements (e.g. <psAlt>).
The model should provide a way to define null objects.
Null objects are useful for a variety of reasons, especially if there is more than one type of null object. For example, an application might want to distinguish an unsuccessful attempt at morphemic analysis from an analysis that is simply missing. Another application might want to use null values as theoretical objects (e.g. PRO vs. pro). Not all theories of analysis use empty categories, but for those that do, this kind of object would prove useful. If null objects were omitted from the data model, applications could provide them on an ad hoc basis by defining a special string as the null object, but this approach may not be satisfactory. In particular, it is undesirable to alter a published text by inserting strings into it. Also, the presence of these strings could interfere with certain types of queries.
The null object problem is related to the problem of gaps raised in requirement (3), i.e. that alignment be robust in the face of unknown, incomplete, or wrong information, and the same solution will apply to both.
The model must allow annotation with respect to multiple target languages.
The implementation of this requirement is relatively simple, as it only requires <gloss> elements to have a lang attribute.
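A minimal sketch of this arrangement follows. The <gloss> element and its lang attribute are as proposed above; the enclosing <w> and <form> elements and the language codes are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical word element carrying glosses in two target languages.
W_XML = """
<w>
  <form>q'ameal</form>
  <gloss lang="en">translation</gloss>
  <gloss lang="fr">traduction</gloss>
</w>
"""

def glosses_by_lang(w_xml):
    """Return a mapping from language code to gloss text."""
    w = ET.fromstring(w_xml)
    return {g.get("lang"): g.text for g in w.findall("gloss")}

glosses = glosses_by_lang(W_XML)
```

An application can then select a display language without touching the wordform or the other glosses.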
The model must provide free-form notes at any level of analysis.
A researcher's notes may target any part of the annotation, from a small phonetic detail of a particular wordform to the annotator's general impressions of the text as a whole.
The model must allow for the tracking and transcription of an arbitrary number of overlapping interlocutors.
Multiple interlocutors may be needed for the annotation of most kinds of text, especially annotation of spontaneous speech.
External references provide a way to relate a text or portions of text with non-local elements. In some cases, this refers to external multimedia files or texts, or it may refer to entities defined in a data file, such as a lexicon. In addition, it is important that the model work in a way that enhances the reliability of pointers into the text being interlinearized.
The model must allow for the synchronization of the source text with its translation(s) and with multimedia files.
Current technology provides great possibilities for the integration of sources in the form of audio and other multimedia files and their annotation. Again, the AG model is well-suited to this task, as it allows ordered annotations to be aligned with respect to multimedia files without requiring that every annotation have a precise alignment in time with respect to these files.
The model should provide some integration with other possible views of a text not directly related to linguistic analysis.
The underlying text has a structure of its own independent of the linguist's desire to analyze its linguistic structure, and it is necessary that the data model provide a method of referring to the underlying textual structure.
At a minimum, pointers to external references should be provided. For instance, the Ingush example taken from a translation of King Lear could be enhanced with a reference to the scene in which it appears. This merging of textual structure with linguistic analysis could be taken a step farther by simultaneously marking up the text as IT while also following a DTD designed for describing the structure of plays. The tracking of parallel markup structures must be handled by applications, however, and cannot be incorporated into the IT data model directly. IT is about alignment and cannot hope to provide markup for the unlimited range of text types that might be interlinearized. The IT data model can at least take the minimal step of providing pointers to other structures, however, as the need for references is quite common. Shoebox and PTEXT, for instance, have ways to identify records as specific Biblical verses.
The model must provide unique, durable reference markers to be placed at any point in the source text.
Nodes in the data structure must be uniquely identifiable. Both internal and external references to these identifiers may exist, and it is crucial that they not change once established. Revision of a text potentially interferes with this durability of reference, particularly if the revision involves insertion or deletion of a node. This problem is made more difficult if node identifiers are also used to encode the relative order of the nodes, as they are in the BITC model (and appear to be in the AG model). It is preferable to separate the identification function from the ordering function so that the former can remain the same even when the latter changes. An illustration of what unique element identifiers might look like in the AG format makes use of an idref attribute on arcs.
The separation of identification and ordering functions does not entirely solve the problem, and some kind of version control system will be needed to ensure that references to deleted structures can be recovered, and to ensure that identifiers of deleted structures aren't reused on inserted structures.
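The separation of identification from ordering, together with a minimal guard against identifier reuse, might be sketched as follows. The NodeList class and its method names are hypothetical, and the set of deleted identifiers is only a stand-in for a real version control system.

```python
import itertools

class NodeList:
    """Keeps node identity (opaque ids) apart from node order (list position)."""

    def __init__(self):
        self._next_id = itertools.count(1)  # ids are never handed out twice
        self.order = []                     # current sequence of live node ids
        self.deleted = set()                # tombstones for removed nodes

    def insert(self, position):
        """Create a node at `position` and return its new, permanent id."""
        node_id = f"n{next(self._next_id)}"
        self.order.insert(position, node_id)
        return node_id

    def delete(self, node_id):
        """Remove a node but remember that its id once existed."""
        self.order.remove(node_id)
        self.deleted.add(node_id)

    def position(self, node_id):
        """Current position of a live node, or None if it was deleted.
        An id that never existed raises ValueError."""
        if node_id in self.deleted:
            return None
        return self.order.index(node_id)

nodes = NodeList()
a = nodes.insert(0)
b = nodes.insert(1)
c = nodes.insert(1)   # inserted between a and b; a and b keep their ids
nodes.delete(c)
d = nodes.insert(1)   # receives a fresh id, not the deleted one
```

A reference to b remains valid throughout, even though its position changes with each insertion and deletion.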
The model should provide the possibility of storing the representation of a wordform in a text once only, with each instance of that wordform in the text instantiated by a reference to the normalized representation.
By allowing wordforms to be instantiated with a reference, the task of updating the representation of a wordform is greatly simplified, as a change to a normalized wordform is automatically propagated to all forms that refer to it. While this feature of the data model is certainly useful, the decision as to whether representations should be repeated for each occurrence or whether they should appear by reference should be made at the application level. Thus, in order to be maximally general, the data model should allow these kinds of references but should not require them.
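The optional nature of such references can be illustrated with a small sketch. The dictionary layout, the key names ref and form, and the identifiers wf1 and wf2 are assumptions; the point is only that a token may either point to a normalized wordform or carry its own string.

```python
# Hypothetical lexicon of normalized wordforms, keyed by identifier.
LEXICON = {"wf1": "q'ameal", "wf2": "WORD2"}

# A text in which most tokens refer to the lexicon and one is stored inline.
TEXT = [{"ref": "wf1"}, {"ref": "wf2"}, {"ref": "wf1"}, {"form": "WORD3"}]

def surface(token, lexicon):
    """Resolve a token to its string, by reference if one is given."""
    return lexicon[token["ref"]] if "ref" in token else token["form"]

def render(text, lexicon):
    """Join the resolved tokens into a display line."""
    return " ".join(surface(t, lexicon) for t in text)

before = render(TEXT, LEXICON)
LEXICON["wf1"] = "REVISED"      # a single correction to the normalized form
after = render(TEXT, LEXICON)   # both referring tokens now show the revision
```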
The model should allow strings to be stored in canonical form while preserving information regarding their graphical representation in context.
By storing strings in a canonical form, operations such as querying can be simplified. For example, a user could search for 'baker' to retrieve all instances of references to the word 'baker' (the occupation) and could search for 'Baker' to retrieve all instances of 'Baker' (the name). If the wordforms are stored in canonical form, even sentence-initial instances of 'baker' will be returned correctly. The representation of sentence-initial 'baker' might look like the following:
<form cap="init">baker</form>
The application deduces the proper display representation from the wordform and its cap attribute. Again, the data model should provide the possibility for normalization but should not enforce it.
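How an application might derive the display form is sketched below. The cap="init" value is the one shown in the example above; the value "all" and the function name are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

def display_form(form_xml):
    """Return the display string for a <form> element stored in canonical form."""
    form = ET.fromstring(form_xml)
    text = form.text or ""
    cap = form.get("cap")
    if cap == "init":
        # Capitalize only the first character, leaving the rest untouched.
        return text[:1].upper() + text[1:]
    if cap == "all":   # hypothetical value, not attested in the source
        return text.upper()
    return text

shown = display_form('<form cap="init">baker</form>')
stored = ET.fromstring('<form cap="init">baker</form>').text
```

A query for 'baker' runs against the stored value, so the sentence-initial instance is found even though it is displayed with a capital.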
Another important part of the text representation is punctuation. Some of the software surveyed above stores punctuation in the same field as the wordform. A cleaner approach is the one taken by LACITO and PTEXT, where punctuation is explicitly identified as a separate element.
In summary, the PTEXT content model already provides tags that are general enough to cover the kinds of markup found in the other content formats surveyed and satisfies many of the requirements listed. It also provides features the other markup systems do not, including hierarchical structures and feature structures. Some of these can be generalized even further. For example, a generalized tree structure could be created to cover all types of hierarchical structure.
The AG alignment model provides a very flexible, general approach to alignment, which is a crucial part of creating interlinear texts. Since this model is compatible with the PTEXT content model, a merging of the two models produces a potent combination -- a well-defined model for representing IT that can readily be extended to other levels of linguistic structure than are found in traditional interlinear texts.