A Linguistic Archive on the Web

Michel Jacobson and Boyd Michailovsky
LACITO/CNRS
7 rue Guy Môquet
94800 Villejuif, France.

jacobson@idf.ext.jussieu.fr
boydm@vjf.cnrs.fr
 

Abstract

The LACITO Archive contains sound recordings of connected speech in little-known languages with associated annotation (transcriptions, glosses, translations), including time-alignment data linking the annotation segment by segment to the recordings. When a document is browsed or queried, simultaneous access is provided to the recorded sound.

The fundamental technical options adopted in designing the archive are the use of standard XML and XML tools, Unicode, and standard browsers. All archived text materials are coded in XML, conformant to a DTD. Views on the documents when they are browsed or queried are defined by XSLT stylesheets, which select and format material from the text database and return it to clients running standard web browsers.

All software developed for the archive is open-source. We present three aspects of the archive here: (1) the document data structure, defined by a DTD, (2) the use of XSLT as a query language, and (3) the client-server architecture, including the software which permits the client browser to access the sound resource.

1. Introduction

The LACITO Archive contains sound recordings and associated annotation (transcription, translations, etc.) of connected linguistic texts (e.g. oral literature, personal narratives, conversations) in little-known, often endangered languages. The purpose of the archive is to conserve these documents in a format adapted to modern research methods, and to make both the documents and the tools developed for their exploitation widely available to the academic and other interested communities.

The archived materials -- text and sound -- are accessed using standard web browsers. Browsing gives access to a choice of views on the data (transcription with or without translations or interlinear glosses, translations in either French or English), either sequentially or in response to other queries (all utterances containing a selected morpheme, etc.). In all cases, simultaneous access is provided to the recorded sound.

The LACITO Archive at present comprises a hundred or so texts in a score of languages, mainly languages of New Caledonia and of Nepal, and is accessible on the LACITO intranet. A dozen texts in five languages are available for public browsing on the archive website (http://lacito.archive.vjf.cnrs.fr). These texts and the archive software can be downloaded for local use.

In this paper we describe the methods used for structuring and querying the archive. The documents are marked up in XML (eXtensible Markup Language). We begin with the DTD (Document Type Definition), which completely defines their structure. We then present the XSLT (eXtensible Stylesheet Language Transformations) stylesheets which define the user interface, select and organize the data to be furnished in response to queries, and define its format. Finally, we present the architecture of the system which makes the documents accessible to web clients using a standard browser.

2. Document structure: the DTD

The "LACITO DTD" defines a document type which we have called TEXT. (See sample 3-word TEXT.) A TEXT document is segmented into a hierarchy of levels, and each annotation is associated with a level. Contrary to common usage, we consider transcription as a kind of annotation -- this has to do with the nature of field linguistic study of unwritten languages. It is true that one of the transcriptions usually serves as the basis for the rest of the annotation.

The levels of segmentation recognized by the DTD are the <TEXT>, the utterance <S>, the word <W> and the morpheme <M>. The number and kind of levels of segmentation, and the appropriateness of assigning a particular type of annotation to a particular level, may depend on the purpose and the presuppositions of the linguist as well as on the nature of the language and the genre of the document; we would not necessarily expect to impose a standard here. Our current annotation is designed to facilitate the study of the documents as connected text, and as a resource for lexicography. The same sound resources and perhaps basic transcriptions could be exploited for phonetic research, but this would require a different annotation.
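The segmentation hierarchy can be illustrated with a minimal sketch. It assumes a hypothetical miniature TEXT document (the ids and the English sentence are invented for illustration; the forms and glosses are borrowed from the examples in section 2.4) and uses Python's standard library rather than the archive's own XSLT tools. It lists the annotation attached directly to each level.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TEXT document. Element names follow the paper;
# the ids and the S-level translation are illustrative assumptions.
SAMPLE = """<TEXT id="demo">
  <S id="demo_s1">
    <TRANSL lang="en">He hit him.</TRANSL>
    <W>
      <FORM>hiptu</FORM>
      <TRANSL lang="en">he.hit.him</TRANSL>
      <M><FORM>hipt</FORM><TRANSL lang="en">hit</TRANSL></M>
      <M><FORM>u</FORM><TRANSL lang="en" type="meta">3obj</TRANSL></M>
    </W>
  </S>
</TEXT>"""


def annotations_by_level(root):
    """Collect the FORM and TRANSL text attached directly to each S, W and M."""
    rows = []
    for level in ("S", "W", "M"):
        for segment in root.iter(level):
            # findtext with a bare tag name only looks at direct children,
            # so each annotation is reported at the level it belongs to.
            rows.append((level, segment.findtext("FORM"), segment.findtext("TRANSL")))
    return rows


rows = annotations_by_level(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

In this sample the <S> has a translation but no transcription of its own, which matches the common case described in section 2.3, where the S-level transcription is obtained by concatenating W-level forms.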

2.1 Metadata

Metadata -- essentially cataloguing information concerning the document as a whole -- is contained in a <HEADER> element at the TEXT-level. It is used by our user interface (defined in an XSLT stylesheet) to give access to the archive according to the documents' country of origin, language, and title. Our present, rudimentary cataloguing data structure is a temporary expedient which will be replaced when an appropriate common standard is agreed upon.

2.2 Time-alignment

The <AUDIO> element, used for time-anchoring, is empty, with the start and end offsets of the time-anchored segments coded as attributes. In principle, segments of any size can be time-anchored. We have chosen to time-anchor <S> elements as the smallest units of connected text: <S id="hayu1s1"><AUDIO start="2.3656" end="7.9256"/>...</S>.
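As a rough illustration of how a client or script might read these anchors, the following sketch extracts the offsets from the <S> element quoted above. It is written in Python for brevity and is not part of the archive software; the function name is hypothetical, and the offsets are assumed here to be expressed in seconds.

```python
import xml.etree.ElementTree as ET

# The <S> element quoted in the text, wrapped in a TEXT root.
DOC = """<TEXT>
  <S id="hayu1s1"><AUDIO start="2.3656" end="7.9256"/></S>
</TEXT>"""


def time_anchors(root):
    """Map the id of each time-anchored <S> to its (start, end) offsets."""
    anchors = {}
    for s in root.iter("S"):
        audio = s.find("AUDIO")
        if audio is not None:  # segments without an anchor are skipped
            anchors[s.get("id")] = (float(audio.get("start")), float(audio.get("end")))
    return anchors


anchors = time_anchors(ET.fromstring(DOC))
start, end = anchors["hayu1s1"]
duration = end - start  # length of the segment, assuming offsets in seconds
print(anchors, round(duration, 4))
```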

SoundIndex2, a tool which associates a text editor with a sound editor, has been developed in Tcl/Tk to facilitate the process of time-alignment on a variety of platforms. It takes as input any well-formed XML document regardless of DTD and associates time-anchoring elements with user-specified text elements (e.g. <S>, <W>, <FOO>). Document import and export are handled by an XML parser to guarantee well-formedness. (Sample screen.)

2.3 Transcription

Transcription is contained in <FORM> elements, which could in principle be associated with any level. In general, authors have not provided separate S- (or higher-)level transcriptions, this function being filled by the concatenation of W-level transcriptions, but one author has chosen to document fast-speech and sandhi phenomena in an S-level transcription. Most authors have provided a basic W-level transcription of the phonological word. An additional transcription at the M-level, in morphophonological transcription where this is useful, has been provided for some documents. This facilitates lexical searches. Form-class (etc.) information may be specified in the type attribute of the <M> element.

Many of our early documents were simply computerized versions of earlier publications in linguists' traditional 3-line interlinear format. In these "legacy" documents, only W-level transcription is provided, with some separable morphemes implicitly marked up by hyphens. This is adequate for straightforward browsing, but for lexical searches it is clearly preferable to have "go" explicitly marked up as an <M> element rather than to have to extract it from a <W> element like "going" or even "go-ing".
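The difference between explicit and implicit morpheme markup can be sketched as follows. The sample data and function names are hypothetical, and the hyphen-splitting fallback is only an assumption about how legacy documents might be searched, not a description of the archive's stylesheets.

```python
import xml.etree.ElementTree as ET

# Three hypothetical encodings of the same word: hyphenated legacy markup,
# unsegmented legacy markup, and explicit M-level markup.
DOC = """<S>
  <W><FORM>go-ing</FORM></W>
  <W><FORM>going</FORM></W>
  <W><FORM>going</FORM><M><FORM>go</FORM></M><M><FORM>ing</FORM></M></W>
</S>"""


def morphemes(w):
    """Return explicit <M> forms if present, else split the W form on hyphens."""
    explicit = [m.findtext("FORM") for m in w.findall("M")]
    if explicit:
        return explicit
    return w.findtext("FORM").split("-")


def words_containing(root, target):
    """List the W-level forms of all words that contain the target morpheme."""
    return [w.findtext("FORM") for w in root.iter("W") if target in morphemes(w)]


hits = words_containing(ET.fromstring(DOC), "go")
print(hits)
```

The unsegmented legacy word is missed by the search, which is the practical argument for explicit <M> elements.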

In our Nepal texts, loanwords from the national language are marked up as <FOREIGN> elements. A similar mechanism could be used to indicate personal or place-names, etc. This markup has the advantage of being usable when the marked-up item does not constitute a whole text segment -- for example when the loan is only part of a <W> element. In an M-level transcription, such items could be marked up using the type attribute of the <M> element.

2.4 Translations

Like transcriptions, translations (<TRANSL> elements) could be provided for at any level. In our documents, the S-level free translations are fundamental, on the assumption that these can be strung together to yield an acceptable TEXT-level translation. By contrast with transcriptions, there is no question of concatenating W- (or M-)level glosses to make a higher-level translation.

In general, lower-level glosses (also marked up as <TRANSL> elements) are provided (if at all) at only one of the W- or M- levels. In our texts, W-level glosses are often in relatively plain language (<W><FORM>hiptu</FORM><TRANSL>he.hit.him</TRANSL></W>) whereas M-level glosses tend to be more technical:
        <M><FORM type="pastem">hipt</FORM><TRANSL>hit</TRANSL></M>
        <M><FORM type="vsuffix">u</FORM><TRANSL type="meta">3obj</TRANSL></M>
Translations have the attribute lang for the target language, and may have the further attribute type="meta" to indicate metalanguage glosses, traditionally formatted in small caps.
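A simple interlinear view of the word quoted above can be sketched as follows. This is an illustrative Python analogue, not one of the archive's XSLT stylesheets; the function names are hypothetical, and upper case is used only as a plain-text stand-in for small caps.

```python
import xml.etree.ElementTree as ET

# The W-level and M-level markup quoted in the text, combined in one <W>.
WORD = """<W><FORM>hiptu</FORM><TRANSL>he.hit.him</TRANSL>
  <M><FORM type="pastem">hipt</FORM><TRANSL>hit</TRANSL></M>
  <M><FORM type="vsuffix">u</FORM><TRANSL type="meta">3obj</TRANSL></M>
</W>"""


def gloss(m):
    """Return the gloss of a morpheme, upper-casing metalanguage glosses."""
    transl = m.find("TRANSL")
    return transl.text.upper() if transl.get("type") == "meta" else transl.text


def interlinear(w):
    """Return (morpheme line, gloss line) for one word, joined by hyphens."""
    ms = w.findall("M")
    forms = "-".join(m.findtext("FORM") for m in ms)
    glosses = "-".join(gloss(m) for m in ms)
    return forms, glosses


forms, glosses = interlinear(ET.fromstring(WORD))
print(forms)
print(glosses)
```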

2.5 Other issues

Speech turns have not been marked up as explicit elements, but the speaker of an <S> may be indicated in a who attribute. Temporal overlap (the result of overlapping speech turns) cannot, in XML, be indicated by overlapping <S> elements; it is implied by overlapping time offset values:
        <S id="s1" who="A"><AUDIO start="0.00" end="2.00"/>
                <FORM>I haven't finished.</FORM></S>
        <S id="s2" who="B"><AUDIO start="1.00" end="3.00"/>
                <FORM>I've already started.</FORM></S>
Punctuation is somewhat problematic. It clearly has a place in orthographic transcription. Its place in phonological transcription is more questionable, but it does improve the readability of an <S> or larger element. It can be considered as a rudimentary kind of intonation-marking. We have invented an ad hoc coding for punctuation marks as independent <PUNC> elements which are hierarchical siblings of <W> elements. These can be ignored in linguistic processing.
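Both conventions described in this section (overlap implied by time offsets, and punctuation coded as <PUNC> siblings of <W>) can be illustrated with a short sketch. The sample utterances are invented, the function names are hypothetical, and placing the punctuation mark in the text content of <PUNC> is an assumption, since the exact coding is not specified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-speaker fragment with overlapping time offsets.
DOC = """<TEXT>
  <S id="s1" who="A"><AUDIO start="0.00" end="2.00"/>
    <W><FORM>wait</FORM></W><PUNC>,</PUNC><W><FORM>please</FORM></W><PUNC>.</PUNC></S>
  <S id="s2" who="B"><AUDIO start="1.00" end="3.00"/>
    <W><FORM>no</FORM></W><PUNC>.</PUNC></S>
</TEXT>"""


def overlapping_pairs(root):
    """Return pairs of <S> ids whose time-anchored intervals intersect."""
    spans = []
    for s in root.iter("S"):
        audio = s.find("AUDIO")
        spans.append((s.get("id"), float(audio.get("start")), float(audio.get("end"))))
    pairs = []
    for i, (id1, start1, end1) in enumerate(spans):
        for id2, start2, end2 in spans[i + 1:]:
            if start1 < end2 and start2 < end1:  # the two intervals intersect
                pairs.append((id1, id2))
    return pairs


def words_only(s):
    """Return the W-level forms of an utterance, ignoring <PUNC> siblings."""
    return [w.findtext("FORM") for w in s.findall("W")]


root = ET.fromstring(DOC)
pairs = overlapping_pairs(root)
first_words = words_only(root.find("S"))
print(pairs, first_words)
```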

3. Using XSLT as a query language

All processing of XML documents is performed using XSLT "stylesheets" and standard XSLT processors. The XSL extension mechanism is used to add functions to XSL where necessary.

The LACITO DTD defines a range of text annotations including what is sometimes called "interlinear text" -- which we regard as simply one possible view on multilingual glossed text. This and other views on XML documents structured according to the LACITO DTD are defined in XSLT stylesheets, which are available on our website. Here we will present XSLT stylesheets which define more general queries, extracting data and processing it in a flexible way.

An XML document is a text database with a tree structure. Operations like selecting, sorting and counting data -- defined for relational databases in languages like SQL -- may be defined for an XML document in XSLT. An XSLT stylesheet defines a transformation of an XML tree into a result tree, which may also contain formatting information such as (in the present case) HTML markup.

3.1 Data selection

Example 1: To list all the morphemes occurring in a document, we define an <xsl:template> element which searches the XML tree starting from the root until it reaches the <FORM> children of the <M> nodes. This is expressed by the XPath expression .//W/M/FORM, supplied as the value of the select attribute of an <xsl:for-each> element inside the template. The <xsl:copy-of> element then copies each selected node to the result tree.
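For readers without an XSLT processor at hand, the same selection can be approximated in Python, whose standard ElementTree module supports the limited XPath subset needed for this expression. The sample document and the function name are hypothetical; this is an analogue of the stylesheet, not the stylesheet itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-word document with M-level transcription.
DOC = """<TEXT><S>
  <W><M><FORM>hipt</FORM></M><M><FORM>u</FORM></M></W>
  <W><M><FORM>u</FORM></M></W>
</S></TEXT>"""


def morpheme_forms(root):
    """Select every M-level FORM, in document order, using the paper's path."""
    return [form.text for form in root.findall(".//W/M/FORM")]


selected = morpheme_forms(ET.fromstring(DOC))
print(selected)
```

Because the path requires an <M> parent, W-level <FORM> elements are not selected.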

3.2 Sorting

Example 2: Once the required data is selected, further processing can be envisaged. To sort the data, an <xsl:sort> element can be added to the stylesheet, producing a new, sorted result tree. XSLT provides the usual sorting options for numerical data and for character data in a variety of languages.
Linguists are never satisfied with the usual sort options, and may even sort and re-sort data to test phonological hypotheses. Since XSLT provides for extensions, we have added functions which use parameter files to define sort keys. The functions are written in Java, which provides reliable handling of Unicode strings.
Example 3: To sort in the order defined in a parameter file, we have defined the Java function mysort.mysortvalue(String s, String level). To call this function via the <xsl:sort> element, we need only replace the value . of the select attribute by the value number(java:mysort.mysortvalue(string(.), string('1'))). The <xsl:sort> element can be repeated for as many keys as are required, producing a new result tree. (The sort in the example classes aspirated initials after unaspirated ones and ignores vowel prosodies except to distinguish homographs.)
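The idea of a parameter-defined collation can be sketched independently of the Java extension. In the following Python analogue, the ORDER list stands in for the contents of a parameter file and is purely hypothetical, as are the sample words; the key function is not a port of mysort.mysortvalue, which returns a number for a given key level.

```python
# Hypothetical collation order: vowels first, then each unaspirated
# initial immediately followed by its aspirated counterpart.
ORDER = ["a", "i", "u", "k", "kh", "p", "ph"]
RANK = {unit: rank for rank, unit in enumerate(ORDER)}
# Try longer units first so that "kh" is not read as "k" followed by "h".
UNITS = sorted(ORDER, key=len, reverse=True)


def sort_key(word):
    """Translate a word into a list of ranks according to ORDER."""
    key, pos = [], 0
    while pos < len(word):
        for unit in UNITS:
            if word.startswith(unit, pos):
                key.append(RANK[unit])
                pos += len(unit)
                break
        else:
            key.append(len(ORDER))  # characters not in ORDER sort last
            pos += 1
    return key


words = ["pha", "ka", "pa", "kha", "ki"]
custom_sorted = sorted(words, key=sort_key)
default_sorted = sorted(words)
print(custom_sorted)
print(default_sorted)
```

With this order, all words in k- precede those in kh-, whereas the default code-point sort interleaves them.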
A further extension is required to suppress duplicate items from the result tree.
Example 4: To suppress duplicates, we have defined the Java function mydistinct.distinct(NodeIterator ni). All of the values of .//W/M/FORM are placed in the variable forms. In the element <xsl:for-each>, we replace the value .//W/M/FORM with the value java:mydistinct.distinct($forms), producing a new result tree.
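The effect of duplicate suppression can be sketched in a few lines of Python. This is an analogue of the behavior described, not the mydistinct.distinct function itself; keeping the first occurrence of each form in its original order is an assumption, and the function name is hypothetical.

```python
def distinct(forms):
    """Drop repeated forms, keeping the first occurrence of each in order."""
    # dict preserves insertion order, so this is an order-preserving dedupe.
    return list(dict.fromkeys(forms))


forms = ["hipt", "u", "u", "hipt", "ko"]
unique_forms = distinct(forms)
print(unique_forms)
```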

3.3 Formatting HTML output

Example 5: To prepare our wordlist for presentation to the client web browser, we add HTML tags (<html>, <head>, <body>, <ul>, <li>, etc.) to our output. The new result tree can be interpreted by a web browser.
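The shape of such an HTML result can be sketched as follows. The function name and page title are hypothetical, and the markup produced is only an approximation of what the archive's stylesheet emits.

```python
from html import escape


def wordlist_html(forms, title="Morpheme index"):
    """Wrap a list of forms in a minimal HTML page with one <li> per form."""
    # escape() protects characters such as < and & that may occur in data.
    items = "\n".join(f"    <li>{escape(form)}</li>" for form in forms)
    return (
        f"<html>\n<head><title>{escape(title)}</title></head>\n"
        f"<body>\n  <ul>\n{items}\n  </ul>\n</body>\n</html>"
    )


page = wordlist_html(["hipt", "u"])
print(page)
```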

3.4 Browsing and further querying

Example 6: We would like the user to be able to follow up by selecting an item from the list of distinct morphemes and passing it as a parameter to a stylesheet which defines a further processing operation, in this case a morpheme search. To this end, we place each list item in an anchor. When an item is selected by the user, its value is passed to the find.xsl stylesheet, and the XSLT processor is called to execute the search. The anchored list items have the following form: <a href="/servlets/xslservlet?XML=data.xml&XSL=find.xsl&WHAT=xyz">xyz</a>.
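Constructing such anchors programmatically raises two practical points: the morpheme must be URL-encoded (transcriptions often contain non-ASCII characters), and the ampersands in the href must be escaped in the generated HTML. The following Python sketch illustrates both. The servlet path and parameter names are taken from the example above; the function name is hypothetical, and percent-encoding the parameter as UTF-8 is an assumption about what the servlet expects.

```python
from html import escape
from urllib.parse import urlencode

SERVLET = "/servlets/xslservlet"  # path as given in the example above


def morpheme_anchor(form, xml="data.xml", xsl="find.xsl"):
    """Build an HTML anchor which passes the morpheme to the search stylesheet."""
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default.
    query = urlencode({"XML": xml, "XSL": xsl, "WHAT": form})
    # Escape & as &amp; so that the attribute value is valid HTML.
    href = escape(f"{SERVLET}?{query}", quote=True)
    return f'<a href="{href}">{escape(form)}</a>'


anchor = morpheme_anchor("xyz")
print(anchor)
print(morpheme_anchor("ŋa"))
```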

4. Architecture for web dissemination

For disseminating the archived material we have chosen to rely as far as possible on standard web technology, adopting the web browser as the sole access tool. Standard web technology -- browsers and HTML -- gives immediate access to a range of capacities, in principle independently of the platform.

Web dissemination implies a division of labor between a web server and a web client. The functions of each are outlined below; their interaction is represented schematically in a rough flow-chart.

4.1 Server-side

All data treatments on the server that were previously (cf. Jacobson et al. 2001) handled by Perl scripts, XML-QL, etc., are now handled by XSLT stylesheets. The CGI interface has been replaced by the servlet and by an XSLT-defined interface.

4.2 Client-side

The applet and the JavaScript script are explained in detail in Jacobson et al. 2001.

Conclusion

The key to the LACITO Archive project has been the use of standard data formats and structures, in particular Unicode and XML, which have given access to the use of standard multi-platform software tools. This has permitted us to concentrate our own efforts on designing parameters to adapt standard software to our particular needs rather than on developing generic applications.

All archive documents are structured in XML. We have presented the LACITO DTD, which covers a range of text annotations including time-alignment, transcriptions and translations at different levels of segmentation (granularities). This DTD has a few idiosyncrasies related to particular linguists' analyses of particular language data, but its general structure is straightforward. Adherence to the principle of logically structured text makes it relatively easy to transform the markup if desired. Our only authoring tool, SoundIndex, is DTD-independent.

After experimenting with CGI scripts, non-standard XML query languages, etc., we are currently handling all data processing using an XSLT processor and stylesheets. Here again, standardization has led to the availability of excellent XSLT processors which can be extended to handle additional functions as needed. In the present paper, we have concentrated on stylesheets which define data extraction and ordering operations in XSLT -- that is, on the use of XSLT as a query language. This is the area in which we have made the most progress recently. We have presented for the workshop repository an example of a morpheme index, progressively incorporating minor extensions to the XSLT standard. In the future, we expect to pursue these techniques in linking text documents to lexicons. Two lexicons of Nepal languages have recently been converted to XML in prototype form.

Finally, in the area of web dissemination we have adopted the web browser as our sole access tool and added the functions required for exploiting sound resources, an area not yet covered by web standards. We have presented above the software elements which constitute the web architecture of the LACITO Archive.

Reference:
Jacobson, Michel, Boyd Michailovsky, and John B. Lowe. 2001. Linguistic documents synchronizing sound and text. Speech Communication, special issue: Speech annotation and corpus tools. [Preprint: http://lacito.archivage.vjf.cnrs.fr/article.pdf]