XML tools for managing linguistic data:
The LACITO Archives Project

Michel Jacobson, CNRS/LACITO

CNRS/LACITO
7, rue Guy Moquet, Bat 23, 94800 Villejuif, France
jacobson@idf.ext.jussieu.fr | lacito.vjf.cnrs.fr


The members of the LACITO (Langues et Civilisations à Tradition Orale) research group of the French CNRS have over the past decades amassed large collections of linguistic field data including sound recordings of mainly unwritten languages and annotation in the form of phonological transcriptions, translations, ethnographic notes, etc. The goal of the LACITO archiving project is to conserve this material, and to make it available for research using modern methods.

The principal steps are digitization, structuring, browsing and querying of the archives, and dissemination.

Digitization

The materials are of two types, sound recordings and textual annotation.

Digitization of the sound recordings is routine. We archive the sound on CD-ROM, with the usual CD audio parameters of 44.1 kHz, 16 bits (mono or stereo in accordance with the original recording).
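As a rough illustration of what these parameters imply for storage, the following sketch computes the size of uncompressed PCM audio. The function name and the example durations are assumptions made for illustration; container overhead (file headers, etc.) is ignored.

```python
def pcm_bytes(seconds, channels, sample_rate=44_100, bits=16):
    """Size in bytes of uncompressed PCM audio (headers not counted)."""
    bytes_per_sample = bits // 8
    return seconds * sample_rate * bytes_per_sample * channels

one_second_mono = pcm_bytes(1, channels=1)
one_minute_stereo = pcm_bytes(60, channels=2)
print(one_second_mono, one_minute_stereo)
```

At these settings a stereo recording needs twice the space of a mono one, which is one reason to keep the channel count of the original recording.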

Among the textual materials, we will be mainly concerned with transcriptions and translations whose segments are indexed in alignment with the sound recordings. Other annotations, such as word or morpheme glosses, may be structurally dependent on these. Still others may relate to whole texts or to collections of texts -- these might include ethnographic information, dictionaries, etc.

The Unicode standard has been adopted for coding. This coding has the advantages of standardization and relative universality, but problems of rendering arise with some transcriptions. At present we do not have a complete system for rendering Unicode-coded text, but standard IPA transcriptions are rendered satisfactorily. (Some of our texts require Devanagari or Burmese transcriptions which we cannot yet render with Unicode.)

Structuring

The sound recordings have an implicit temporal structure.

The textual annotation is explicitly marked up with tags indicating the type of information represented, e.g. that a given string is a translation, that it corresponds to a given segment of transcription, that the target language is French, etc. The annotation has an essentially hierarchical structure, whose levels are, from highest to lowest, the text, the utterance, and the word.

Any element of a given level is properly included in an element of the hierarchically superior level, and may have level-specific elements and attributes. For example, the text may have attributes such as its language, title, associated sound resource, etc., and include utterances. The utterance, made up of words, may have associated translations, temporal indexes into the sound resource and the identity of the speaker. Words have a transcription, a gloss, etc. (Some of the information currently at the word level will eventually be displaced to the morpheme level.) This structure can be represented as a tree. Although utterances of different speakers may overlap temporally, the marked-up text elements used to represent them do not overlap structurally.
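The hierarchy described above can be sketched as a small XML tree. This is only an illustration: the element and attribute names (TEXT, S, W, FORM, GLOSS, TRANSL, who, start, end) are assumptions, not the project's actual markup, and the word forms are made-up placeholder strings rather than real language data.

```python
import xml.etree.ElementTree as ET

def build_sample_text():
    """Build a text > utterance > word tree with hypothetical tag names."""
    # Text level: language, title and associated sound resource as attributes.
    text = ET.Element("TEXT", {"lang": "xx", "title": "Sample", "sound": "sample.wav"})
    # Utterance level: speaker identity and temporal indices into the sound.
    s = ET.SubElement(text, "S", {"id": "s1", "who": "A", "start": "0.00", "end": "2.35"})
    # Word level: a transcription and a gloss for each word.
    for form, gloss in [("aaa", "gloss-1"), ("bbb", "gloss-2")]:
        w = ET.SubElement(s, "W")
        ET.SubElement(w, "FORM").text = form
        ET.SubElement(w, "GLOSS").text = gloss
    # A translation attached to the utterance, marked with its target language.
    ET.SubElement(s, "TRANSL", {"lang": "fr"}).text = "traduction"
    return text

sample = build_sample_text()
print(ET.tostring(sample, encoding="unicode"))
```

Because temporal indices are stored as attributes of the utterance, two utterances by different speakers can overlap in time while remaining separate, properly nested elements.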

The main requirement for the production of documents with aligned text and sound is the association, with one or more levels of the annotation, of temporal indices into the sound resource. Our spontaneous speech documents are aligned at the utterance level. To facilitate alignment, we have developed SoundIndex, a tool associating a sound editor with a text editor. The user listens to the sound and places index markers on a waveform; the temporal indices are recorded by the program.
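The bookkeeping behind utterance-level alignment can be sketched as follows. This is not a description of SoundIndex itself; it assumes the simplest case, in which utterances are contiguous and non-overlapping, so that n utterances are delimited by n+1 boundary markers. The function name and record layout are hypothetical.

```python
def align_utterances(utterance_ids, markers):
    """Pair n utterances with n+1 boundary markers (times in seconds)."""
    times = sorted(markers)
    if len(times) != len(utterance_ids) + 1:
        raise ValueError("need exactly one more marker than utterances")
    # Each utterance runs from its own marker to the next one.
    return [
        {"id": uid, "start": times[i], "end": times[i + 1]}
        for i, uid in enumerate(utterance_ids)
    ]

aligned = align_utterances(["s1", "s2", "s3"], [0.0, 2.35, 5.1, 7.8])
for record in aligned:
    print(record)
```

Overlapping speech would need an independent start and end time per utterance rather than shared boundaries.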

The markup language used is XML. In some cases, this markup may be generated (together with Unicode coding) by script from explicitly or implicitly structured text materials produced by the linguist researchers.
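A minimal sketch of such a conversion script is given below. It assumes one particular input convention, which is an assumption and not the project's: utterances separated by blank lines, each with a transcription line followed by a translation line. The tag names are likewise hypothetical, and the sample strings are placeholders.

```python
import xml.etree.ElementTree as ET

def blocks_to_xml(raw, lang="fr"):
    """Convert blank-line-separated two-line blocks into an XML tree."""
    root = ET.Element("TEXT")
    blocks = [b for b in raw.strip().split("\n\n") if b.strip()]
    for n, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) != 2:
            raise ValueError(f"block {n}: expected transcription + translation")
        transcription, translation = lines
        s = ET.SubElement(root, "S", {"id": f"s{n}"})
        # Whitespace-separated tokens of the transcription become words.
        for form in transcription.split():
            ET.SubElement(s, "W").text = form
        ET.SubElement(s, "TRANSL", {"lang": lang}).text = translation
    return root

raw = "aaa bbb\nfirst translation\n\nccc\nsecond translation\n"
root = blocks_to_xml(raw)
print(ET.tostring(root, encoding="unicode"))
```

Python strings are Unicode, so any recoding from a legacy font encoding would be a separate mapping step applied before this one.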

Browsing and querying of the archives

The objective of the project is to make the materials accessible to researchers, preserving the link between textual annotation and sound. That is, the transcription and translation may be browsed, with the sound immediately accessible when the user selects an utterance (or other aligned element). Or, in answer to a query, a list of all the occurrences of a given word, in context, may be browsed, and again the sound of the containing utterances is immediately accessible.

The user may choose among predetermined views of the data, which are defined in XSL stylesheets. These offer, for example, either the transcription alone, or the transcription with associated translation in one of the languages into which a translation is available, or transcription with word-for-word aligned interlinear glosses (if available), etc. At present, the stylesheets deliver the view required in the form of HTML, which is interpreted by a standard browser. The associated sound is managed by an applet which is called by a script in the HTML.
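The idea of predetermined views can be illustrated with a Python stand-in. The project uses XSL stylesheets for this; the function below is not XSL, only an analogue showing how one XML source yields different HTML views. Tag names, the sample document and the function name are all assumptions.

```python
import html
import xml.etree.ElementTree as ET

SAMPLE = """<TEXT>
  <S id="s1"><FORM>aaa bbb</FORM><TRANSL lang="fr">premier</TRANSL><TRANSL lang="en">first</TRANSL></S>
  <S id="s2"><FORM>ccc</FORM><TRANSL lang="fr">second</TRANSL></S>
</TEXT>"""

def render_view(root, translation_lang=None):
    """Render transcription alone, or with a translation in one language."""
    rows = []
    for s in root.iter("S"):
        utt_id = html.escape(s.get("id", ""), quote=True)
        cells = ["<td>" + html.escape(s.findtext("FORM", default="")) + "</td>"]
        if translation_lang is not None:
            transl = s.find("TRANSL[@lang='" + translation_lang + "']")
            # Leave the cell empty when no translation exists in that language.
            text = transl.text if transl is not None and transl.text else ""
            cells.append("<td>" + html.escape(text) + "</td>")
        rows.append('<tr id="' + utt_id + '">' + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

doc = ET.fromstring(SAMPLE)
print(render_view(doc))          # transcription only
print(render_view(doc, "en"))    # transcription with English translation
```

Each row carries the utterance identifier, which is the kind of hook a page script could use to request the matching stretch of sound.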

The user either browses a single text, or submits a query which defines the elements to be extracted either from a single text or from a group of texts, for example, from all the texts in a given language (information recorded in the XML document headers). The query is passed to a processor which permits the selection of elements of XML documents based on their structural properties. There is no accepted standard query language for XML documents, but there are several more or less developed processors currently available. For linguistic work it is important to provide for queries based not only on the identification of structural elements, but also on pattern matching on (Unicode) strings; we use a regular expression package for this purpose.
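A query combining structural selection with pattern matching might look like the sketch below. It uses Python's standard ElementTree and re modules as stand-ins; it is not the query processor or regular expression package used by the project. Tag names, attributes and the word forms in the sample documents are invented placeholders.

```python
import re
import xml.etree.ElementTree as ET

def find_utterances(texts, pattern, lang=None):
    """Return utterances containing a word that matches a regular expression."""
    regex = re.compile(pattern)
    hits = []
    for root in texts:
        # Structural filter: keep only texts in the requested language.
        if lang is not None and root.get("lang") != lang:
            continue
        for s in root.iter("S"):
            forms = [w.text or "" for w in s.iter("W")]
            # String filter: at least one word must match the pattern.
            if any(regex.search(f) for f in forms):
                hits.append((s.get("id"), " ".join(forms), s.get("start"), s.get("end")))
    return hits

doc_a = ET.fromstring(
    '<TEXT lang="xx">'
    '<S id="a1" start="0.0" end="1.5"><W>ŋaba</W><W>tiŋ</W></S>'
    '<S id="a2" start="1.5" end="3.0"><W>kula</W></S>'
    '</TEXT>'
)
doc_b = ET.fromstring(
    '<TEXT lang="yy"><S id="b1" start="0.0" end="2.0"><W>ŋoro</W></S></TEXT>'
)

# Words beginning with a velar nasal, in all texts and then in one language.
print(find_utterances([doc_a, doc_b], r"^ŋ"))
print(find_utterances([doc_a, doc_b], r"^ŋ", lang="xx"))
```

Each hit keeps the utterance's temporal indices, so the result list can lead back to the sound of the containing utterance.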

Specifically linguistic tools, for statistical analysis, concordancing, etc., can be written in any language which provides a DOM or SAX interface to XML.
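As one example of such a tool, here is a minimal word-frequency counter written against Python's SAX interface. It assumes, hypothetically, that each word is the text content of a W element; the class and function names are illustrative only.

```python
import xml.sax
from collections import Counter

class WordCounter(xml.sax.ContentHandler):
    """Count the text content of every W element in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._in_word = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "W":
            self._in_word = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one word in several chunks, so accumulate them.
        if self._in_word:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "W":
            word = "".join(self._buffer).strip()
            if word:
                self.counts[word] += 1
            self._in_word = False

def count_words(xml_text):
    handler = WordCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts

counts = count_words("<TEXT><S><W>aaa</W><W>bbb</W></S><S><W>aaa</W></S></TEXT>")
print(counts.most_common())
```

A SAX handler processes the document as a stream, which suits large collections; a DOM-based version would be simpler to write when whole texts fit in memory.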

Dissemination

The architecture adopted relies entirely on web-based technologies. The materials are accessible (experimentally) on a web server, with a CGI interface, and the responses are accessible to any client with a standard browser (with the installation of a few extra elements such as the sound-managing tool). To reduce bandwidth requirements, sound data is disseminated over the web in compressed MP3 format.

For more information about the project: http://lacito.vjf.cnrs.fr/ARCHIVAG/ENGLISH.htm


Linguistic Exploration Workshop