A Linguistic Archive on the Web
Michel Jacobson and Boyd Michailovsky
LACITO/CNRS
7 rue Guy Môquet
94800 Villejuif, France.
jacobson@idf.ext.jussieu.fr
boydm@vjf.cnrs.fr
Abstract
The LACITO Archive contains sound recordings of connected speech in little-known
languages with associated annotation (transcriptions, glosses, translations),
including time-alignment data linking the annotation segment by segment
to the recordings. When a document is browsed or queried, simultaneous
access is provided to the recorded sound.
The fundamental technical options adopted in designing the archive are
the use of standard XML and XML tools, Unicode, and standard browsers.
All archived text materials are coded in XML, conformant to a DTD. Views
on the documents when they are browsed or queried are defined by XSLT stylesheets,
which select and format material from the text database and return it to
clients running standard web browsers.
All software developed for the archive is open-source. We present three
aspects of the archive here: (1) the document data structure, defined by
a DTD, (2) the use of XSLT as a query language, and (3) the client-server
architecture, including the software which permits the client browser to
access the sound resource.
1. Introduction
The LACITO Archive contains sound recordings and associated annotation
(transcription, translations, etc.) of connected linguistic texts (e.g.
oral literature, personal narratives, conversations) in little-known, often
endangered languages. The purpose of the archive is to conserve these documents
in a format adapted to modern research methods, and to make both the documents
and the tools developed for their exploitation widely available to the
academic and other interested communities.
The archived materials -- text and sound -- are accessed using standard
web browsers. Browsing gives access to a choice of views on the data (transcription
with or without translations or interlinear glosses, translations in either
French or English), either sequentially or in response to other queries
(all utterances containing a selected morpheme, etc.). In all cases, simultaneous
access is provided to the recorded sound.
The LACITO Archive at present comprises a hundred or so texts in a score
of languages, mainly languages of New Caledonia and of Nepal, and is accessible
on the LACITO intranet. A dozen texts in five languages are available for
public browsing on the archive website (http://lacito.archive.vjf.cnrs.fr).
These texts and the archive software can be downloaded for local use.
In this paper we describe the methods used for structuring and
querying the archive. The documents are marked up in XML (eXtensible Markup
Language). We begin with the DTD (Document Type Definition), which completely
defines their structure. We then present the XSLT (eXtensible Stylesheet
Language Transformations) stylesheets which define the user interface,
select and organize the data to be furnished in response to queries, and
define its format. Finally we present the architecture of the system which
makes the documents accessible to web clients using a standard browser.
2. Document structure: the DTD
The "LACITO DTD" defines a document type which we
have called TEXT. (See sample 3-word TEXT.)
A TEXT document is segmented into a hierarchy of levels, and each annotation
is associated with a level. Contrary to common usage, we consider transcription
as a kind of annotation -- this has to do with the nature of field linguistic
study of unwritten languages. It is true that one of the transcriptions
usually serves as the basis for the rest of the annotation.
The levels of segmentation recognized by the DTD are the <TEXT>,
the utterance <S>, the word <W> and the morpheme
<M>. The number and kind of levels of segmentation, and the
appropriateness of assigning a particular type of annotation to a particular
level, may depend on the purpose and the presuppositions of the linguist
as well as on the nature of the language and the genre of the document;
we would not necessarily expect to impose a standard here. Our current
annotation is designed to facilitate the study of the documents as connected
text, and as a resource for lexicography. The same sound resources and
perhaps basic transcriptions could be exploited for phonetic research,
but this would require a different annotation.
2.1 Metadata
Metadata -- essentially cataloguing information concerning the document
as a whole -- is contained in a <HEADER> element at the TEXT-level.
It is used by our user interface (defined in an XSLT stylesheet) to give
access to the archive according to the documents' country of origin, language,
and title. Our present, rudimentary cataloguing data structure is a temporary
expedient which will be replaced when an appropriate common standard is
agreed upon.
2.2 Time-alignment
The <AUDIO> element, used for time-anchoring, is empty, with
the start and end offsets of the time-anchored segments coded as attributes.
In principle, segments of any size can be time-anchored. We have chosen
to time-anchor <S> elements as the smallest units of connected
text: <S id="hayu1s1"><AUDIO start="2.3656" end="7.9256"/>...</S>.
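As an illustration of how such time anchors might be read back from a document, the following is a minimal sketch using the standard Java DOM parser. The class name TimeAnchors and the sample data are assumptions made for this example; they are not part of the archive software.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the time anchors of <S> elements.
class TimeAnchors {
    /** One time-anchored utterance: its id and offsets in seconds. */
    static final class Anchor {
        final String id;
        final double start;
        final double end;
        Anchor(String id, double start, double end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    // Sample data (assumed): one anchored <S> and one without an anchor.
    static final String SAMPLE = "<TEXT>"
        + "<S id=\"hayu1s1\"><AUDIO start=\"2.3656\" end=\"7.9256\"/><FORM>x</FORM></S>"
        + "<S id=\"hayu1s2\"><FORM>y</FORM></S>"
        + "</TEXT>";

    /** Returns an Anchor for every <S> that contains an <AUDIO> element. */
    static List<Anchor> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList sentences = doc.getElementsByTagName("S");
            List<Anchor> anchors = new ArrayList<>();
            for (int i = 0; i < sentences.getLength(); i++) {
                Element s = (Element) sentences.item(i);
                NodeList audio = s.getElementsByTagName("AUDIO");
                if (audio.getLength() == 0) {
                    continue; // this utterance is not time-anchored
                }
                Element a = (Element) audio.item(0);
                anchors.add(new Anchor(s.getAttribute("id"),
                    Double.parseDouble(a.getAttribute("start")),
                    Double.parseDouble(a.getAttribute("end"))));
            }
            return anchors;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        for (Anchor a : read(SAMPLE)) {
            System.out.println(a.id + ": " + a.start + " to " + a.end);
        }
    }
}
```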
SoundIndex, a tool which associates a text editor with a sound editor,
has been developed in Tcl/Tk to facilitate the process of time-alignment
on a variety of platforms. It takes as input any well-formed XML document
regardless of DTD and associates time-anchoring elements with user-specified
text elements (e.g. <S>, <W>, <FOO>). Document import
and export is handled by an XML parser to guarantee well-formedness. (Sample
screen.)
2.3 Transcription
Transcription is contained in <FORM> elements, which could
in principle be associated with any level. In general, authors have not
provided separate S- (or higher-)level transcriptions, this function being
filled by the concatenation of W-level transcriptions, but one author has
chosen to document fast-speech and sandhi phenomena in an S-level
transcription. Most authors have provided a basic W-level transcription
of the phonological word. An additional M-level transcription, morphophonological
where this is useful, has been provided
for some documents. This facilitates lexical searches. Form-class (etc.)
information may be specified in the type attribute of the <M>
element.
Many of our early documents were simply computerized versions of earlier
publications in linguists' traditional 3-line interlinear format. In these
"legacy" documents, only W-level transcription is provided, with some separable
morphemes implicitly marked up by hyphens. This is adequate for straightforward
browsing, but for lexical searches it is clearly preferable to have "go"
explicitly marked up as an <M> element rather than to have to extract
it from a <W> element like "going" or even "go-ing".
In our Nepal texts, loanwords from the national language are marked
up as <FOREIGN> elements. A similar mechanism could be used
to indicate personal or place-names, etc. This markup has the advantage
of being usable when the marked-up item does not constitute a whole text
segment -- for example when the loan is only part of a <W>
element. In an M-level transcription, such items could be marked up using
the type attribute of the <M> element.
2.4 Translations
Like transcriptions, translations (<TRANSL> elements) could
be provided at any level. In our documents, the S-level free translations
are fundamental, on the assumption that these can be strung together to
yield an acceptable TEXT-level translation. Unlike W-level transcriptions,
W- (or M-)level glosses cannot be concatenated to make a higher-level
translation.
In general, lower-level glosses (also marked up as <TRANSL>
elements) are provided (if at all) at only one of the W- or M- levels.
In our texts, W-level glosses are often in relatively plain language (<W><FORM>hiptu</FORM><TRANSL>he.hit.him</TRANSL></W>)
whereas M-level glosses tend to be more technical:
<M><FORM type="pastem">hipt</FORM><TRANSL>hit</TRANSL></M>
<M><FORM type="vsuffix">u</FORM><TRANSL type="meta">3obj</TRANSL></M>
Translations have the attribute lang for the target language,
and may have the further attribute type="meta" to indicate metalanguage
glosses, traditionally formatted in small caps.
2.5 Other issues
Speech turns have not been marked up as explicit elements, but the
speaker of an <S> may be indicated in a who attribute.
Temporal overlap (the result of overlapping speech turns) cannot, in XML,
be indicated by overlapping <S> elements; it is implied by
overlapping time offset values:
<S id="s1" who="A"><AUDIO start="0.00" end="2.00"/>
<FORM>I haven't finished.</FORM></S>
<S id="s2" who="B"><AUDIO start="1.00" end="3.00"/>
<FORM>I've already started.</FORM></S>
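A program reading the document can recover the overlap from the offsets alone. The sketch below assumes the offsets have already been read into simple objects; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds utterances whose time spans overlap.
class TurnOverlap {
    /** A time-anchored utterance: id, speaker, and offsets in seconds. */
    static final class Utterance {
        final String id;
        final String who;
        final double start;
        final double end;
        Utterance(String id, String who, double start, double end) {
            this.id = id;
            this.who = who;
            this.start = start;
            this.end = end;
        }
    }

    /** Two spans overlap when each one starts before the other ends. */
    static boolean overlaps(Utterance a, Utterance b) {
        return a.start < b.end && b.start < a.end;
    }

    /** Returns "id1+id2" for every overlapping pair, in document order. */
    static List<String> overlappingPairs(List<Utterance> utterances) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < utterances.size(); i++) {
            for (int j = i + 1; j < utterances.size(); j++) {
                if (overlaps(utterances.get(i), utterances.get(j))) {
                    pairs.add(utterances.get(i).id + "+" + utterances.get(j).id);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Utterance> turns = Arrays.asList(
            new Utterance("s1", "A", 0.00, 2.00),
            new Utterance("s2", "B", 1.00, 3.00));
        System.out.println(overlappingPairs(turns));
    }
}
```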
Punctuation is somewhat problematic. It clearly has a place in orthographic
transcription. Its place in phonological transcription is more questionable,
but it does improve the readability of an <S> or larger element.
It can be considered as a rudimentary kind of intonation-marking. We have
invented an ad hoc coding for punctuation marks as independent <PUNC>
elements which are hierarchical siblings of <W> elements. These
can be ignored in linguistic processing.
3. Using XSLT as a query language
All processing of XML documents is performed using XSLT "stylesheets" and
standard XSLT processors. The XSL extension mechanism is used to add functions
to XSL where necessary.
The LACITO DTD defines a range of text annotations including what is
sometimes called "interlinear text" -- which we regard as simply one possible
view on multilingual glossed text. This and other views on XML documents
structured according to the LACITO DTD are defined in XSLT stylesheets,
which are available on our website. Here we will present XSLT stylesheets
which define more general queries, extracting data and processing it in
a flexible way.
An XML document is a text database with a tree structure. Operations
like selecting, sorting and counting data -- defined for relational databases
in languages like SQL -- may be defined for an XML document in XSLT. An
XSLT stylesheet defines a transformation of an XML tree into a result tree,
which may also contain formatting information such as (in the present case)
HTML markup.
3.1 Data selection
- Example 1: To list all the morphemes occurring in
a document, we define an <xsl:template> element containing an
<xsl:for-each> element, which searches the XML tree from the root
down to the <FORM> children of the <M> nodes. This is expressed by
the XPath expression .//W/M/FORM, supplied as the value of the select
attribute of the <xsl:for-each> element. The <xsl:copy-of> element
then copies each selected node to the result tree.
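A selection of this kind can be sketched as follows. The stylesheet is a reconstruction written for this illustration, not the stylesheet distributed with the archive; it is run here through the JDK's built-in XSLT processor via the standard javax.xml.transform API. The class name, the FORMS wrapper element and the sample data are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: stylesheet, class name and sample data are assumptions.
class MorphemeList {
    // Copies every FORM child of an M element into a FORMS wrapper.
    static final String XSL = String.join("\n",
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>",
        "  <xsl:template match=\"/\">",
        "    <FORMS>",
        "      <xsl:for-each select=\".//W/M/FORM\">",
        "        <xsl:copy-of select=\".\"/>",
        "      </xsl:for-each>",
        "    </FORMS>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    static final String SAMPLE = "<TEXT><S id=\"s1\"><W>"
        + "<M><FORM>hipt</FORM><TRANSL>hit</TRANSL></M>"
        + "<M><FORM>u</FORM><TRANSL>3obj</TRANSL></M>"
        + "</W></S></TEXT>";

    /** Applies the stylesheet text to the document text and returns the result. */
    static String apply(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(apply(XSL, SAMPLE));
    }
}
```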
3.2 Sorting
- Example 2: Once the required data is selected, further
processing can be envisaged. To sort the data, an <xsl:sort>
element can be added to the stylesheet, producing a new, sorted result
tree. XSLT provides the usual sorting options for numerical data and
for character data in a variety of languages.
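The effect of adding <xsl:sort> can be sketched as below. This is a hypothetical stylesheet using the default text sort, run through the JDK's built-in XSLT processor; the class name and sample forms are assumptions, and plain text output is used only to keep the result easy to inspect.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: stylesheet, class name and sample data are assumptions.
class SortedMorphemes {
    // Writes each selected form on its own line, sorted as text.
    static final String XSL = String.join("\n",
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"text\"/>",
        "  <xsl:template match=\"/\">",
        "    <xsl:for-each select=\".//W/M/FORM\">",
        "      <xsl:sort select=\".\"/>",
        "      <xsl:value-of select=\".\"/>",
        "      <xsl:text>&#10;</xsl:text>",
        "    </xsl:for-each>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    static final String SAMPLE = "<TEXT><S id=\"s1\">"
        + "<W><M><FORM>hipt</FORM></M><M><FORM>u</FORM></M></W>"
        + "<W><M><FORM>bal</FORM></M></W>"
        + "</S></TEXT>";

    /** Returns the forms of the document in sorted order. */
    static List<String> sortedForms(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return Arrays.asList(out.toString().trim().split("\n"));
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortedForms(SAMPLE));
    }
}
```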
Linguists are never satisfied with the usual sort options; they may sort
and re-sort data to test phonological hypotheses. Since XSLT provides
for extensions, we have added functions which facilitate the use of parameter
files to define sort keys. The functions are written in Java, which provides
reliable handling of Unicode strings.
- Example 3: To sort in the order defined in a parameter
file, we have defined the Java function mysort.mysortvalue(String
s, String level). To call this function via the <xsl:sort>
element, we need only replace the value . of the select
attribute by the value number(java:mysort.mysortvalue(string(.), string('1'))).
The <xsl:sort> element can be repeated for as many keys as
are required, producing a new result tree. (The sort
in the example classes aspirated initials after unaspirated ones and ignores
vowel prosodies except to distinguish homographs.)
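The idea of a user-defined sort order can be illustrated in plain Java. The sketch below is not the mysort function described above: instead of returning a numeric key for <xsl:sort>, it uses a Java Comparator, and an in-memory list stands in for the parameter file. The class name, the list of units and the sample forms are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sorts forms by a user-defined order of sort units,
// where a unit may be more than one character long (e.g. "kh").
class OrderedAlphabetSort {
    private final List<String> units; // sort units, in the desired order

    OrderedAlphabetSort(List<String> unitsInOrder) {
        this.units = new ArrayList<>(unitsInOrder);
    }

    /** Splits a form into unit ranks, always taking the longest matching unit. */
    List<Integer> ranks(String form) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        while (pos < form.length()) {
            int bestRank = units.size(); // unlisted characters sort last
            int bestLen = 0;
            for (int r = 0; r < units.size(); r++) {
                String u = units.get(r);
                if (u.length() > bestLen && form.startsWith(u, pos)) {
                    bestRank = r;
                    bestLen = u.length();
                }
            }
            if (bestLen == 0) {
                // Skip one whole code point so surrogate pairs stay intact.
                bestLen = Character.charCount(form.codePointAt(pos));
            }
            out.add(bestRank);
            pos += bestLen;
        }
        return out;
    }

    /** Compares two forms rank by rank; a proper prefix sorts first. */
    Comparator<String> comparator() {
        return (a, b) -> {
            List<Integer> ra = ranks(a);
            List<Integer> rb = ranks(b);
            for (int i = 0; i < Math.min(ra.size(), rb.size()); i++) {
                int c = Integer.compare(ra.get(i), rb.get(i));
                if (c != 0) {
                    return c;
                }
            }
            return Integer.compare(ra.size(), rb.size());
        };
    }

    /** Returns a sorted copy of the forms. */
    static List<String> sort(List<String> unitsInOrder, List<String> forms) {
        List<String> sorted = new ArrayList<>(forms);
        sorted.sort(new OrderedAlphabetSort(unitsInOrder).comparator());
        return sorted;
    }

    public static void main(String[] args) {
        // Aspirated initials ("kh", "ph") follow their unaspirated counterparts.
        List<String> order = Arrays.asList("a", "i", "u", "k", "kh", "p", "ph");
        System.out.println(sort(order, Arrays.asList("pha", "khi", "pa", "ka", "ku")));
    }
}
```

With this order, "ku" is placed before "khi", whereas a plain alphabetical sort would place "khi" first.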
A further extension is required to suppress duplicate items from the result
tree.
- Example 4: To suppress duplicates, we have defined
the Java function mydistinct.distinct(NodeIterator ni).
All of the values of .//W/M/FORM are placed in the
variable forms. In the element <xsl:for-each>, we
replace the value .//W/M/FORM with the value java:mydistinct.distinct($forms),
producing a new result tree.
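The effect of duplicate suppression can be sketched outside the stylesheet. The following is not the mydistinct extension function itself: it evaluates the same XPath expression with the standard javax.xml.xpath API and keeps the first occurrence of each form. The class name and sample data are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: lists each distinct morpheme form once.
class DistinctForms {
    static final String SAMPLE = "<TEXT><S id=\"s1\">"
        + "<W><M><FORM>hipt</FORM></M><M><FORM>u</FORM></M></W>"
        + "<W><M><FORM>hipt</FORM></M></W>"
        + "</S></TEXT>";

    /** Returns the text of every .//W/M/FORM node, without duplicates. */
    static List<String> distinct(String xml) {
        try {
            NodeList forms = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                ".//W/M/FORM", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            // A LinkedHashSet drops repeats while keeping first-seen order.
            Set<String> seen = new LinkedHashSet<>();
            for (int i = 0; i < forms.getLength(); i++) {
                seen.add(forms.item(i).getTextContent());
            }
            return new ArrayList<>(seen);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate the expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(distinct(SAMPLE));
    }
}
```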
3.3 Formatting HTML output
- Example 5: To prepare our wordlist for presentation
to the client web browser, we add HTML tags (<html>, <head>,
<body>, <ul>, <li>, etc.) to our output. The new result
tree can be interpreted by a web browser.
3.4 Browsing and further querying
- Example 6: We would like the user to be able to
follow up by selecting an item from the list of distinct morphemes and
passing it as a parameter to a stylesheet which defines a further processing
operation, in this case a morpheme search. To this end, we place each list
item in an anchor. When an item is selected by the user, its value is passed
to the find.xsl stylesheet, and the XSLT processor is called to
execute the search. The anchored list items have the following form:
<a href="/servlets/xslservlet?XML=data.xml&XSL=find.xsl&WHAT=xyz">xyz</a>.
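How such an anchor might be generated for an arbitrary morpheme is sketched below. The servlet path and parameter names are taken from the form shown above; the class and method names are hypothetical. Unlike the literal form above, the sketch percent-encodes the parameter values and writes the ampersands in the href as &amp;, which is an assumption about what is desirable when morphemes contain non-ASCII or reserved characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the anchor for one item of the morpheme list.
class MorphemeLink {
    /** Percent-encodes a query-string value as UTF-8. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Escapes the characters that are special in HTML text and attributes. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Returns the request URL for a search of one morpheme. */
    static String href(String xml, String xsl, String morpheme) {
        return "/servlets/xslservlet?XML=" + enc(xml)
            + "&XSL=" + enc(xsl) + "&WHAT=" + enc(morpheme);
    }

    /** Returns the complete anchor element as HTML text. */
    static String anchor(String xml, String xsl, String morpheme) {
        return "<a href=\"" + escapeHtml(href(xml, xsl, morpheme)) + "\">"
            + escapeHtml(morpheme) + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(anchor("data.xml", "find.xsl", "xyz"));
    }
}
```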
4. Architecture for web dissemination
For disseminating the archived material we have chosen to rely as far as
possible on standard web technology, adopting the web browser as the unique
access tool. Standard web technology -- browsers and HTML -- gives immediate
access to the following capabilities, in principle independently of the platform:
- unlimited distribution without a physical medium.
- rendering, including Unicode support.
- multimedia.
- an interactive user interface.
Web dissemination implies a division of labor between a web server and
a web client. The functions of each are listed below; their interaction
is represented schematically in a rough flow-chart.
4.1 Server-side
- The archived XML documents are stored on the server.
- The sound resources are stored on the server, in any one of the usual formats
(AIFF, WAV, AVI, etc.). We currently use a compressed format, MPEG2 Layer
3 (*.MP3), to minimize transfer time.
- XSLT stylesheets (which are also XML documents) are stored on the server.
- The only specific processing program installed on the server is a Java
servlet which calls the Apache project's Xalan
XSLT processor. The XSLT processor applies the XSLT stylesheets to the
XML documents, outputting HTML which can be interpreted directly by a web
browser.
All data processing on the server that was previously (cf. Jacobson et
al. 2001) handled by Perl scripts, XML-QL, etc., is now handled by XSLT
stylesheets. The CGI interface has been replaced by the servlet and by
an XSLT-defined interface.
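The core of the server-side step, applying a stylesheet to a document with request parameters, can be sketched with the standard javax.xml.transform API, which Xalan also implements. This is not the archive's servlet: the servlet wrapper is omitted, the JDK's built-in processor is used, and the class name, the search stylesheet and the sample data are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch of what a request handler might do with its parameters.
class XsltRunner {
    // Assumed search stylesheet: lists the ids of the utterances that
    // contain the morpheme passed in the WHAT parameter.
    static final String FIND_XSL = String.join("\n",
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"html\"/>",
        "  <xsl:param name=\"WHAT\"/>",
        "  <xsl:template match=\"/\">",
        "    <ul>",
        "      <xsl:for-each select=\".//S[W/M/FORM = $WHAT]\">",
        "        <li><xsl:value-of select=\"@id\"/></li>",
        "      </xsl:for-each>",
        "    </ul>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    static final String SAMPLE_XML = "<TEXT>"
        + "<S id=\"s1\"><W><M><FORM>hipt</FORM></M><M><FORM>u</FORM></M></W></S>"
        + "<S id=\"s2\"><W><M><FORM>bal</FORM></M></W></S>"
        + "</TEXT>";

    /** Applies a stylesheet to a document, passing params as stylesheet parameters. */
    static String transform(String xml, String xsl, Map<String, String> params) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            for (Map.Entry<String, String> p : params.entrySet()) {
                t.setParameter(p.getKey(), p.getValue());
            }
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            transform(SAMPLE_XML, FIND_XSL, Collections.singletonMap("WHAT", "hipt")));
    }
}
```

In a servlet, the XML, XSL and WHAT values would come from the request's query string, and the result would be written to the response rather than returned as a string.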
4.2 Client-side
- The client-side browser interprets and displays the HTML produced by the
servlet.
- A Java applet declared in the HTML handles the
sound resource, downloading it from the server and playing it in the
browser's Java Virtual Machine environment.
- A JavaScript script, also declared in the HTML,
handles communication between the applet and the browser. Progress reports
from the applet as it plays the sound ("beginning of segment", "end
of segment", "end of play") are received by the script and trigger display
events (highlighting of appropriate text segments).
- When the user selects anchored text elements in the display, the browser
activates links in the anchor which transmit requests for further processing
to the servlet or to the applet.
The applet and the JavaScript script are explained in detail in Jacobson
et al. 2001.
5. Conclusion
The key to the LACITO Archive project has been the use of standard data
formats and structures, in particular Unicode and XML, which have allowed
us to use standard multi-platform software tools. This has permitted
us to concentrate our own efforts on designing parameters to adapt standard
software to our particular needs rather than on developing generic applications.
All archive documents are structured in XML. We have presented the LACITO
DTD, which covers a range of text annotations including time-alignment,
transcriptions and translations at different levels of segmentation (granularities).
This DTD has a few idiosyncrasies related to particular linguists' analyses
of particular language data, but its general structure is straightforward.
Adherence to the principle of logically structured text makes it relatively
easy to transform the markup if desired. Our only authoring tool, SoundIndex,
is DTD-independent.
After experimenting with CGI scripts, non-standard XML query languages,
etc., we are currently handling all data processing using an XSLT processor
and stylesheets. Here again, standardization has led to the availability
of excellent XSLT processors which can be extended to handle additional
functions as needed. In the present paper, we have concentrated on stylesheets
which define data extraction and ordering operations in XSLT -- that is,
on the use of XSLT as a query language. This is the area in which we have
made the most progress recently. We have presented for the workshop repository
an example of a morpheme index, progressively incorporating minor extensions
to the XSLT standard. In the future, we expect to pursue these techniques
in linking text documents to lexicons. Two lexicons of Nepal languages
have recently been converted to XML in prototype form.
Finally, in the area of web-dissemination we have adopted the web browser
as our unique access tool and added the functions required for exploiting
sound resources, an area not yet covered by web standards. We have presented
the software elements which constitute the web architecture of the LACITO
Archive above.
Reference:
Jacobson, Michel, Boyd Michailovsky and John B. Lowe. 2001. Linguistic
documents synchronizing sound and text. Speech Communication, special
issue: Speech annotation and corpus tools. [Preprint: http://lacito.archivage.vjf.cnrs.fr/article.pdf]