TEC: A TOOLKIT AND APPLICATION PROGRAM INTERFACE FOR DISTRIBUTED CORPUS PROCESSING

Saturnino Luz and Mona Baker
NIS Laboratory, Denmark and
Centre for Translation and Intercultural Studies, UMIST, UK
luzs@acm.org and mona@ccl.umist.ac.uk

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.


Abstract. We present TEC, a set of corpus indexing, storage and retrieval tools developed as part of the Translational English Corpus project, hosted at UMIST's Centre for Translation and Intercultural Studies. The TEC API is based on a distributed software architecture which aims to give large user groups uniform and seamless access to language resources over the web. The toolkit is implemented in Java, using XML for corpus markup and standard Internet protocols for data transfer. Our presentation of the software is followed by a discussion of issues relating to metadata and our vision of how a network of small to medium size corpora (or corpus service providers) can be built and accessed over the Internet.


1. INTRODUCTION

Our contribution to this workshop is based on our experience in building a corpus of translated text and designing the software to allow web-based access to it. In what follows, we will briefly describe both, and focus on issues related to metadata and system architecture, especially scalability and distribution of data and processing resources. The corpus in question, The Translational English Corpus (TEC), has been collected as part of an ongoing effort to apply corpus analysis to the study of the linguistic behaviour of professional translators. A set of tools for corpus indexing, storage and retrieval has been developed during the project.

TEC was conceived as a framework for distributed access (i.e. multiple clients) to distributed corpora (i.e. multiple servers), possibly relying on a centralised metadata repository for resource discovery. The constraints that motivated the design of TEC stem from the composition of the data. TEC is a corpus of written texts translated into English from a variety of source languages. It has been designed to allow researchers to study the distinctive nature of translated text (Baker, 1999). Since most of the material that forms part of TEC consists of translated fiction, copyright restrictions rule out direct access to the text. We believe that splitting the processing tasks between clients and servers, and restricting repositories to serving only concordance lines, provides an efficient and scalable solution that satisfies copyright constraints.

2. CORPORA AND TRANSLATION STUDIES

A number of people in translation studies have recently been trying to develop a basic framework for studying translation from a theoretical angle: from the angle of identifying what is unique about translation, rather than the pedagogical angle of trying to show how translation performance might be improved. The methodology involves building corpora of translated (rather than original) text in a given language, say English, and attempting to identify distinctive patterning in this corpus - distinctive by comparison with a corpus of original text production in the same language. The Translational English Corpus supports that methodology by providing a computerised collection of authentic, published translations into English from a variety of source languages and by a wide range of professional translators.

TEC provides the basis for investigating a range of issues related to the distinctive nature of translated text, the style of individual translators, the impact of individual source languages on the patterning of English, the impact of text type on translation strategies, and other issues of interest to both the translation scholar and the linguist. Most importantly, this concrete resource allows us to develop a framework for investigating the validity of theoretical statements about the nature of translation with reference to actual translation practice.

The corpus consists of unabridged, published translations into English from European and non-European source languages. Four text categories are currently represented in TEC: biography, fiction, newspapers and inflight magazines. The current size of the corpus is close to 7 million tokens. New data is being added to the corpus on a regular basis, so this figure will continue to grow in the next few years.

Although the use of corpus analysis techniques in the field of translation studies is still in its infancy, we are aware of initiatives similar to TEC being pursued in Italy, Germany and Finland. Such efforts are likely to benefit from a common framework for sharing resources or services, built along the lines of the one discussed below.

2. CORPUS ANNOTATION AND METADATA

In addition to original texts, TEC records metadata regarding the translator, the translation, the translation process, the author, and the source text. These attributes are recorded in separate header files. TEC headers are XML-encoded, though they currently do not adhere to any particular annotation guidelines (such as TEI) or metadata scheme (such as Dublin Core). There are two types of header: one for single volumes such as novels and biographies, and one for collected works, such as collections of short stories by different authors and translators or a collection of newspaper articles published in the same newspaper issue. A sample TEC header can be seen in appendix 1 (see note 1).

Some of the tags might seem a bit surprising at first. However, all the data stored in these headers have been found to be relevant to translation research. A TEC user may wish, for instance, to compare lexical density, type/token ratio, sentence length, collocational patterning, or the use or frequency of specific lexical items in:

The meta-information carried by the header is the primary source for building index tables to support this kind of specialised query.
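One of the measures mentioned above, the type/token ratio, is simple to compute once the tokens of a sub-corpus have been selected. The following is a minimal Java sketch, not code from the TEC toolkit; the class and method names are hypothetical, and tokens are assumed to be already split and compared case-insensitively.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the TEC API.
class LexicalStats {
    // Type/token ratio: number of distinct word forms divided by the
    // number of running words (assumption: case is ignored).
    static double typeTokenRatio(String[] tokens) {
        if (tokens.length == 0) return 0.0;
        Set<String> types = new HashSet<>();
        for (String t : tokens) {
            types.add(t.toLowerCase());
        }
        return (double) types.size() / tokens.length;
    }
}
```

Calling typeTokenRatio on the tokens of two sub-corpora selected through header attributes would give the two figures a researcher wants to compare.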

Unlike the header files, the actual text is only lightly annotated. The primary purpose of annotating TEC text files is to preserve the integrity of the translated text. Thus, non-translated material such as an editor's note or a preface originally written in the target language is marked up and ignored by the indexing program, so that it does not produce concordances or entries in the frequency list when the database is queried by the corpus access tools. An excerpt of a TEC file (extracted from the newspaper corpus) is shown in table 1.

<omit desc="caption">
  Chancellor Kohl . . . through all the embarassing twists, still the
  dominant player
</omit>
<p> 
<omit desc="title">KING MAKING</omit>
<p> 
<omit desc="synopsis">
  Will Chancellor Kohl's vision of democracy prevail in united
  Germany's first presidential election? Robert Leicht argues that the
  liberals' time has come
</omit>
<p> Die Zeit
<p> THE GERMAN head of state has no power. But a
presidential election tells us much about power in this land - and the
health of its political culture. On Monday the first president of a
reunited Germany will be elected - but the heavily symbolic aspect of
this no longer plays a role in the outcome. For the second time in the
history
[...]

Table 1: Sample TEC file
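The effect of the omit markup on indexing can be illustrated with a minimal Java sketch. It is not the TEC indexer: it assumes that omit elements are never nested, uses regular expressions where the real tools may use an XML parser, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the TEC API: drops non-translated
// material marked with <omit> before the text is tokenised for indexing.
class OmitFilter {
    // Matches a whole <omit ...>...</omit> element (assumed not nested).
    private static final Pattern OMIT =
        Pattern.compile("<omit\\b[^>]*>.*?</omit>", Pattern.DOTALL);
    // Matches any remaining tag, such as <p>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static List<String> indexableTokens(String markedUpText) {
        String withoutOmitted = OMIT.matcher(markedUpText).replaceAll(" ");
        String plain = TAG.matcher(withoutOmitted).replaceAll(" ");
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a letter, digit or apostrophe.
        for (String t : plain.split("[^\\p{L}\\p{N}']+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}
```

Applied to the excerpt in table 1, the caption, title and synopsis would yield no tokens, so a query for "KING" would return no concordance from this file.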

3. COPYRIGHT ISSUES AND SOFTWARE REQUIREMENTS

One of our main goals in designing the TEC tools was to make corpus analysis widely accessible to translation researchers. This goal makes the Internet an obvious choice of medium. However, that choice entails certain problems of its own.

About 80% of texts in TEC consist of translated fiction, including biography. All this material is subject to copyright restrictions. In order to be able to make the corpus accessible as a resource for researchers and translators while abiding by copyright laws it was necessary to protect the original texts. This was done by allowing only indirect access to the texts, via the concordance server and browsers. The system architecture described below satisfies this constraint, since the clients only receive lists of words drawn from a variety of texts, and their immediate contexts.
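A concordance line of the kind referred to here can be sketched in a few lines of Java. This is an illustration only: the class name, the bracket notation for the keyword and the fixed window of context words are assumptions, not the format used by the TEC server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of what a concordance server sends to clients:
// each hit with a fixed window of context, never the full text.
class Kwic {
    static List<String> concordance(String[] tokens, String keyword, int window) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equalsIgnoreCase(keyword)) continue;
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length, i + window + 1);
            StringBuilder line = new StringBuilder();
            for (int j = from; j < to; j++) {
                if (j > from) line.append(' ');
                // Mark the keyword so the client can align the lines on it.
                line.append(j == i ? "[" + tokens[j] + "]" : tokens[j]);
            }
            lines.add(line.toString());
        }
        return lines;
    }
}
```

Because only such short windows leave the server, a client cannot reconstruct a protected text from the responses to ordinary queries.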

In addition to copyright restrictions, other, more traditional data and processing constraints played a role in defining the TEC architecture. Corpus processing requires efficient access to large volumes of linguistic data. Most computational approaches to corpora concentrate on building large, centralised repositories of text accessed via corpus-specific tools. Both data and tools are normally distributed to the public through physical media such as CD-ROMs, or (more recently) made available on the Internet by granting users access to a local area network where the corpus is physically stored. TEC departs from those practices by assuming that useful and efficient access to considerable amounts of (written) linguistic data does not necessarily entail accessing vast, centralised repositories. Instead, we propose an architecture where several small or medium size corpora can be kept at different sites and maintained by different organisations, while resource discovery, corpus selection and concordance merging are performed by user clients.

The proposed architecture implies that (i) some processing power must be shifted from corpus servers into the user's browser, (ii) resource discovery must be standardised in order to guarantee interoperability among different corpus service providers and their clients, and (iii) corpus browsers (i.e. clients) must be made available on a variety of platforms.

4. CURRENT SYSTEM ARCHITECTURE

The software components of TEC were designed to enable multiple users to access a common corpus, retrieve concordance sets and perform operations on these sets such as sorting and collocation. The architecture chosen to implement this functionality is the client-server model depicted in figure 1.

The basic functionality for corpus indexing and access is handled at the server side while the client only handles concordance lists generated by the server.

Figure 1: TEC Architecture

The client itself is implemented in a modular fashion, which enables new functionality to be incorporated by means of plug-ins. The core client functionality relates to managing user requests and communication with the server. Once the data has arrived at the client end the appropriate plug-ins take over the processing of the request (Luz, 2000). An API has been defined through which developers can access communication, concordance, and user interface objects, as well as client methods that control and extend the behaviour of the TEC browser. The client API has been (partially) documented. The client documentation is available at the TEC web site.

The concordance browser has been implemented in two versions: an applet, which can be used from most browsers, and a stand-alone program written entirely in Java. The core system currently incorporates the ability to sort concordance lists, browse metadata related to each concordance, activate and de-activate (non-validating) XML parsing of text files and headers, extract wider contexts, and save concordances onto the local disk.
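To give a flavour of how a plug-in might extend the client, the sketch below defines a plug-in that sorts concordance lines by the first word to the right of the keyword. The interface, the class and the three-field representation of a concordance line are all hypothetical; the actual TEC client API is the one documented at the TEC web site.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical plug-in contract, NOT the real TEC client API.
interface ConcordancePlugin {
    String name();
    List<String[]> process(List<String[]> lines);
}

// Each concordance line is modelled as {leftContext, keyword, rightContext}.
class RightContextSorter implements ConcordancePlugin {
    public String name() { return "Sort by right context"; }

    public List<String[]> process(List<String[]> lines) {
        // Work on a copy so the list received from the server is untouched.
        List<String[]> sorted = new ArrayList<>(lines);
        // Compare on the first word to the right of the keyword, ignoring case.
        sorted.sort(Comparator.comparing(
            (String[] l) -> firstWord(l[2]), String.CASE_INSENSITIVE_ORDER));
        return sorted;
    }

    private static String firstWord(String context) {
        String trimmed = context.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }
}
```

Keeping the plug-in contract this narrow means that sorting, collocation and similar operations can be added without changing the code that talks to the server.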

Indexing in TEC (i.e. the creation of inverted indices) is done off-line, prior to on-line concordance browsing. The concordance server implements its own protocol to respond to client queries. The syntax for user queries currently includes the ability to select case-sensitive or case-insensitive search, "wild-cards", and word sequences with specification of intervening context. Metadata are delivered via HTTP. The core client is therefore capable of handling both the in-house protocol used by the concordance server and standard HTTP (used for metadata lookup). We have chosen to split client-server communication between these two channels for the following reasons. Since media access is clearly a bottleneck in corpus processing, we found that the concordance server can perform more efficiently if it is dedicated to retrieving concordance data. Furthermore, although we cannot provide unrestricted access to the texts, we want to be able to publicise metadata in such a way that they can be discovered by external agents, regardless of whether they implement the TEC client interface or not. Since they are not protected by copyright, TEC metadata can be stored, perhaps along with metadata from other corpora, in a central repository.
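As an illustration of the inverted indices and wild-card queries mentioned above, the following Java sketch maps each word form to the token positions at which it occurs and supports exact and prefix lookup. It is a simplification under stated assumptions: the index is held in memory, keys are lower-cased (so only case-insensitive search is shown), and a wild-card is taken to mean a trailing "*". None of the names come from the TEC sources.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory inverted index, not the TEC on-disk format.
class InvertedIndex {
    // A sorted map makes prefix ("wild-card") lookup a simple range scan.
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    InvertedIndex(String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos].toLowerCase(),
                                     k -> new ArrayList<>()).add(pos);
        }
    }

    // Positions of an exact word form, ignoring case.
    List<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptyList());
    }

    // Positions of every word starting with the prefix, e.g. "elect*".
    List<Integer> lookupPrefix(String prefix) {
        String p = prefix.toLowerCase();
        List<Integer> hits = new ArrayList<>();
        // All keys k with p <= k < p + '\uffff' share the prefix p.
        for (List<Integer> positions
                : postings.subMap(p, p + Character.MAX_VALUE).values()) {
            hits.addAll(positions);
        }
        Collections.sort(hits);
        return hits;
    }
}
```

Building such a structure once, off-line, is what lets the server answer queries without scanning the texts themselves.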

The TEC architecture should scale up at the server side as well as it does at the client side. Therefore the architecture can evolve into one where the location of corpora (and indices) themselves is distributed. Several corpora can be kept at different sites and maintained by different organisations (represented by the boxes labelled "Server X" and "Server Y" in figure 1). Each client will be able to choose a set of corpora to query, and the query will then be broadcast to the chosen servers. Each server deals with its own query (which introduces a rudimentary form of parallelism to the model) and sends the information back to the client which merges and presents the data to the user. Although this functionality has not been fully implemented yet, the current client and server software have been designed with that natural extension in mind (see the server documentation at the TEC web site for further details).
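The broadcast-and-merge step described in this paragraph might look roughly as follows on the client side. This is a sketch only: the CorpusServer interface stands in for a remote concordance server reached over the network, and merging is reduced to concatenating the replies in server order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a remote concordance server.
interface CorpusServer {
    List<String> query(String keyword);
}

class DistributedQuery {
    // Sends the same query to every chosen server in parallel and merges
    // the returned concordance lines on the client, in server order.
    static List<String> broadcast(List<CorpusServer> servers, String keyword) {
        List<CompletableFuture<List<String>>> pending = new ArrayList<>();
        for (CorpusServer s : servers) {
            // Each server processes its own copy of the query concurrently.
            pending.add(CompletableFuture.supplyAsync(() -> s.query(keyword)));
        }
        List<String> merged = new ArrayList<>();
        for (CompletableFuture<List<String>> f : pending) {
            merged.addAll(f.join()); // wait for this server's reply
        }
        return merged;
    }
}
```

A real client would also need time-outs and error handling for unreachable servers, which are left out here for brevity.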

5. PLANS FOR AN EXTENDED FRAMEWORK

Our main short-term goal is the implementation of functionality for selecting sub-corpora within a single corpus server. At the same time, we wish to coordinate efforts with similar projects in order to develop common protocols for discovery and querying of distributed corpora. We also seek to further the development of TEC's plug-in architecture and API as an open source project, so that plug-in tools, as well as data, can be shared.

At the moment, TEC only uses metadata internally. Due to the nature of the data made available for browsing by the TEC server, our repository should probably be viewed as a service, rather than a resource. In a wider framework encompassing several corpus service providers, the structure of our metadata might need to be re-designed.

5.1 Difficult Issues

Extending the corpus processing architecture to one where multiple corpus servers inter-operate is not a trivial task. What level of granularity does one need to be able to encode in one's metadata structures? The proposals we have seen so far range from overly general sets of metadata elements, such as the Dublin Core, to the overly complex ontologies advocated by agent researchers. For the purposes of translation studies, it would be extremely convenient if our typical user (a translation researcher) had access to at least two different corpora: a translational corpus such as TEC, and a monolingual corpus, such as the British National Corpus. It seems highly unlikely, however, that corpus providers will be willing to re-implement their current server software to comply with a newly introduced proposal. A more realistic expectation would be that proxies could be implemented which would act as transducers between new and legacy protocols.
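The idea of a proxy acting as a transducer can be sketched with a simple adapter. Everything in this example is invented for illustration: neither interface corresponds to an existing TEC or BNC protocol, and the legacy query syntax ("CS:"/"CI:" prefixes) and tab-separated reply format are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical legacy server interface (not a real protocol).
interface LegacyCorpus {
    // Assumed legacy reply: one "left<TAB>keyword<TAB>right" string per hit.
    String[] search(String legacyQuery);
}

// Hypothetical common interface shared by newer clients.
interface ConcordanceService {
    List<String> concordance(String keyword, boolean caseSensitive);
}

// The proxy rewrites requests into the legacy syntax and replies into the
// common format, so the legacy server itself needs no changes.
class LegacyProxy implements ConcordanceService {
    private final LegacyCorpus legacy;

    LegacyProxy(LegacyCorpus legacy) { this.legacy = legacy; }

    public List<String> concordance(String keyword, boolean caseSensitive) {
        String legacyQuery = (caseSensitive ? "CS:" : "CI:") + keyword;
        List<String> lines = new ArrayList<>();
        for (String hit : legacy.search(legacyQuery)) {
            String[] f = hit.split("\t", -1);
            lines.add(f[0] + " [" + f[1] + "] " + f[2]);
        }
        return lines;
    }
}
```

For such a proxy to be generated or located automatically, the legacy query syntax and reply format would themselves have to be described somewhere, which is the metadata question raised in the next paragraph.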

Protocol descriptions obviously need to be publicised in order for this kind of approach to work. It is also desirable that transducers (or whatever approach is adopted) operate autonomously, otherwise "resource discovery" would become quite an uninteresting task for researchers. The question then is: should processing constraints, such as protocol descriptions, form part of the metadata used to describe a resource? Little, if anything, seems to have been done toward integrating legacy systems in existing standardisation efforts. We feel, however, that the Dublin Core initiative has taken a step in the right direction by recasting its model within the Resource Description Framework (RDF), thus providing different resource maintainers with mechanisms for extending the basic model (Miller et al. 1999). It is clear that resource discovery is only part of the problem to be addressed by a web-based repository of language documentation and description. One also needs to be able to use the resources once they have been discovered. The issue of how much, in terms of processing constraints, to describe with metadata seems to be a very relevant one, and one which has yet to receive the attention it deserves.

6. AVAILABILITY OF THE TEC SOFTWARE

In order to stimulate the involvement of the research community in the project, the TEC toolkit has been made available as free software, under the terms of the FSF's General Public License. Client and server sources, installation instructions, and API documentation can be downloaded from the download area of the TEC web site (see note 2). A TEC server of translated English runs on a permanent basis at UMIST, so that TEC browsers can be tested. Although the current release of the software is at beta stage, the server has been actively used by students and researchers for several months.

REFERENCES

Baker, Mona. 1999.
The role of corpora in investigating the linguistic behaviour of professional translators. International Journal of Corpus Linguistics, 4(2):281-298.

Luz, S. 2000.
A software toolkit for sharing and accessing corpora over the Internet. In Proceedings of the Second International Conference on Language Resources and Evaluation: LREC-2000, pages 1749-1754, May.

Miller, E., Miller, P. and Brickley, D. 1999.
Guidance on expressing the Dublin Core within the Resource Description Framework (RDF). Dublin Core Working Draft. http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-rdf/


Notes:

1 The DTD has been omitted for clarity.
2 The TEC web site is located at http://tec.ccl.umist.ac.uk/.