The application of annotation models for the construction of databases and tools: Overview and analysis of MPI work since 1994

Hennie Brugman, Peter Wittenburg
Max Planck Institute for Psycholinguistics
Nijmegen, The Netherlands

At the Max Planck Institute for Psycholinguistics (MPI) [1], software development in the area of linguistic annotation has been ongoing since 1994. This has resulted in a few generations of models, linguistic databases and software tools. This paper gives an overview and critical evaluation of this evolutionary process and relates it to recent efforts to attain standardization and cooperation.

One of the first tools developed at the MPI was MediaTagger [2], a Macintosh program for the annotation of QuickTime movies. MediaTagger's implicit data model already supported multiple, user-definable and typed tiers, time constraints between annotations on different tiers, and the association of closed vocabularies for annotations with tiers.

Shortly after MediaTagger's first release, a first formal data model for linguistic annotations was constructed, applying methods from database design. This relational model was implemented in Oracle. MediaTagger was extended to work with this database system in client-server mode over ODBC [3]. A graphical tool for the specification of multilayer, temporal queries was designed and built using OracleForms and OracleGraphics. This database currently contains approximately 80,000 annotations on 3,000 tiers associated with 500 digital video movies. However, there are considerable limitations to this solution.

Consequently, in 1997 the EUDICO (European Distributed Corpora) project was initiated [4]. Its aim is to build a software framework for the creation and exploitation of linguistic annotations of multimedia signals. This framework should enable annotation file format independence, computer platform independence, client-server operation on distributed archives on the Internet, and the use of streaming media.
It should also be possible to add new software tools and annotation file formats. To achieve this, we chose a three-layer architecture, as commonly used for database systems. Tools in the application layer use the generic services of objects in a logical layer. These objects can be instantiated from a number of differently formatted resources at the physical layer.

The objects at the logical layer are specified by an object-oriented model for linguistic annotations, the Abstract Corpus Model (ACM). The paper will present two generations of the ACM and compare them with similar models, in particular ATLAS/AIF [5]. It will evaluate how far EUDICO's aims are achieved by each generation of the model.

EUDICO technology is currently applied in the Spoken Dutch Corpus project [6], the DOBES project [7], and a recently released annotation tool (EAT, the Eudico Annotation Tool). In the near future we expect to add metadata support to the EUDICO framework by extending the ACM and by integrating our IMDI-compliant [8] metadata browser.

[1] http://www.mpi.nl
[2] http://www.mpi.nl/world/tg/CAVA/mt/MTgeneral.html
[3] http://www.mpi.nl/world/tg/CAVA/CAVA.html
[4] http://www.mpi.nl/world/tg/lapp/eudico/eudico.html
[5] http://www.nist.gov/speech/atlas/
[6] http://lands.let.kun.nl/cgn/home.htm
[7] http://www.mpi.nl/DOBES/
[8] http://www.mpi.nl/ISLE/
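
The three-layer architecture described in the abstract can be illustrated with a minimal sketch. It is not the EUDICO implementation or the actual ACM: every class, method and the tab-separated file format below (`Annotation`, `Tier`, `Resource`, `TabSeparatedResource`, `Transcription`, `format_tier`) is a hypothetical name chosen for illustration. The sketch only shows how an application-layer function might depend on logical-layer objects alone, while those objects are instantiated from interchangeable physical-layer resources.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Annotation:
    """A time-aligned label on a tier (hypothetical, not the ACM class)."""
    begin_ms: int
    end_ms: int
    value: str


@dataclass
class Tier:
    """A named layer of annotations, optionally with a closed vocabulary."""
    name: str
    vocabulary: Optional[FrozenSet[str]] = None  # None means free text
    annotations: List[Annotation] = field(default_factory=list)

    def add(self, begin_ms: int, end_ms: int, value: str) -> None:
        if end_ms < begin_ms:
            raise ValueError("annotation ends before it begins")
        if self.vocabulary is not None and value not in self.vocabulary:
            raise ValueError(f"{value!r} is not in the vocabulary of {self.name!r}")
        self.annotations.append(Annotation(begin_ms, end_ms, value))


class Resource(ABC):
    """Physical layer: one subclass per annotation file format."""

    @abstractmethod
    def load_tiers(self) -> List[Tier]:
        ...


class TabSeparatedResource(Resource):
    """Assumed toy format: tier<TAB>begin_ms<TAB>end_ms<TAB>value per line."""

    def __init__(self, text: str) -> None:
        self.text = text

    def load_tiers(self) -> List[Tier]:
        tiers: Dict[str, Tier] = {}
        for line in self.text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            name, begin, end, value = line.split("\t")
            tiers.setdefault(name, Tier(name)).add(int(begin), int(end), value)
        return list(tiers.values())


class Transcription:
    """Logical layer: format-independent services used by tools."""

    def __init__(self, resource: Resource) -> None:
        self.tiers = {tier.name: tier for tier in resource.load_tiers()}

    def overlapping(self, tier_name: str, begin_ms: int, end_ms: int) -> List[Annotation]:
        """Return annotations on a tier that overlap the given interval."""
        return [a for a in self.tiers[tier_name].annotations
                if a.begin_ms < end_ms and a.end_ms > begin_ms]


def format_tier(doc: Transcription, tier_name: str) -> List[str]:
    """Application layer: a tool that only talks to the logical layer."""
    return [f"{a.begin_ms}-{a.end_ms}: {a.value}"
            for a in doc.tiers[tier_name].annotations]
```

Under these assumptions, supporting a further annotation file format would mean adding another `Resource` subclass; `Transcription` and `format_tier` would not need to change, which is the kind of format independence the framework aims for.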