IRCS Workshop on Linguistic Databases
11-13 December 2001
University of Pennsylvania, Philadelphia, USA
Organized by Steven Bird, Peter Buneman and Mark Liberman
Funded by the National Science Foundation
Workshop Overview
Linguistic databases are digital repositories of structured information
intended to document natural language and natural communicative
interaction. Over the last decade, linguistic databases have come to stand
at the center of empirical research in the language sciences, and in the
development of new human language technologies. Like genomic databases,
linguistic databases are complex, evolving and richly annotated
repositories, and pose interesting challenges for efficient representation,
indexing and query. And like most scientific databases, linguistic
databases have made little use of standard database technology.

The goals of the workshop are to take stock of existing research in
linguistic databases, to identify the key problems, and to explore
applications of current database research to these problems. More broadly,
the workshop will help define the research questions of a new "linguistic
database community" and initiate the ongoing interchange of relevant
problems and results between this community and the database community at
large.

The workshop is expected to attract participants from a range of
specialties including databases, linguistics, computational linguistics,
annotation and markup. There will be tutorial-style presentations on
relevant models in each of these areas.

The workshop will address a selection of the following topics:
MODELS
- models for text databases, speech databases, multimodal databases,
typological databases, geographical databases (language maps),
and metadata repositories
- relational, object-oriented and semi-structured models for
representing linguistic annotations
- representations for specific linguistic datatypes (e.g. databases of
aligned parallel text)
- modelling temporal and (geo)spatial structure
- critical analysis of existing linguistic databases
- special problems for systematic data representation posed by
linguistic fieldwork
LANGUAGES
- query of multilayer annotations
- linguistic applications/extensions of XML query languages
- analysis of existing ad hoc query languages
- queries over temporal and (geo)spatial structure
OTHER TOPICS
- database support (e.g. what standard database technology has proven
worthwhile for linguistic databases?)
- systematic methods for populating linguistic databases
- appropriate indexing methods for linguistic strings and structures
- archiving and preservation
- metadata standards serving as finding aids for linguistic databases
- data provenance / data lineage
- annotation servers
Steven Bird (LDC), Peter Buneman (CIS) & Mark Liberman (Linguistics)
Email: sb@ldc.upenn.edu, peter@cis.upenn.edu, myl@unagi.cis.upenn.edu