Multidimensional Exploration of Linguistic Databases

Steven Bird, University of Pennsylvania

Linguistic Data Consortium, University of Pennsylvania
3615 Market St, Suite 200, Philadelphia, PA 19104-2608, USA
sb@ldc.upenn.edu | www.ldc.upenn.edu/sb


This is a new, 3-year NSF project, awarded to Steven Bird and Mark Liberman. The proposal is available online at: www.ldc.upenn.edu/exploration/nsf.pdf.

Overview

Advances in mass storage technology now make it possible to collect almost arbitrary amounts of speech, text and other linguistic data in digital form. Advances in computer hardware and software make it possible to annotate this data efficiently with representations of linguistic structure and function, and to search the results flexibly and easily. Advances in networking mean that vast amounts of such data can be published at little or no marginal cost. These developments have revolutionized research practices in speech and language technology over the past decade, and have strongly influenced scientific research on language. More profound changes will follow as research practices catch up with the new opportunities.

This project aims to foster a new mode of fundamental research in linguistics, namely "web-based exploration of linguistic field data." The objectives are to develop a data model for representing linguistic databases, to develop tools for manipulating such databases, to store and disseminate large datasets using the model, to exploit the tools and datasets in teaching and research, and -- underlying all of the above -- to explore and exploit new methods for representing and analyzing multimodal linguistic data. A set of collaborators has granted access to their field data for the purposes of this project, and has agreed to test the new tools on their data; letters of support are attached. A larger set of potential collaborators has also been identified [www.ldc.upenn.edu/sb/exploration.html] and will form an important contact group as the project progresses.

All of the primary data created by this project will be published on the web site of the Linguistic Data Consortium (LDC), for general public access. Materials created by other collaborators will also be published there, subject to the permission of the authors. All tools and documentation produced by the project will be freely available to others.

Objectives

1: to represent multimodal linguistic databases
Create a data model for representing inter-linked linguistic data including lexicons, (interlinear) texts, field notes, (annotated) recordings, paradigms, grammar sketches, (annotated) maps and photographs, and problem sets.
2: to develop tools for manipulating linguistic databases
Develop platform-independent open-source tools for creating, browsing, searching, querying and transforming inter-linked linguistic data, integrated with existing tools and formats, and tailored to the continuously evolving data models associated with ongoing field research.
3: to store and disseminate large datasets using the model
Upload, inter-link and enrich existing multimodal field data, beginning with the proposer's own data, and expanding to include the work of other Penn collaborators, then scholars outside Penn. Distribute the datasets and tools on CD-ROM and via the Web.
4: to exploit the tools and datasets in teaching and research
Throughout the development process, investigate phonological, morphological, orthographic and lexical properties of the datasets as part of ongoing primary research on the languages, and in concert with the teaching of courses on field methods, computational linguistics, and particular minority languages.
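
To make the idea of inter-linked data in Objectives 1 and 2 more concrete, the following is a minimal sketch in Python, not the project's actual data model (which this document does not specify). It assumes three record types (lexical entries, recordings, and lines of interlinear text) connected by identifiers, plus two simple queries of the kind a browsing tool might offer. All class names, field names, identifiers, file paths and data values are hypothetical placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LexicalEntry:
    entry_id: str
    form: str    # placeholder for an orthographic or phonemic form
    gloss: str   # placeholder for an English gloss


@dataclass
class Recording:
    recording_id: str
    path: str    # hypothetical location of an audio file


@dataclass
class InterlinearLine:
    entry_ids: list[str]                 # links into the lexicon, one per word
    translation: str
    recording_id: Optional[str] = None   # optional link to a recording
    start_sec: Optional[float] = None    # optional time span in that recording
    end_sec: Optional[float] = None


@dataclass
class FieldDatabase:
    lexicon: dict[str, LexicalEntry] = field(default_factory=dict)
    recordings: dict[str, Recording] = field(default_factory=dict)
    lines: list[InterlinearLine] = field(default_factory=list)

    def lines_citing(self, entry_id: str) -> list[InterlinearLine]:
        """Return every text line that links to the given lexical entry."""
        return [line for line in self.lines if entry_id in line.entry_ids]

    def gloss_line(self, line: InterlinearLine) -> str:
        """Build a gloss tier for a line by following its lexicon links."""
        return " ".join(self.lexicon[e].gloss for e in line.entry_ids)

    def dangling_links(self) -> list[str]:
        """List identifiers that lines refer to but that are not in the database."""
        missing = set()
        for line in self.lines:
            missing.update(e for e in line.entry_ids if e not in self.lexicon)
            if line.recording_id and line.recording_id not in self.recordings:
                missing.add(line.recording_id)
        return sorted(missing)


def build_example() -> FieldDatabase:
    """Assemble a tiny database from made-up placeholder records."""
    db = FieldDatabase()
    for eid, form, gloss in [("lx1", "form-a", "gloss-a"),
                             ("lx2", "form-b", "gloss-b"),
                             ("lx3", "form-c", "gloss-c")]:
        db.lexicon[eid] = LexicalEntry(eid, form, gloss)
    db.recordings["rec1"] = Recording("rec1", "audio/rec1.wav")
    db.lines.append(InterlinearLine(["lx1", "lx2"], "placeholder translation 1",
                                    recording_id="rec1", start_sec=0.0, end_sec=1.5))
    db.lines.append(InterlinearLine(["lx3", "lx1"], "placeholder translation 2"))
    return db


if __name__ == "__main__":
    db = build_example()
    print(len(db.lines_citing("lx1")))   # number of lines that cite entry lx1
    print(db.gloss_line(db.lines[0]))    # gloss tier derived from the lexicon
    print(db.dangling_links())           # unresolved links, if any
```

Storing links as identifiers rather than copies is one possible design choice: a corrected gloss in the lexicon is then reflected in every text that cites the entry, and a check such as dangling_links can report broken links as the data model evolves during fieldwork.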

Workplan

The three phases sketched below will each occupy approximately one year.

Phase 1: HyperLex Version 2. In the first phase, we will extend HyperLex (Version 1) from a language-specific prototype running inside web browsers to a flexible and powerful tool that works with any language (HyperLex Version 2), while continuing to conduct research on some of the languages and on appropriate models for linguistic databases.

Phase 2: New Models and Prototypes for Linguistic Exploration. In the second phase, we will create a family of prototypes for creating, managing and inter-linking the full range of data types, incorporate a large amount of linguistic field data into the model, and explore theoretical issues in database design for linguistic corpora.

Phase 3: Linguistic Analysis, Dissemination and Teaching. The models, tools and datasets provided by the first and second phases will be applied to a range of linguistic research problems. At least one publication or conference paper will be produced in each of the following areas: Tone in West African languages; The data provenance problem in linguistic databases; A query language for phonological exploration; and Field linguistics as a computational problem.

Solicitation

I would like to hear from anyone who wishes to disseminate linguistic field data, or to test the tools developed by this project.
