This is a new, 3-year NSF project, awarded to Steven Bird and Mark Liberman. The proposal is available online at: www.ldc.upenn.edu/exploration/nsf.pdf.
Advances in mass storage technology now make it possible to collect almost arbitrary amounts of speech, text and other linguistic data in digital form. Advances in computer hardware and software make it possible to annotate this data efficiently with representations of linguistic structure and function, and to search the results flexibly and easily. Advances in networking mean that vast amounts of such data can be published at little or no marginal cost. These developments have revolutionized research practices in speech and language technology over the past decade, and have strongly influenced scientific research on language. More profound changes will follow as research practices catch up with the new opportunities.
This project aims to foster a new mode of fundamental research in linguistics, namely ``web-based exploration of linguistic field data.'' The objectives are to develop tools for manipulating linguistic databases, to store and disseminate large datasets using the model, to exploit the tools and datasets in teaching and research, and -- underlying all of the above -- to explore and exploit new methods for representing and analyzing multimodal linguistic data. A set of collaborators has granted access to their field data for the purposes of this project and has agreed to test the new tools on their data; letters of support are attached. A larger set of potential collaborators has also been identified [www.ldc.upenn.edu/sb/exploration.html] and will form an important contact group as the project progresses.
All of the primary data created by this project will be published on the web site of the Linguistic Data Consortium (LDC), for general public access. Materials created by other collaborators will also be published there, subject to the permission of the authors. All tools and documentation produced by the project will be freely available to others.
The three phases sketched below will each occupy approximately one year.
Phase 1: HyperLex Version 2. In the first phase, we will extend HyperLex (Version 1) from a language-specific prototype running inside web browsers to a flexible and powerful tool that works with any language (HyperLex Version 2), while continuing to conduct research on some of the languages and on appropriate models for linguistic databases.
Phase 2: New Models and Prototypes for Linguistic Exploration. In the second phase, we will develop a family of prototypes for creating, managing and inter-linking the full range of data types, incorporate a large amount of linguistic field data into the model, and explore theoretical issues in database design for linguistic corpora.
Phase 3: Linguistic Analysis, Dissemination and Teaching. In the third phase, the models, tools and datasets provided by the first and second phases will be applied to a range of linguistic research problems. At least one publication or conference paper will be produced in each of the following areas: tone in West African languages; the data provenance problem in linguistic databases; a query language for phonological exploration; and field linguistics as a computational problem.
I would like to hear from anyone who wishes to disseminate linguistic field data or to test the tools developed by this project.