This workshop will lay the foundation of an open, web-based infrastructure for collecting, storing and disseminating the primary materials which document and describe human languages, including wordlists, lexicons, annotated signals, interlinear texts, paradigms, field notes, and linguistic descriptions, as well as the metadata which indexes and classifies these materials. The infrastructure will support the modeling, creation, archiving and access of these materials, using centralized repositories of metadata, data, best practice guidelines, and open software tools.
1. BACKGROUND
Recent years have witnessed dramatic advances in mass storage and web delivery technologies, making it possible to house virtually unlimited quantities of speech data online, and to disseminate this data over the web. The development of XML and Unicode greatly facilitates the interchange and reuse of structured multimodal and multilingual data and the development of interoperating software tools. These developments are having a pervasive influence on the way primary linguistic data are gathered, stored, analyzed and disseminated, as demonstrated by the initiatives surveyed on the linguistic exploration page (http://www.ldc.upenn.edu/exploration/), and the papers presented at the Linguistic Exploration Workshop at the Chicago LSA Meeting (http://www.ldc.upenn.edu/exploration/LSA/).
1.1 Challenges
With these new technological opportunities come concomitant needs and challenges for modeling, creating, archiving and accessing data:
Many of the activities listed above are already underway; the lure of the technology is great despite the lack of infrastructure. However, it is beyond the capacity of any single individual or institution to develop this infrastructure of standards and tools on their own. There is a pressing need for close cooperation between these initiatives, so that scarce human, software and data resources are used optimally.
1.2 Basic Data Types
A diverse range of data types was listed in I above. These are explained below.
| Metadata: | This covers the description of what is in a data resource that can be used in the online catalog as an aid to finding the resource. |
| Word list: | A list of wordforms in the language indexed by reference glosses (for example, a Swadesh word list). Unlike a lexicon, a word list is indexed against a list of universal reference glosses, which provides a data structure for cross-linguistic comparison. |
| Lexicon: | A listing of the lexical items in a language with descriptions of their phonological form, morphosyntactic function, and semantics. |
| Annotated signal: | Recorded digital signals (audio, video, physiological), or digital images (photograph, map, scanned text), that are annotated with descriptive and analytical information. |
| Writing system: | A description of the writing system used to express text in the language. |
| Interlinear text: | A primary text annotated with one or more of the following kinds of information: citation wordforms, morphologically analyzed wordforms, part-of-speech tags, glosses, and phrasal translations. |
| Paradigm: | Any kind of rational tabulation of forms, such as words or phrases demonstrating phonological, morphological, syntactic or semantic contrasts. |
| Field notes: | The record of a linguist's observations, including fragments of all the other data types plus free commentary. |
| Linguistic description: | Any work of prose that describes some aspect of the language (for instance, a grammar sketch or a workpaper on phonology). |
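To make the contrast between a word list and a lexicon concrete, the following Python sketch indexes wordforms by a shared set of reference glosses. The three glosses, the two sample languages and the helper names (`compare`, `comparison_table`) are illustrative assumptions, not part of any proposed format.

```python
# Illustrative sketch only: word lists keyed by shared reference glosses.
# The gloss set and helper functions are assumptions made for this example.

REFERENCE_GLOSSES = ["water", "fire", "hand"]

# Each word list maps a reference gloss to a wordform in one language.
word_lists = {
    "German": {"water": "Wasser", "fire": "Feuer", "hand": "Hand"},
    "Dutch": {"water": "water", "fire": "vuur", "hand": "hand"},
}


def compare(gloss, lists):
    """Return the wordform that each language has for one reference gloss."""
    return {language: forms.get(gloss) for language, forms in lists.items()}


def comparison_table(glosses, lists):
    """Build one row per gloss, with one column per language (sorted by name)."""
    languages = sorted(lists)
    return [[gloss] + [lists[lang].get(gloss) for lang in languages]
            for gloss in glosses]


if __name__ == "__main__":
    for row in comparison_table(REFERENCE_GLOSSES, word_lists):
        print("\t".join(str(cell) for cell in row))
```

Because every list shares the same gloss keys, rows line up across languages without any further alignment step. A lexicon, keyed by each language's own lexical items, offers no such common index.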
1.3 General Functions involving the Data Types
Whether in practice or in theory, data of these various kinds is subject to various general functions, as listed below.
| Store: | In order to store data, formats are established using XML DTDs (or Schemas) and documented with coding manuals. For each format, the starting point is an abstract data type, a logical model of a certain kind of data and the well-defined operations on that data. The formats have corresponding application programming interfaces. |
| Create: | Existing and new tools for creating data support interchange using the best practice formats. Agreed application programming interfaces support tool reuse and integration. |
| Convert: | Data is converted from its original format to the format prescribed as best practice (import). Data is converted from the best practice format to other formats for analysis (export). Conversion encompasses both character encoding and markup. |
| Display: | For the display of data, appropriate visualizations are established using stylesheets (especially XSL), or special-purpose programs (e.g. applets and standalone applications). |
| Query: | Users query metadata to find resources, then they query the resources to access their content. This is accomplished with general-purpose XML and relational query languages, specialized forms-based interfaces, and standalone applications. |
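The Store, Convert and Query functions can be sketched for a single lexicon entry. This is a minimal illustration under stated assumptions: the `LexEntry` class, the element names (`lexicon`, `entry`, `form`, `gloss`), the semicolon-delimited legacy format and its Latin-1 encoding are all hypothetical, not a proposed best-practice format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class LexEntry:
    """Abstract data type for one lexical item (hypothetical model)."""
    form: str
    pos: str
    gloss: str

    def to_xml(self):
        """Store: serialize the entry as an XML element."""
        entry = ET.Element("entry", {"pos": self.pos})
        ET.SubElement(entry, "form").text = self.form
        ET.SubElement(entry, "gloss").text = self.gloss
        return entry

    @classmethod
    def from_xml(cls, entry):
        """Rebuild an entry from its XML element."""
        return cls(entry.findtext("form"), entry.get("pos"),
                   entry.findtext("gloss"))


def import_legacy(raw, encoding="latin-1"):
    """Convert: decode legacy bytes of the assumed shape 'form;pos;gloss'."""
    form, pos, gloss = raw.decode(encoding).strip().split(";")
    return LexEntry(form, pos, gloss)


def build_lexicon(entries):
    """Collect entries under a single <lexicon> root element."""
    root = ET.Element("lexicon")
    for item in entries:
        root.append(item.to_xml())
    return root


def query_by_pos(lexicon, pos):
    """Query: select entries by part of speech using an XPath predicate."""
    return [LexEntry.from_xml(e)
            for e in lexicon.findall(f"entry[@pos='{pos}']")]


if __name__ == "__main__":
    legacy = "Mädchen;noun;girl".encode("latin-1")
    lexicon = build_lexicon([import_legacy(legacy),
                             LexEntry("gehen", "verb", "go")])
    # Export as UTF-8, covering the character-encoding half of conversion.
    print(ET.tostring(lexicon, encoding="utf-8").decode("utf-8"))
    print(query_by_pos(lexicon, "noun"))
```

The class plays the role of the abstract data type, and its methods stand in for the application programming interface that accompanies a format. A real format would additionally be constrained by a DTD or Schema, which this sketch does not validate against.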
The relationship between these functions and the challenges listed in 1.1 is direct: the Store function corresponds to I Data Models and II Data Archives, while the Create and Convert functions correspond to III Data Creation, and the Display and Query functions correspond to IV Data Access.
2. WORKSHOP OBJECTIVES
This workshop will lay the foundation of an open, web-based infrastructure for collecting, storing and disseminating the primary materials which document and describe human languages. The infrastructure will support the modeling, creation, archiving and access of these materials, using centralized repositories of metadata, data, best practice guidelines, and open software tools.
To meet this goal, we have identified three main objectives which can be substantially achieved at the present time:
2.1. Objective 1: The Infrastructure Framework
The resources and expertise are spread across many institutions, and so a distributed approach is necessary. Such an approach can work so long as certain principal components of the infrastructure are centralized. Some such components are listed below.
| (a) | A set of standards for metadata and data formatting needs to be adopted by the community and managed at a centralized site that can be accessed by all archives, linguists, and software developers. Similarly, software that supports these standards needs to be deposited at a centralized site so as to prevent individual archives and researchers from having to reinvent it. |
| (b) | Standardized metadata for every holding of each archive need to be deposited in a centralized catalog so that researchers the world over have only a single web site to consult in order to find out what is available in all the linguistic archives on the web. |
| (c) | A standardized system for identifying and classifying the world's languages needs to be maintained at a centralized site so that resources about the same language are consistently identified as such in every archive that contains documentation for it. |
All of these centralized functions need not be at the same site, as long as the linguistic community knows where to go for each of the centralized functions.
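Components (b) and (c) depend on each other: a central catalog can only be searched by language if every archive labels its holdings with the same identifier. The following Python sketch illustrates the idea. The record fields, archive names, titles and language codes are placeholders invented for this example, and the `Catalog` class is a hypothetical design rather than a specification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """One catalog entry describing a holding (hypothetical fields)."""
    archive: str
    title: str
    language_code: str
    data_type: str


class Catalog:
    """Central catalog that accepts only standardized language codes."""

    def __init__(self, known_codes):
        # known_codes stands in for the centrally maintained language list.
        self.known_codes = set(known_codes)
        self._by_language = defaultdict(list)

    def deposit(self, record):
        """Add a record, rejecting codes outside the shared standard."""
        if record.language_code not in self.known_codes:
            raise ValueError(f"unknown language code: {record.language_code}")
        self._by_language[record.language_code].append(record)

    def find(self, language_code, data_type=None):
        """Return every holding for a language, optionally filtered by type."""
        records = self._by_language.get(language_code, [])
        return [r for r in records
                if data_type is None or r.data_type == data_type]


if __name__ == "__main__":
    catalog = Catalog(known_codes={"xxa", "xxb"})  # placeholder codes
    catalog.deposit(MetadataRecord("Archive A", "Field recordings", "xxa",
                                   "Annotated signal"))
    catalog.deposit(MetadataRecord("Archive B", "Comparative word list", "xxa",
                                   "Word list"))
    for record in catalog.find("xxa"):
        print(record.archive, "-", record.title)
```

A single call to `find` returns holdings from both archives because both used the same code. Had one archive used a different label for the same language, its deposit would be rejected rather than silently filed elsewhere.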
2.2. Objective 2: Defining the Repositories
Open, online repositories for the above infrastructural components need to be identified, and created where they do not already exist. At least the following three repositories will be required, corresponding to the components listed in 2.1 above.
| (a) | a repository for models, formats and tools |
| (b) | a repository for metadata |
| (c) | a repository for language classifications |
For (a), we need to identify the kinds of data for which the community needs to establish guidelines for best practice; develop best practice guidelines for the markup of each data type along with corresponding DTDs; develop stylesheets, converters, and other software for each markup standard; and post all of these to a repository. Note that, while a single format might be nominated as best practice, other formats that are in use could still be listed and documented.
When mature, this repository could hold hundreds of resources; thus it is important to develop a means for organizing it. A possible organization is a two-dimensional "repository framework" based on the basic data types and general functions enumerated above. In the following table, the horizontal dimension ranges over the functions and the vertical dimension ranges over the data types. The bottom row, labeled 'Common', is for resources that are common to all data types (such as fonts or character-code conversion tables), and for auxiliary databases such as informant details and recording dates/locations. The cells will be populated with annotated data samples, software requirements, data models, XML DTDs and stylesheets, application programming interfaces, implementations, software tools, best practice guidelines, and so on.
Table 1: Proposed Framework for a Repository of Data Models, Formats and Tools
| | Data Models & Archives | Data Creation | Data Creation | Data Access | Data Access |
| DATA TYPE | Store | Create | Convert | Display | Query |
| Metadata | | | | | |
| Word list | | | | | |
| Lexicon | | | | | |
| Annotated signal | | | | | |
| Writing system | | | | | |
| Interlinear text | | | | | |
| Paradigm | | | | | |
| Field notes | | | | | |
| Description | | | | | |
| Common | | | | | |
Any resource to be deposited in the repository would need to be placed in one of these cells. The index page of the implemented repository could be a table like this, with a link in each cell jumping to a page of relevant resources. We can use this repository framework to categorize the work of a developer by identifying which cell the result would go into. We can also look for cells with no contents to identify areas for future work.
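The repository framework described here can be modeled as a grid keyed by (data type, function) pairs. The Python sketch below is one possible reading of that idea: the `RepositoryFramework` class and the sample resource name are hypothetical, while the row and column labels are taken from Table 1.

```python
# Row and column labels follow Table 1; everything else is an assumption.
DATA_TYPES = [
    "Metadata", "Word list", "Lexicon", "Annotated signal", "Writing system",
    "Interlinear text", "Paradigm", "Field notes", "Description", "Common",
]
FUNCTIONS = ["Store", "Create", "Convert", "Display", "Query"]


class RepositoryFramework:
    """Two-dimensional index of resources by data type and function."""

    def __init__(self):
        # Every cell starts empty, so gaps are visible from the outset.
        self.cells = {(d, f): [] for d in DATA_TYPES for f in FUNCTIONS}

    def deposit(self, data_type, function, resource):
        """Place a resource in exactly one cell of the grid."""
        key = (data_type, function)
        if key not in self.cells:
            raise KeyError(f"no such cell: {key}")
        self.cells[key].append(resource)

    def empty_cells(self):
        """List cells with no contents, i.e. candidate areas for future work."""
        return [key for key, resources in self.cells.items() if not resources]


if __name__ == "__main__":
    framework = RepositoryFramework()
    framework.deposit("Lexicon", "Store", "lexicon.dtd")  # hypothetical file
    print(len(framework.empty_cells()), "cells still empty")
```

Rejecting deposits that name an unknown cell enforces the rule that every resource must be placed somewhere in the grid, and `empty_cells` gives the list of gaps that an agenda for subsequent work could start from.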
2.3. Objective 3: Constructing the Repository for Models, Formats and Tools
The workshop will collect existing resources and results and place them in the repository, and establish an agenda for subsequent work to create and collect the resources.