CALL FOR PARTICIPATION

Web-Based Language Documentation and Description

Philadelphia USA, 12-15 December 2000
http://www.ldc.upenn.edu/exploration/

Institute for Research in Cognitive Science
University of Pennsylvania

Organizers: Steven Bird (U Penn) and Gary Simons (SIL International)

This workshop will lay the foundation of an open, web-based infrastructure for collecting, storing and disseminating the primary materials which document and describe human languages, including wordlists, lexicons, annotated signals, interlinear texts, paradigms, field notes, and linguistic descriptions, as well as the metadata which indexes and classifies these materials. The infrastructure will support the modeling, creation, archiving and access of these materials, using centralized repositories of metadata, data, best practice guidelines, and open software tools.

1. Background

Recent years have witnessed dramatic advances in mass storage and web delivery technologies, making it possible to house virtually unlimited quantities of speech data online, and to disseminate this data over the web. The development of XML and Unicode greatly facilitates the interchange and reuse of structured multimodal and multilingual data and the development of interoperating software tools. These developments are having a pervasive influence on the way primary linguistic data are gathered, stored, analyzed and disseminated, as demonstrated by the initiatives surveyed on the linguistic exploration page (http://www.ldc.upenn.edu/exploration/), and the papers presented at the Linguistic Exploration Workshop at the Chicago LSA Meeting (http://www.ldc.upenn.edu/exploration/LSA/).

1.1 Challenges

With these new technological opportunities come concomitant needs and challenges for modeling, creating, archiving and accessing data:

I Data Models.
A diverse range of data types is required in language documentation and linguistic fieldwork, including word lists, lexicons, annotated signals, writing system documentation, interlinear texts, paradigms, field notes, and linguistic descriptions. We need flexible and general models for these data types (including links between them), and good ways to represent information which is partial, uncertain, evolving, or disputed. We need to develop a consensus in the community regarding best practice for modeling these kinds of data, to ensure maximal reusability of data and software.

II Data Archives. Whether it resides in the private collection of a single researcher or in a large, centralized repository, language data needs to be stored and reused. To support this, we need durable and open storage and interchange formats that embody the best practice consensus. We need to convert (parochial) 8-bit character codings to Unicode, using a general tool for character conversion along with a host of conversion tables for specific character sets. We also need to convert markup into the best practice formats we have defined. We need a mechanism to support durable citation of data, so that document authors do not need to duplicate all the data they reference just to be sure that the links will not break. More generally, we need a metadata standard for indexing the resources, regardless of format and availability, and a wide-coverage index conforming to the standard, so that someone interested in a particular language or region can find all the electronic resources that are pertinent to it, without having to determine how each of several different archives has named and classified its holdings.

III Data Creation. Now that mass storage is so inexpensive, researchers are creating large amounts of digital data covering the types listed above. Both the number and scale of these collection efforts are growing rapidly.
We need software tools supporting data creation, conforming with best practice, and covering primary collection of textual data (wordlists, texts) and recordings (audio, video, physiological), along with transcription and annotation of the primary materials conforming to a broad range of descriptive and analytical practices.

IV Data Access. Once data has been created and archived, it can be accessed in a variety of modes. A region of data is identified by browsing, by launching a query, or by following a reference. The selection is displayed according to appropriate conventions and styles, or converted into some other form (e.g. for statistical analysis and visualization). The selection may be corrected, imported into a document, analyzed, and annotated, leading to the creation of secondary data and/or the elicitation of new primary data. We need to develop suitable delivery mechanisms including stylesheets, conversion tools, indexing methods, and query languages, which encompass the needs for security and privacy. We need standard application programming interfaces and a library of reusable components, to support the development of software for new modes of access.

Many of the activities listed above are already underway; the lure of the technology is great despite the lack of infrastructure. However, it is beyond the capacity of any single individual or institution to develop this infrastructure of standards and tools on their own. There is a pressing need for close cooperation between these initiatives, so that scarce human, software and data resources are used optimally.

1.2. Basic Data Types

A diverse range of data types was listed in I above. These are explained below.

Metadata: a description of the contents of a data resource that can be used in the online catalog as an aid to finding the resource.

Word list: a list of wordforms in the language indexed by reference glosses (for example, a Swadesh word list).
Unlike a lexicon, a word list is indexed against a list of universal reference glosses, which provides a data structure for cross-linguistic comparison.

Lexicon: a listing of the lexical items in a language with descriptions of their phonological form, morphosyntactic function, and semantics.

Annotated signal: recorded digital signals (audio, video, physiological), or digital images (photograph, map, scanned text), that are annotated with descriptive and analytical information.

Writing system: a description of the writing system used to express text in the language.

Interlinear text: a primary text annotated with one or more of the following kinds of information: citation wordforms, morphologically analyzed wordforms, part-of-speech tags, glosses, and phrasal translations.

Paradigm: any kind of rational tabulation of forms, such as words or phrases demonstrating phonological, morphological, syntactic or semantic contrasts.

Field notes: the record of a linguist's observations, including fragments of all the other data types plus free commentary.

Linguistic description: any work of prose that describes some aspect of the language (for instance, a grammar sketch or a workpaper on phonology).

1.3 General Functions involving the Data Types

Whether in practice or in theory, data of these various kinds is subject to various general functions, as listed below.

Store: In order to store data, formats are established using XML DTDs (or Schemas) and documented with coding manuals. For each format, the starting point is an abstract data type, a logical model of a certain kind of data and the well-defined operations on that data. The formats have corresponding application programming interfaces.

Create: Existing and new tools for creating data support interchange using the best practice formats. Agreed application programming interfaces support tool reuse and integration.

Convert: Data is converted from its original format to the format prescribed as best practice (import).
Data is converted from the best practice format to other formats for analysis (export). Conversion encompasses both character encoding and markup.

Display: For the display of data, appropriate visualizations are established using stylesheets (especially XSL), or special-purpose programs (e.g. applets and standalone applications).

Query: Users query metadata to find resources, then they query the resources to access their content. This is accomplished with general-purpose XML and relational query languages, specialized forms-based interfaces, and standalone applications.

The relationship between these functions and the challenges listed in 1.1 is direct: the Store function corresponds to I Data Models and II Data Archives, while the Create and Convert functions correspond to III Data Creation, and the Display and Query functions correspond to IV Data Access.

3. Workshop Objectives

This workshop will lay the foundation of an open, web-based infrastructure for collecting, storing and disseminating the primary materials which document and describe human languages. The infrastructure will support the modeling, creation, archiving and access of these materials, using centralized repositories of metadata, data, best practice guidelines, and open software tools.

To meet this goal, we have identified three main objectives which can be substantially achieved at the present time:

Objective 1: to develop a comprehensive framework which identifies all the infrastructural needs, designates appropriate roles for existing results as pieces of an overall solution, and sets out a coordinated response to the remaining challenges.

Objective 2: to found centralized repositories (and nominate existing ones) for housing components of the infrastructure, so that data, tools, formats and standards can be collected, indexed, and made available to the community.
Objective 3: to begin construction of the repositories, by identifying the contribution of past and present activities by the participants and by other individuals and institutions, and by gathering the results and their documentation.

3.1. Objective 1: The Infrastructure Framework

The resources and expertise are spread across many institutions, and so a distributed approach is necessary. Such an approach can work so long as certain principal components of the infrastructure are centralized. Some such components are listed below.

(a) A set of standards for metadata and data formatting needs to be adopted by the community and managed at a centralized site that can be accessed by all archives, linguists, and software developers. Similarly, software that supports these standards needs to be deposited at a centralized site so as to prevent individual archives and researchers from having to reinvent it.

(b) Standardized metadata for every holding of each archive need to be deposited in a centralized catalog so that researchers the world over have only a single web site to consult in order to find out what is available in all the linguistic archives on the web.

(c) A standardized system for identifying and classifying the world's languages needs to be maintained at a centralized site so that resources about the same language are consistently identified as such in every archive that contains documentation for it.

These centralized functions need not all be at the same site, as long as the linguistic community knows where to go for each of them.

3.2. Objective 2: Defining the Repositories

Open, online repositories for the above infrastructural components need to be identified, and created where they do not already exist. At least the following three repositories will be required, corresponding to the components listed in 3.1 above.
(a) a repository for models, formats and tools
(b) a repository for metadata
(c) a repository for language classifications

For (a), we need to identify the kinds of data for which the community needs to establish guidelines for best practice; develop best practice guidelines for the markup of each data type along with corresponding DTDs; develop stylesheets, converters, and other software for each markup standard; and post all of these to a repository. Note that, while a single format might be nominated as best practice, other formats that are in use could still be listed and documented.

When mature, this repository could hold hundreds of resources; thus it is important to develop a means for organizing it. A possible organization is a two-dimensional "repository framework" based on the basic data types and general functions enumerated above. In the following table, the horizontal dimension ranges over the functions and the vertical dimension ranges over the datatypes. The bottom row, labeled 'Common', is for resources that are common to all data types (such as fonts or character-code conversion tables), and for auxiliary databases such as informant details and recording dates/locations. The cells will be populated with annotated data samples, software requirements, data models, XML DTDs and stylesheets, application programming interfaces, implementations, software tools, best practice guidelines, and so on.
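The two-dimensional framework just described can be pictured as a simple lookup structure. The following Python sketch is purely illustrative and is not part of the proposal: the class name RepositoryFramework and its methods deposit and empty_cells are hypothetical, and only the row and column labels are taken from Table 1. It shows how each resource would be filed under exactly one (data type, function) cell, and how cells with no contents could be listed.

```python
# Row and column labels as they appear in Table 1.
DATA_TYPES = [
    "Metadata", "Word list", "Lexicon", "Annotated signal",
    "Writing system", "Interlinear text", "Paradigm",
    "Field notes", "Description", "Common",
]
FUNCTIONS = ["Store", "Create", "Convert", "Display", "Query"]


class RepositoryFramework:
    """Hypothetical in-memory model of the two-dimensional framework."""

    def __init__(self):
        # Maps (data_type, function) to a list of resource descriptions.
        self.cells = {}

    def deposit(self, data_type, function, resource):
        """File a resource under exactly one cell of the grid."""
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type!r}")
        if function not in FUNCTIONS:
            raise ValueError(f"unknown function: {function!r}")
        self.cells.setdefault((data_type, function), []).append(resource)

    def empty_cells(self):
        """Return the cells that hold no resources yet, in table order."""
        return [
            (data_type, function)
            for data_type in DATA_TYPES
            for function in FUNCTIONS
            if not self.cells.get((data_type, function))
        ]


if __name__ == "__main__":
    framework = RepositoryFramework()
    # The resource description below is a made-up placeholder.
    framework.deposit("Lexicon", "Store", "example lexicon DTD")
    print(len(framework.empty_cells()), "cells still empty")
```

A flat mapping keyed by (data type, function) pairs is used here only because it mirrors the grid directly; an implemented repository could equally be organized as web pages or database tables.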
Table 1: Proposed Framework for a Repository of Data Models, Formats and Tools

                                     NEEDS
                 +----------+-------------------+-------------------+
                 | Data     |                   |                   |
                 | Models & |   Data Creation   |    Data Access    |
                 | Archives |                   |                   |
DATA TYPE        +----------+---------+---------+---------+---------+
                 | Store    | Create  | Convert | Display | Query   |
-----------------+----------+---------+---------+---------+---------+
Metadata         |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Word list        |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Lexicon          |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Annotated signal |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Writing system   |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Interlinear text |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Paradigm         |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Field notes      |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Description      |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+
Common           |          |         |         |         |         |
-----------------+----------+---------+---------+---------+---------+

Any resource to be deposited in the repository would need to be placed in one of these cells. The index page of the implemented repository could be a table like this, with a link in each cell jumping to a page of relevant resources. We can use this repository framework to categorize the work of a developer by identifying which cell the result would go into. We can also look for cells with no contents to identify areas for future work.

3.3. Objective 3: Constructing the Repository for Models, Formats and Tools

The workshop will collect existing resources and results and place them in the repository, and establish an agenda for subsequent work to create and collect the resources.

4. Call for Participation

The workshop will include paper presentations and working sessions to develop the infrastructure. Interested members of the community are invited to participate in the workshop. There is a limit on available places, and participants will be identified on the basis of submitted abstracts. Funding is available for authors of accepted papers.

Abstracts. One-page abstracts are invited which describe substantive contributions to the Repository of Models, Formats and Tools. Contributions would include annotated data samples, software requirements, data models, XML DTDs and stylesheets, application programming interfaces, software tools that work with open formats, best practice guidelines, and so on. Abstracts should identify which contributions could be made before the meeting (as material for group discussion), and which could be made after the meeting. Abstracts are also invited which discuss concrete problems for web-based language documentation and description, and describe possible solutions. (Authors whose work touches on distinct areas of the repository are encouraged to submit more than one abstract.)

Papers. Authors of accepted abstracts will be asked to prepare a 2,000-3,000 word paper plus associated materials (such as data samples and DTDs). These full-length papers are to be formatted as plain text or minimally marked up with HTML (with hyperlinks to the associated materials). Submissions should be emailed as MIME attachments (multi-file submissions should first be archived using tar or zip).
Address submissions to: Steven.Bird@ldc.upenn.edu, Gary_Simons@sil.org

TIMETABLE

Friday 1 September     Abstract deadline
Friday 29 September    Acceptance notification
Friday 24 November     Paper deadline
12-15 December         Workshop

PROCEEDINGS

The papers will be published in web and hardcopy form (the latter just for workshop attendees). Papers submitted in HTML should be written with the hardcopy version in mind, so a text string which anchors a hyperlink should be directly interpretable, rather than e.g. "visit this link".

VENUE

The workshop will be held at the Institute for Research in Cognitive Science (IRCS) at the University of Pennsylvania, in Philadelphia, USA. Workshop sessions will take place in IRCS conference rooms, located on the fourth floor of 3401 Walnut Street, adjacent to the university campus, which is two miles west of the city center. The main meeting rooms will be equipped with the usual presentation facilities, including projection and audio facilities.

SPONSORSHIP

IRCS and US-ISLE are sponsoring the workshop, so there will be no registration fee, and hotel accommodation will be covered for authors of accepted papers. Additional support for travel may also be available; please describe your needs when submitting the abstract.

USEFUL WEBSITES

http://www.ldc.upenn.edu/exploration/         Linguistic exploration
http://www.upenn.edu/philadelphia/            Philadelphia
http://www.upenn.edu/fm/map/dir.html          Getting to Penn
http://www.cis.upenn.edu/~ircs/               IRCS homepage
http://www.cis.upenn.edu/~ircs/driving.html   Finding IRCS
http://www.upenn.edu/fm/facts/bi/bi0416.html  Photo: 3401 Walnut St
http://www.ldc.upenn.edu/sb/isle.html         US-ISLE Project

FURTHER INFORMATION

From time to time, further information will be posted on the website and the mailing list.
To be sure of receiving this information and related announcements, please bookmark the linguistic exploration page (http://www.ldc.upenn.edu/exploration/) and subscribe to the LINGUISTIC-EXPLORATION mailing list, referenced from that page.

--
Steven Bird                   Gary Simons
University of Pennsylvania    SIL International
Steven.Bird@ldc.upenn.edu     Gary_Simons@sil.org
http://www.ldc.upenn.edu/sb   http://www.sil.org/SIL/roster/simons.htm