Language Resources for Multilingual and Multimodal Interactive Systems

 
Mark Liberman
University of Pennsylvania
myl@ldc.upenn.edu
Ron Cole
Oregon Graduate Institute
cole@cse.ogi.edu
 

Summary

Importance of the Problem

A growing number of Americans work routinely with networked access to a growing world of digital text, speech and even video, as communication in business, education and research is increasingly mediated by networked computers. This wired world is fertile ground for the deployment of language technologies, and as a result, many applications have moved from the laboratory into mass-market usage. Examples include full-text information retrieval, speech recognition and generation, optical and on-line character recognition, and spelling and grammar checking.

These developments highlight the fact that the fundamental capabilities of language technology are still inadequate. Applications that require automatic document understanding, automatic translation, or spoken human-computer dialogue are now possible only in limited cases. Despite considerable recent progress in the underlying technologies, continued improvement depends on continued fundamental work, for which government support remains essential.

The growth of networked computing also poses several new problems for language technology. One important challenge is providing universal access. For many Americans, for instance those unable to see or to type, Internet access requires more than just a computer and an ISP. Similar issues arise in providing educational access for children (and adults) who have not yet learned to read and write. A second challenge is dealing efficiently with enormous and evolving masses of spoken and written documents, arising from a wide variety of often-unpredictable sources, and increasingly likely to be in multiple languages. Solutions must deploy and coordinate a wide range of advanced language technologies.

In all areas of language technology, research and development are dependent on basic resources such as text and speech corpora, lexicons, and language-processing tools. Such language resources are a powerful driving force and an essential enabler for researchers in institutions of all types: universities, government laboratories, large industrial laboratories, and small start-up enterprises. Research to provide new basic capabilities in language technology will require new basic resources, and government support for planning, creating and distributing such resources is crucial.
 

Workshop on Language Resources

On August 16, 1997, a workshop was convened to discuss the resources needed, and models for their development and distribution. This workshop took place at Stevenson, WA, just before the 1997 NSF Interactive Systems Grantees' Workshop. Participants represented diverse areas of language research and development at U.S. universities and industrial laboratories, and included representatives of interested U.S. government agencies.

The workshop had a three-part structure. First, participants shared information about the current situation, by means of an on-line Language Resources Primer prepared by the organizers, and through brief presentations on the current models for design, creation and distribution of resources.

Second, participants discussed three significant problem areas and the resources needed to stimulate and evaluate research advances in each. The three areas were multimodal machine translation, multimodal document understanding, and universal access. The aim was to explore some representative problems carefully and in concrete terms, both as a basis for general conclusions and in order to address some key specific issues.

Finally, a general discussion identified recurrent themes, common problems and recommendations for action.
 

Conclusions of the Workshop

New government initiatives are needed to create the language resources that will support the development of the next generation of language technologies. The workshop identified three key areas where new language resources are needed:
  1. Universal access.  Researchers need speech from wider populations (including school children and less-educated adults), multimedia archives, and corpora of new kinds of computer-mediated communication and human-computer interaction. New forms of networked outreach, valuable in themselves, have great promise for meeting these needs.
  2. Multilingual resources. Communication in business, education and research is increasingly international, and a significant portion of our own citizens do not speak English well. Improved tools for international communication, and effective access to the NII for all Americans, both require improved technology to support written and spoken interaction in languages other than English. New invention in this area is crucially dependent on obtaining new shared multilingual resources, including monolingual text and speech in languages other than English, and parallel text corpora.
  3. Generation and synthesis of spoken language. Resources to support research and development in this area are relatively inexpensive but much needed. The naturalness and expressivity of existing synthesis technology is still not good enough to support effective naturalistic dialogue.
Three key points emerged about the ways and means of creating and distributing these resources:
  1. Intellectual property rights. In making electronic information more accessible, the Internet also makes electronic rights more valuable. Audio and video materials are even more vigorously protected than text. Networked distribution of scientific data, with possible use in commercial products, raises new issues of informed consent for research subjects. For all these reasons, the language research community needs effective strategies and tactics for dealing with IPR issues.
  2. New modes of outreach. Language technology researchers are vastly outnumbered by researchers in other scholarly disciplines involved with language resources: linguists, anthropologists, sociologists, psychologists, and so on. Teachers and students across language-related subjects vastly outnumber language researchers of all types. New channels of communication and resource sharing could take advantage of this large reservoir of talent and energy in creating resources of value to all concerned, while also spreading ideas and techniques widely useful in research and education.
  3. Improved infrastructure for data sharing. Most government-sponsored language resources never leave the lab in which they were collected or created, except in the minimal form of summary tables in technical articles and books. Modern networked computers make sharing such resources both easier and more valuable, while advice, along with technical and organizational support, is available from sources such as NIST and the LDC. Diverse approaches to data sharing -- technical, organizational and legal -- need to be better publicized, and pressure from sponsors to publish or otherwise share useful language resources should be broadened.

Background

What are language resources?

Here is a definition from the recent call for the First International Conference On Language Resources And Evaluation.
The term language resources (LR) refers to sets of language data and descriptions in machine readable form, used specifically for building, improving or evaluating natural language and speech algorithms or systems, and in general, as core resources for the software localization and language services industries, for language studies, electronic publishing, international transactions, subject-area specialists and end users. Examples of linguistic resources are written and spoken corpora, computational lexicons, grammars, terminology databases, basic software tools for the acquisition, preparation, collection, management, customization and use of these and other resources.

The relevance of evaluation in Language Engineering is increasingly recognized. This involves assessment of the state-of-the-art for a given technology, measuring the progress achieved within a program, comparing different approaches to a given problem and choosing the best solution, knowing its advantages and drawbacks, assessment of the availability of technologies for a given application, and finally product benchmarking. It accompanies research and development in Human Language Technologies, and has driven important advances in the recent past in various aspects of both written and spoken language processing. Although the evaluation paradigm has been studied and used in large national and international programs, including the US ARPA HLT program, EU Language Engineering projects, the Francophone Aupelf-Uref program and others, particularly in the localization industry (LISA and LRC), it is still subject to substantial unresolved basic research problems.

We support this broad definition of language resources, which encompasses all corpora, lexicons, tools and standards that support research on human language, the development and evaluation of human language technologies, and language education.

Moreover, we support a broad definition of human language technologies, which includes the range of technologies needed to enable interactive systems using natural communication skills. Examples of the diverse natural communication skills that might be used in interactive systems are lip reading, facial movements and expressions, gestures and "body language," handwriting, American Sign Language, and Braille.

Why are shared language resources needed?

Every language teacher knows that to learn to use a language well, students must see and hear a lot of authentic examples, in addition to what they get from dictionaries and grammatical exercises. Over the past couple of decades, scientists and engineers have learned the same lesson. To quote the white paper that led to the establishment of the Linguistic Data Consortium in 1992:
[B]ecause human language is so complex and information-rich, computer programs for processing it must be fed enormous amounts of varied linguistic data---speech, text, lexicons, and grammars---to be robust and effective. Such databases are expensive to create and document, with maintenance and distribution adding additional costs. Not even the largest companies can easily afford enough of this data to satisfy their research and development needs. Researchers at smaller companies and in universities risk being frozen out of the process almost entirely.
In addition, shared resources permit the systematic comparison of alternative approaches by a community of researchers.
For pre-competitive research, shared resources also provide benefits that closely-held or proprietary resources do not. Shared resources permit replication of published results, support fair comparison of alternative algorithms or systems, and permit the research community to benefit from corrections and additions provided by individual users.

Where are we today?

As understanding of the need has developed, the set of generally-available resources for "building, improving or evaluating natural language and speech algorithms or systems" has grown dramatically.

In 1986, the only generally-available text corpus was the small (one million words) Brown corpus of American English. There were no generally-available speech corpora, nor were there any generally-available lexicons for computer use.

After ten years of effort to create shared resources, we have billions of words of available text in English, and significant quantities in about fifteen other languages; we have almost two thousand hours of transcribed speech; we have lexicons such as WordNet, Comlex Syntax, and various pronouncing dictionaries; and we have new types of resources, such as task-oriented dialogues, corpora annotated at multiple levels of linguistic analysis, and large multilingual multimedia "document" collections.

The same period of time has seen an extraordinary flowering of language-related technology. Applications such as full-text information retrieval, speech recognition and generation, optical and on-line character recognition, and spelling and grammar checking, have moved from the laboratory into mass-market usage. On-going research and development in these areas continues to be fueled by the increasing availability of basic language resources.

The language research community has good reason to be proud of how far it has come, and should maintain the momentum of its successful enterprise into the future. However, a new set of challenges has arisen, along with the possibility of some new approaches to meet them.

What's new?

The wired world

During the past few years, the Internet reached critical mass and began to grow explosively, fundamentally changing the way that Americans work, communicate and play. Its impact on the structure of world society is likely to be greater than any invention since the telephone.

These changes have a profound impact on our national agenda. Universal access to the Internet by citizens to learn, communicate, create and publish has become a national priority. The growing internationalization of trade creates an urgent need for access to multilingual information. The armed forces are eager to harness the power of networked computing to improve planning and to cut through the "fog of war" on the battlefield. Businesses want to use analogous technology in their own operations.

New tasks for language technology

Human language technologies play a vital role in each of these areas. However, the wired world makes new demands on these technologies, demands that sometimes shift the focus to new capabilities.

For instance, exponentially-increasing amounts of multilingual text and speech are available on line. This creates a new need for "browsing quality" language translation, perhaps combined with "good enough" speech recognition. This is an old idea, but one for which there has never been very much of a market, since both the amount of on-line material and the population of potential readers were small. The requirements for language technology in this application are quite different from the requirements for translation technology that aims to replace human translators or increase their productivity.

As another example, spoken dialogue systems can play an important part in extending Internet access to the millions of Americans who are visually impaired. However, the requirements of such an application are quite different from the requirements of traditional voice-mediated transaction systems, which are designed for a broader population of users but a much narrower domain of application.

There are exciting opportunities for applications of networked language technology in primary education. However, recognition of young children's voices poses special problems, and models of interaction must be adapted to the needs and capabilities of children at various levels.

As a final example, on-line collaborative work creates a mass of varied and variously-linked electronic documents: email, voice mail, images, spreadsheets, and so forth. Keeping track of these is often a daunting task for the participants. Making sense of it a year or two later can be a nightmare. This problem has connections with traditional issues of dialogue understanding and document summarization, but it frames the problems in a new way, and it also poses some entirely new problems.

Thus the growth of the Internet redefines most of the tasks of language technology, and at the same time, makes them more urgent, since the realistic population of potential users for a wide range of new language technologies is now tens and even hundreds of millions. We need new language resources to support research and development for these new needs, and we need them quickly -- on "Internet time."

Luckily, the same technological advances that created these needs can also help provide solutions. Both researchers and publishers can create and distribute language resources in new and more efficient ways, and communities of potential users can also be involved through new types of outreach.

New IPR Issues

In order to realize the potential of these new methods, we will have to develop creative solutions to new problems of intellectual property rights and informed consent.

As the Internet makes information more accessible, it also makes it more valuable. The enormous amounts of text, audio and video "lying around out there" nearly all belong to someone, according to U.S. and international intellectual property law. The fact that it is easy to collect large archives from the Net does not make it legal to give the resulting collection to someone else, any more than the ease of taping radio or TV broadcasts makes it legal to copy and distribute the tapes. The legal boundaries of permitted downloading, copying or even linking of Internet-accessible material remain an open question, but it is quite clear that wholesale collection and redistribution without license from the owner(s) is a violation of copyright law. Many companies and individuals are actively looking for electronic violators of their intellectual property, and a new industry of "net cops" is springing up to help them.

Since language technology has an insatiable appetite for linguistic data -- "the best data is more data," as the saying goes -- there is an inevitable collision between the needs of researchers and the rights of authors, publishers, broadcasters and other information providers.

There are several possible approaches to handling this problem:

  1. Individual research groups obtain special licenses for themselves, either for a fee, as a result of technical collaboration or business partnership with information providers, or as a gesture of goodwill. Many large companies have followed this path, as well as a few universities.
  2. Someone collects a corpus and obtains the right to sublicense it to researchers, under restrictions designed to protect the rights of the IPR owners, while not hampering research use excessively. Examples include the Brown corpus, the ACL/DCI corpus, and many collections carried out by the Linguistic Data Consortium (LDC).
  3. Material is collected and distributed that is free of IPR restrictions, especially from government sources.
  4. Material is collected and distributed for research purposes without paying any attention to IPR questions. This has happened several times in the recent past. Especially if any significant amount of distribution occurs, it generally is detected and stopped by threat of legal action.
The first approach -- proprietary collections -- is not relevant here, since we are talking about shared resources.

The second approach -- collection with sublicensing -- is the only way to create shared resources of many types, and therefore must be pursued. Creation and distribution of 'sublicensing kits', including sample legal agreements, might make it easier for this to be done more widely.

The third approach -- collection of unrestricted material, whether in the public domain or not -- needs to be pursued more aggressively, especially in the area of U.S. government materials. The vast majority of government-owned materials, including many useful and even crucial things, are not now available to language researchers. More aggressive action should be taken to gain access to useful parts of this vast and growing store.

The last approach -- casual collection of broadcast or web-accessible material without licensing provisions -- is a dangerous temptation. It is easy to do, but it inevitably leads to a situation where money and research effort are tied up in a collection whose legal status is contested, and whose distribution is blocked by the threat of legal action.

Informed consent: protecting speakers and developers

The new situation also creates some new issues for informed consent of participants and subsequent protection of their rights, without unnecessary constraint on commercial deployment of new speech technologies.

It is easy to imagine a situation in which a school child, recruited as part of a project to document the speech patterns of first graders, while perhaps also teaching them about speech technology and providing new multimedia computers to their classroom, is surprised and upset to discover that her (strikingly cute) voice and face have been used as one of the optional personas for a new speech synthesis system distributed by a university in Belgium, and subsequently adopted for advertising purposes all over the net.

On the other side, we can imagine a situation in which the introduction of a new speech recognition system is hung up because one of the 10,000 speakers whose voices were used to train its acoustic models decides that (s)he is entitled to royalties, and there is no consent form on file to show that (s)he is not.
 

What should we do?

Stay on course

Recent research in fundamental areas of language technology has accomplished a great deal, but the basic problems are by no means completely solved. For instance, speech recognition technology is now good enough for many applications, but it is still much less robust and noise-resistant than human listeners are. Text analysis systems can get useful information from unrestricted text with accuracies far beyond anything that could have been done just a few years ago, but they have a very long way to go.

Fundamental work on improving such technologies, which are basic components of tomorrow's applications as well as today's, needs to go forward, and the road to progress remains paved with linguistic data.

Create resources for urgent new tasks

At the same time, the changing world poses new challenges and opportunities. Discussion at the workshop highlighted three cases. While they do not constitute a complete list, they are important, and provide specific examples of the kinds of resources that are now needed.

Universal Access

It is crucial and urgent to develop or adapt language technology to support universal access to the Internet.

Interactive systems can provide all people with access to the NII using natural communication skills---including people who cannot read or see, people who cannot hear or talk, and people who cannot move their hands or arms, among others.  These applications will require new resources for improved text generation and speech synthesis; better corpora of human-human information-access dialogues; new corpora supporting research in recognition and generation of facial cues and gestures during dialogues; and new corpora of speech-language-interaction data from groups that have been unrepresented in previous corpus collections.

Children in the early primary grades are a particularly important group to reach, to support learning of language skills and reading as well as education in other areas. Speech is a much more natural mode for them than text is, but current speech recognizers have trouble with young children's voices. A larger database of children's spoken interactions is needed, to support research in modeling children's interaction styles as well as for training recognizers to perform better on children's speech.

There is an important need to better understand human/human and human/computer interaction across many task domains and many types of people, so that interactive systems can be created that capture the essential components of effective linguistic communication. In addition to text- and voice-based interaction, the role of facial cues and gestures also needs to be studied.

Additional effort must be made to understand and accommodate the needs of people who do not have the resources, skills or abilities to participate in the information age. This includes disabled people, disadvantaged people, young people, old people, and people who do not speak English.

Multimodal machine translation

Applications include English-language browsing of masses of multilingual documents, and computer-mediated multilingual communication (especially via the Internet). The growth of the Internet and the increasing internationalization of business create a need (and thus a market) for such systems. Government needs for dealing with documents in many languages are also increasing. The existing machine translation technology is not good enough, and also costs too much to port to new languages.

Research advances in this area will require new parallel text corpora; additional multilingual speech and text data in large quantities; multilingual lexical resources, including lists of fixed expressions, proper names, and idioms; multilingual text annotated for word-sense and phrase structure; and resources for high-quality generation and multilingual speech synthesis.

Multimodal document understanding

We are using "document" here in an extended sense, that would cover (for instance) a video record of a meeting, or an archive of news broadcasts, or the electronic form of a conference proceedings volume.

Sample applications include extraction of reliable summary information from masses of on-line documents; the computer as moderator and facilitator of networked collaboration, using media such as email and "chat" links; and meeting summarization based on video and audio monitoring.

Again, the growth of the Internet creates an enormous need for such capabilities, in order to make use of the growing masses of diverse material on line.

Research advances in this area will require large new corpora in areas such as topic detection and tracking; new corpora of multimodal human-human interaction, including realistic examples of collaborative work; and also the development of standards for useful coding of such corpora, and their deployment in annotated corpora.
 

New modes of outreach

These important tasks are daunting. It has been difficult and expensive enough to deal with limited types of text, and limited types of audio-only interactions among limited classes of people. How can we possibly take the needed next steps?

Two new types of outreach promise to help with this dilemma.

First, there are several scientific communities that have been largely excluded from language technology research but have much to offer, such as psychology, sociology, linguistics and anthropology. Researchers in these areas -- more numerous by far than language technology researchers -- have long relied on their own (often large) bodies of interviews, experimental interactions, and so on. New channels of communication and resource sharing could take advantage of the accumulated knowledge, diverse viewpoints, and unique scientific methodologies in these areas, while also making new ideas and new tools available to them.

Second, the Internet provides new opportunities for distributed, collaborative data collection, data annotation and data distribution. Along with smarter tools, these capabilities ought to make it possible to drastically lower the cost of many types of data collection. There are many difficult problems to be solved before we can be confident that we will really get more useful data for less money this way, but the potential rewards are great, and experiments of this type should be a high priority.

Improved legal and technical infrastructure for data sharing

A number of innovations would make it easier for researchers to share their language resources.

Licensing kits would help researchers set up appropriate terms for distribution, protecting the rights of their subjects, any external information providers, and any rights they themselves wish to reserve. Guidance is also needed to help researchers negotiate license agreements with external information providers, in cases where centralized negotiation is not appropriate. Kits for resource publication might also provide technical information, advice and assistance in database organization, file formats, and other standards. Some things of this type are already available, but they are not widely publicized, and also need improvement.

Improved standards, especially for overall database organization, would be helpful. Easily-available tools for language resource creation and access are crucial, along with good documentation and tutorial material for training and guidance in design and implementation.

It remains true that most government-sponsored language resources never leave the lab in which they were collected or created, except in the minimal form of summary tables in technical articles and books. With better support for overcoming the barriers to resource sharing, sponsors' pressure to share will be more effective. With better support for creation and utilization of language resources by interested researchers, teachers and students, the pool of potentially sharable resources will become much larger.


Appendices

These are provided in the form of hyperlinks.
  1. The Language Resources Primer prepared as background material for the workshop.
  2. Motivation and Goals: the task definition for the workshop participants.
  3. List of workshop participants.