Introduction to the Linguistic Data Consortium
There is increasing interest in computer-based linguistic technologies, including speech recognition and understanding, optical and pen-based character recognition, text retrieval and understanding, machine translation, and the use of these methodologies in computer-assisted language acquisition. In each area, we have useful present-day systems and realistic expectations of progress.
However, because human language is so complex and information-rich, computer programs for processing it must be fed enormous amounts of varied linguistic data---speech, text, lexicons, and grammars---to be robust and effective. Such databases are expensive to create and document, and maintenance and distribution add further costs. Not even the largest companies can easily afford enough of this data to satisfy their research and development needs. Researchers at smaller companies and in universities risk being frozen out of the process almost entirely.
For pre-competitive research, shared resources also provide benefits that closely-held or proprietary resources do not. Shared resources permit replication of published results, support fair comparison of alternative algorithms or systems, and permit the research community to benefit from corrections and additions provided by individual users.
Until recently, most linguistic resources were not generally available for use by interested researchers. Because of concern for proprietary rights, or because of the additional burdens of electronic publication (which include preparation of a clean and well-documented copy, securing of clear legal rights and drafting of necessary legal agreements, and subsequent support), most of the linguistic databases prepared by individual researchers either have remained within a single laboratory, or have been given to some researchers but refused to others.
A few notable examples over the years have demonstrated the value of shared resources, but until recently, these have been the exceptions rather than the rule. For example, the Brown University text corpus has been used by many researchers, to the point of being adopted as a generally-available test corpus for evaluating statistical language models of English. The importance of shared data for evaluation of speech technology was shown by the TI-46 and TI DIGITS databases, produced at Texas Instruments in the early 1980s, and distributed by the National Institute of Standards and Technology (NIST) starting in 1982 and 1986 respectively. The U.S. Defense Department's Advanced Research Projects Agency (ARPA) began using a ``common task'' methodology in its speech research program in 1986, creating a series of shared databases for algorithm development and evaluation. This approach led to rapid progress in speech recognition, and has since been applied to research in message understanding, document retrieval, speech understanding, and machine translation.
Building on these successes, the Linguistic Data Consortium (LDC) was founded in 1992 to provide a new mechanism for large-scale development and widespread sharing of resources for research in linguistic technologies. Based at the University of Pennsylvania, the LDC is a broadly-based consortium that now includes more than 100 companies, universities, and government agencies. Since its foundation, the LDC has delivered data to 197 member institutions and 458 non-member institutions (excluding those who have received data as a non-member and later joined).
Much of this data is copyrighted by publishers, broadcasters and others, so its distribution by the LDC for purposes of research, development and education is covered by more than 50 separate IPR (Intellectual Property Rights) contracts between the University of Pennsylvania and data providers, and several thousand contracts between Penn and data recipients. Five years of data distribution and use without major IPR disturbances has created substantial credibility among publishers, broadcasters and researchers for this mode of operation.
An initial three-year grant from DARPA amplified the effect of contributions (both of money and of data) from this broad membership base, so that there was guaranteed to be far more data than any member could afford to produce individually. On-going funding from the National Science Foundation has permitted creation of a web-searchable on-line repository for all LDC data, and the development of new resources for several government-funded research projects. In addition to distributing previously-created databases, and managing the development of new ones, the LDC has helped researchers in several countries to publish and distribute databases that would not otherwise have been released.
The operations of the LDC are closely tied to the evolving needs of the research and development community that it supports. Since research opportunities will increasingly depend on access to the consortium's materials, membership fees have been set at affordable levels, and membership is open to research groups around the world.
As required by the terms of the original DARPA grant and the current NSF cooperative agreement, the core consortium operations of the LDC are now fully self-supporting. This includes maintaining the data archives, producing and distributing CD-ROMs, and arranging networked data distribution, as well as negotiating intellectual property agreements with potential information providers and with would-be members, maintaining relations with other groups around the world who gather and/or distribute linguistic data, hosting occasional workshops, and so forth. It also includes pre-publication processing of data donated by other groups, the production of small or inexpensive databases, and pilot work on larger projects in advance of other funding. It includes much of the planning and overseeing of specific databases funded by outside sources. Finally, it includes various forms of cost-sharing to make the production of databases funded by outside entities (whether governmental or commercial) more efficient.
Several parallel or related efforts are underway in Europe and the Far East. Productive relationships are being developed between the LDC and these activities, on the principle of open access across continental boundaries to the raw materials of technological progress. The road to the future of linguistic technology is (so to speak) paved with data, and the LDC will serve as part of an international highway system, providing a common infrastructure necessary for world-wide progress in research and development.
What do we mean by linguistic data?
The core of the problem is the complexity and richness of human language. There are many languages, each containing many words, which combine into messages in intricately restricted ways. Each word, in turn, corresponds to many sounds, depending on the surrounding words, on the speaker's age and sex and dialect, and on style, nuance and setting. It takes a lot of human experience to learn a language, and it takes a lot of data to ``teach'' one to a computer.
This is true even if we try to pre-digest the data, and ``teach'' the computer just a list of rules and exceptions---the list winds up being a very long one, and the most useful rules seem to be those that accurately reflect the rich structure of actual linguistic experience. Language and speech are usually ambiguous in ways that we never even notice. We talk and listen, read and write as if our speech and text were perfectly clear, oblivious to the intricate web of uncertainty that plagues a computer program trying to imitate our behavior. The most effective way to reduce this uncertainty---and the bizarre, inhuman errors it produces---is to furnish the computer with a great deal of information about what ordinary human language is usually like.
This process, which applies at every level of linguistic analysis, is easiest to explain in the case of words in text. For instance, both "last" and "lost" are adjectives in English, and thus could modify the noun "year", but "last year" occurs in news-wire text more than 300 times per million words, while "lost year", although perfectly well-formed and even sensible, is vanishingly unlikely. What is usually "lost" is "ground", "souls", "productivity", or "wages", while "ground", if not "lost", is likely to be "high".
These stereotypical connections are amusing, but there is a serious point: an optical character recognition (OCR) system, unsure whether a certain letter is "o" or "a", can very safely bet on "a" if the context is "l_st year", but on "o" if the context is "l_st souls". A more complete set of such expectations about local word sequences can greatly reduce the effective uncertainty of a letter in English text.
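The disambiguation strategy just described can be sketched in a few lines of Python. The bigram counts below are illustrative assumptions, not measured values from any corpus; in a real system they would be gathered from many millions of words of text.

```python
# Toy bigram counts (occurrences per million words) standing in for
# statistics gathered from a large news-text corpus.  The numbers are
# illustrative assumptions, not measured values.
bigram_counts = {
    ("last", "year"): 320.0, ("lost", "year"): 0.01,
    ("lost", "souls"): 4.0,  ("last", "souls"): 0.01,
}

def resolve(candidates, next_word):
    """Choose the candidate word that forms the most frequent bigram
    with the following word."""
    return max(candidates, key=lambda w: bigram_counts.get((w, next_word), 0.0))

print(resolve(["last", "lost"], "year"))   # last
print(resolve(["last", "lost"], "souls"))  # lost
```

An OCR system unsure between "o" and "a" in "l_st" would apply exactly this kind of comparison, scaled up to a full table of word-sequence expectations.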
Representing characters with 8-bit bytes allows 256 possibilities for each letter; in this case, we say our "perplexity" (in a technical sense) is 256. The ASCII coding standard, commonly used for English text, has 95 printing characters. Allowing for the differing frequencies of these different characters in ordinary text, independent of context, reduces our perplexity to about 32. In 1951, Claude Shannon estimated the true perplexity of English text at about 2.4 possibilities per character, based on an analysis of human guessing. Mathematical models based on predictions from the frequencies of three-word sequences (known as trigrams) are currently achieving perplexities fairly close to Shannon's estimate, around 3.4 possibilities per character. Using such models, we can guess characters in English text correctly about one time in three, as opposed to one time in 95 if we guess at random among the printable ASCII characters.
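Perplexity in this technical sense is just two raised to the Shannon entropy of the character distribution. A minimal sketch, using a made-up four-character distribution alongside the uniform case over the 95 printable ASCII characters:

```python
import math

def perplexity(probs):
    """Perplexity = 2 ** H, where H is the Shannon entropy in bits
    of the given probability distribution."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** h

# Uniform guessing over the 95 printable ASCII characters:
print(round(perplexity([1 / 95] * 95)))  # 95

# A skewed (toy) distribution is easier to guess, so its perplexity
# is lower: here H = 1.75 bits, so perplexity = 2 ** 1.75, about 3.36.
toy = [0.5, 0.25, 0.125, 0.125]
print(perplexity(toy))
```

Context-independent character frequencies, trigram models, and Shannon's human-guessing estimate differ only in how the probabilities fed to this formula are obtained.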
The effect of this reduction in uncertainty is to make a recognition task---such as OCR---many times easier, with a correspondingly large improvement in system performance. The same technique---improving performance by reducing uncertainty about word sequences---plays a crucial role in most speech recognition applications, and can also be used to store text data in a minimum amount of space, or to transmit it in a minimum amount of time. This is one of the simplest examples of the value of linguistic data in improving the performance of linguistic technologies.
A model of this kind needs tens or even hundreds of millions of words of text to derive useful estimates of the likelihoods of various word sequences, and its performance will continue to improve as its training set grows to include billions of words. To put these numbers in perspective, consider that a typical novel contains about a hundred thousand words, so that we are talking about the equivalent of hundreds or even thousands of novels. It is not easy, even today, to obtain this much text in computer-readable form.
In addition, different sorts of text have different statistical properties---a model trained on the Wall Street Journal will not do a very good job on a radiologist's dictation, a computer repair manual, or a pilot's requests for weather updates. The sequences input to a pen-based or speech-based interactive system when a user is entering a business letter will be quite different from those that a desk calculator or a spreadsheet sees. This variation according to style, topic and application means that different applications benefit from models based on appropriately different data---thus there is a need for large amounts of material in a variety of styles on a variety of topics---and for research on how best to adapt such models to a new domain with as little new data as possible.
This same topic-dependent variation can be used to good advantage in full-text information retrieval, since words and phrases that occur unusually often in a document tell us a lot about its content. Thus the ten most unexpectedly-frequent words in a book entitled "College: the Undergraduate Experience" are "undergraduate, faculty, campus, student, college, academic, curriculum, freshman, classroom, professor"; we are not surprised to learn that "quilt, pie, barn, farm, mamma, chuck, quilting, tractor, deacon, schmaltz" are the ten most unexpectedly-frequent words in a novel with a rural setting, or that "dividend, portfolio, fund, bond, investment, yield, maturity, invest, volatility, liquidity" characterize "Investing for Safety's Sake", while "The Art of Loving" yields "motherly, separateness, love, fatherly, paradoxical, brotherly, faith, unselfishness, erotic, oneself."
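One simple way to find such ``unexpectedly frequent'' words is to compare each word's rate in the document against its rate in a background corpus. The sketch below uses a plain relative-frequency ratio; the exact statistic behind the word lists above is not specified, and the background rates and document here are toy assumptions.

```python
from collections import Counter

def surprising_words(doc_tokens, background_freq, n=10):
    """Rank a document's words by how much more frequent they are in the
    document than in a background corpus.  A simple relative-frequency
    ratio is used here; unseen words get a small default background rate."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    def ratio(word):
        return (counts[word] / total) / background_freq.get(word, 1e-6)
    return sorted(counts, key=ratio, reverse=True)[:n]

# Toy example: assumed background rates and a tiny "document".
background = {"the": 0.07, "portfolio": 0.00001, "fund": 0.00002}
doc = ["the", "portfolio", "fund", "the", "portfolio"]
print(surprising_words(doc, background, n=2))  # ['portfolio', 'fund']
```

Common function words like "the" score low because their document rate barely exceeds their background rate, while topical words rise to the top.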
Of course, there is much more to text structure than just counts of words or word sequences. For instance, in analyzing the way that words go together into sentences, we can take account of the typical connection between verbs and their subjects, objects, instruments, and so on. Thus if we ask (based on a few million words of Associated Press news-wire text) what verbs have a special affinity for the noun "telephone" as their object, the top of the list is "sit by, disconnect, answer, hang up, tap, pick up, be by". Such ``affinity measures'' among words in phrases can be used to help resolve the otherwise-ubiquitous ambiguities in analysis of text structure, so that "he sat for an hour by the hall telephone" is suitably differentiated from "he sat for a portrait by the school photographer."
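One standard way to quantify such an affinity is pointwise mutual information (PMI), which measures how much more often a verb-object pair occurs than chance would predict. The text does not name the exact statistic used, and the counts below are assumed for illustration, not taken from the AP corpus:

```python
import math

def pmi(pair_count, verb_count, noun_count, total):
    """Pointwise mutual information of a verb-object pair:
    log2( P(pair) / (P(verb) * P(noun)) )."""
    return math.log2((pair_count / total) /
                     ((verb_count / total) * (noun_count / total)))

# Toy counts: a strong collocation versus a pair occurring at chance rate.
strong = pmi(pair_count=50, verb_count=1_000, noun_count=200, total=1_000_000)
weak   = pmi(pair_count=1,  verb_count=5_000, noun_count=200, total=1_000_000)
print(strong > weak)  # True
```

A pair occurring exactly at chance rate gets a PMI of zero, so positive scores flag genuine affinities like "answer ... telephone".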
Text, diverse and ambiguous as it is, is simple and straightforward compared to the universe of speech. Here the comfortable simplicity of the alphabet is replaced by continually-varying, limitlessly-varied sounds, modulated by processes belonging to physics, physiology, and sociology alike. We have a long way to go to reach entirely adequate models of human speech, but the foundation of all progress so far has been the careful fitting of appropriate models to large amounts of data. Through this process, the model is ``trained'' to incorporate as much of the language's sound patterns as the model's structure and the amount and quality of data permit. For the process to work well, the training data must reflect the expected properties of the task, so that words (or the sounds that make them up) must be pronounced by enough different kinds of people in enough different kinds of messages to sample the real variability of the final application.
Properly designed, such models can then be used to recognize, synthesize or encode speech, and their performance can be evaluated quantitatively on more data of the same sort they were trained on. Improvements in performance can come in three ways: better models, better fitting techniques, or more data. Usually, experiments with new models and new fitting techniques also require new data to be carried out properly. Thus to a large extent the pace of progress in speech technology, especially in the area of speech recognition, has been determined by the rate at which new speech data has become available.
Common sense and experience alike attest to the benefits of data-driven research in linguistic technology: it gives us the basis for modeling some of the rich and intricate patterns of human speech and language, by brute force if no better way is devised; it permits quantitative evaluation of alternative approaches; and it focuses our attention on problems that matter to performance, instead of on problems that intrigue us for their own sake.
In discussions about future applications of linguistic technology, it sometimes seems as if the aim is a wholesale replacement of the keyboards and mice of current interactive computers by means of speech or handwritten input, and perhaps the replacement of display screens by means of voice output as well. Although there are doubtless many circumstances in which spoken conversation with a computer is the right solution, and many other circumstances in which pen-based input is the right approach, there are still keyboards and (especially) screens in any future that we can see clearly today. Understanding the role of linguistic technology in the future of our society requires consideration of some larger issues.
We humans spend much of our lives speaking and listening, reading and writing. Computers, which are more and more central to our society, are already mediating an increasing proportion of our spoken and written communication---in the telephone switching and transmission system, in electronic mail, in word processing and electronic publishing, in full-text information retrieval and computer bulletin boards, and so on.
Because information storage and processing is getting cheaper by a factor of a thousand or so every decade, we know that computer technology will continue to penetrate and reshape society for some time to come, as today's expensive laboratory curiosities become tomorrow's electronic commodities.
These trends create an enormous economic and social opportunity for natural language and speech technology. Where computers are already involved in creating, transmitting, storing, searching, or reproducing speech and text, we have the chance, at little marginal cost, to add new features that improve the quality of the process or that increase the productivity of the human labor involved. Simple examples of this kind include the use of spelling correctors in word processing; the use of speech technology to reduce the workload of telephone attendants, by screening calls with voice recognition or providing information through voice synthesis; and the use of machine-aided translation (MAT) programs, which make human translators more productive by providing a rough draft to be corrected. In other cases, linguistic technology improves on other solutions---for instance, hands-free voice control of industrial inspection stations is often faster and more accurate than alternative methods.
It is easy to project future extensions of such linguistic technology, such as the human-like voice communications of computers in "2001" and "Star Trek"; but predicting the details of life even twenty years from now raises questions about the development and social acceptance of many complex technologies. What will be the roles of input/output methods such as voice, video, keyboard, mouse, pen-based systems, direct interpretation of gesture and gaze? To what extent will information flow as full text, as fielded records, as sound, as images? How will computer networking, telephone systems and cable TV divide up the world of information transmission? In what directions will wireless telecommunications develop? We don't know for sure---we probably don't even know how to pose the questions in the right way.
Thus speculating today about the interactive computer of 2012 may be rather like a 1970 discussion of the keypunch technology of 1990. We do know a few things about the world of 2012, however: people will definitely still like to talk, and computers will probably still be getting cheaper. Therefore, human communication will still be heavily based on speech and text, and computers will be involved in increasingly sophisticated ways. This guarantees that any basic improvements in linguistic technology will find important social and economic roles.
The problems of collecting, processing and annotating the needed quantities of linguistic data are too large for any one company. In any case, it is inefficient to duplicate the large up-front investment in foundations of research whose commercial implications will not blossom fully for ten years or so (even though the first applications have emerged already). Furthermore, university researchers and small companies, traditionally among the most important sources of technical innovation, may be frozen out entirely.
Thus a consortium is the right way to harness both economies of scale and individual creativity. We get economies of scale by avoiding duplication of effort on fundamentals, by doing the job in a way that serves the broadest range of needs, and by re-using data in a variety of technical areas. We improve the opportunities for individual creativity by making fundamental resources and tools available to researchers in academia and smaller companies.
Such an organization is also well adapted to recent international developments in the structure of language technology research. Over the past half-dozen years, we have seen the development of a new management paradigm for pre-competitive development of language technology. This new paradigm has taken somewhat different forms in different countries. In Japan, new task-oriented laboratories have been founded, notably the ATR Interpreting Telephony Research Laboratories in Kyoto, at which researchers from many companies work together on a common project. The German Verbmobil project is defining a common speech-to-speech translation task. Its goal of combining speech and language technologies in a negotiation dialogue brings together some of the most prominent universities and industrial labs in Germany (and potentially around the world).
In the USA, the ARPA Human Language Technology (HLT) program has made systematic and effective use of a ``common task'' method. This approach begins each project by specifying a task, defining a formal, quantitative evaluation metric, and developing a large common database for training and testing purposes. Then each participant pursues solutions in an individual way, and all participants meet periodically to compare methods and results (including evaluation scores). Used since 1986, this technique has resulted in rapid performance improvements in several areas. For example, speech recognition word error rate has been cut in half every two years for the past six years. Similarly, the performance of (text) message understanding and retrieval systems, measured in terms of metrics such as precision and recall, has improved at a rate of 20% to 50% per year. Common tasks have also been an effective method to engender cooperation and the productive exchange of ideas and techniques.
Such common-task techniques, used in different ways in the three countries cited, are effective because: