University of Pennsylvania, Philadelphia, September 6-7, 2012
A little more than two decades ago, progress in human language technology (HLT) research and development was hampered by the dearth of available language resources. The application of machine learning to several HLT problems had shown promise, but such approaches were data hungry. To the extent that relevant corpora existed, they were typically privately held and generally unknown except to the closest colleagues of the creators. Those who were willing to share their data had to contend with the lack of infrastructure to support their philanthropy.
In 1992, the Linguistic Data Consortium (LDC) was founded to fill the need for a repository and distribution point for large data sets used in HLT R&D. The idea of a cooperative, membership-based consortium of academic, research, government and commercial organizations sharing language resources was enthusiastically supported by potential hosts and members alike. While there were many suggestions in those early days about what such a consortium could accomplish, it was generally agreed that its principal objective would be to facilitate ways in which large-scale data collection efforts could be used more effectively in support of realistic applications.
LDC addressed that challenge in several ways: by launching the first comprehensive language resource catalog, which currently contains over 500 holdings in over 60 languages in a variety of genres; by supporting common task technology evaluations with custom datasets; by establishing and maintaining long-term data collections of text, speech and video across genres; by developing key resources for the community such as lexicons and large text collections; by commencing its annotation operations and the associated development of tools, specifications, guidelines and best practices in and across programs; and by consistently providing a high level of service to its members. Since this early success, the field has witnessed the establishment of other consortia and networks with similar goals as well as those focusing on individual languages and technology programs.
Twenty years later, the landscape looks very different. Computer technologies have grown from being specialist equipment to being commodities to becoming components of appliances and similar devices. Storage, processing and display capabilities that were almost unimaginable in the early 90s are now widespread. Technologies mediate a large and increasing percentage not only of business transactions but also of social interactions. Crowd-sourcing, micro-marketing and cloud computing are destroying the last of the traditional workforce partitions. The number of human languages present on the Internet continues to grow. HLTs such as speech-to-text, text-to-speech and machine translation are now salient in the market. Open source toolkits enable further development of HLTs for new languages and data challenges.
Some have taken these changes to suggest that data accessibility is no longer a challenge that requires shared infrastructure but can be addressed through individual, independent efforts. But is that true? And if it is, what role, if any, do language data centers play in this, their third decade?
To commemorate the 20th anniversary of the LDC and the beginning of language data centers, this workshop addresses the future of language resources. However, in order to move forward with deliberation it is also necessary to understand the history of the field and the motivations for the paths it has taken.
Workshop themes include: the developments in human language technologies and associated resources that have brought us to our current state; the language resources required by the technical approaches taken and the impact of these resources on HLT progress; the applications of HLT and resources to other disciplines including law, medicine, economics, the political sciences and psychology; the impact of HLTs and related technologies on linguistic analysis and novel approaches in fields as widespread as phonetics, semantics, language documentation, sociolinguistics and dialect geography; and finally, the impact of any of these developments on the ways in which language resources are created, shared and exploited and on the specific resources required.