20th Anniversary Workshop

The Future of Language Resources: LDC 20th Anniversary Workshop

University of Pennsylvania, Philadelphia, September 6-7, 2012

LDC hosted a workshop to celebrate its 20th Anniversary on September 6-7, 2012 at the University of Pennsylvania's Houston Hall [1].

Overview

A little more than two decades ago, progress in human language technology (HLT) research and development was hampered by the dearth of available language resources. The application of machine learning to several HLT problems had shown promise, but such approaches were data-hungry. To the extent that relevant corpora existed, they were typically privately held and generally unknown except to the closest colleagues of the creators. Those who were willing to share their data had to contend with the lack of infrastructure to support their philanthropy.

In 1992, the Linguistic Data Consortium (LDC) was founded to fill the need for a repository and distribution point for large data sets used in HLT R&D. The idea of a cooperative, membership-based consortium of academic, research, government and commercial organizations sharing language resources was enthusiastically supported by potential hosts and members alike. While there were many suggestions in those early days about what such a consortium could accomplish, it was generally agreed that its principal objective would be to facilitate ways in which large-scale data collection efforts could be used more effectively in support of realistic applications.

LDC addressed that challenge in several ways: by launching the first comprehensive language resource catalog, which in 2012 contains over 500 holdings in over 60 languages in a variety of genres; by supporting common task technology evaluations with custom datasets; by establishing and maintaining long-term data collections of text, speech and video across genres; by developing key resources for the community such as lexicons and large text collections; by commencing its annotation operations and the associated development of tools, specifications, guidelines and best practices in and across programs; and by consistently providing a high level of service to its members. Since this early success, the field has witnessed the establishment of other consortia and networks with similar goals as well as those focusing on individual languages and technology programs.

Twenty years later, the landscape looks very different. Computer technologies have grown from being specialist equipment to being commodities to becoming components of appliances and similar devices. Storage, processing and display capabilities that were almost unimaginable in the early 90s are now widespread. Technologies mediate a large and increasing percentage not only of business transactions but also of social interactions. Crowd-sourcing, micro-marketing and cloud computing are destroying the last of the traditional workforce partitions. The number of human languages present on the Internet continues to grow. HLTs such as speech-to-text, text-to-speech and machine translation are now salient in the market. Open source toolkits enable further development of HLTs for new languages and data challenges.

Some have taken these changes to suggest that data accessibility is no longer a challenge that requires shared infrastructure but can be addressed through individual, independent efforts. But is that true? And if it is, what role, if any, do language data centers play in this, their third decade? To commemorate the 20th anniversary of the LDC and the beginning of language data centers, this workshop addresses the future of language resources. However, in order to move forward with deliberation, it is also necessary to understand the history of the field and the motivations for the paths it has taken.

Submissions

Final presentations can be found on the Program page.

Confirmed Speakers

M. Keith Chen, Yale University
Lyle Ungar, University of Pennsylvania
Jack Godfrey, Johns Hopkins University, Center of Excellence
Judith Klavans
Edouard Geoffrois, DGA & ANR (France)
Joseph Mariani, LIMSI-CNRS & IMMI (France)
Jiahong Yuan, University of Pennsylvania
Chris Callison-Burch, Johns Hopkins University
Dean Foster, University of Pennsylvania
Khalid Choukri, ELRA/ELDA (France)
Brian MacWhinney, Carnegie Mellon University
Steven Krauwer, CLARIN ERIC, Utrecht University (Netherlands)
John Coleman, University of Oxford (United Kingdom)
Steven Bird, University of Melbourne (Australia), University of Pennsylvania
Salim Roukos, IBM
Katie Drager, University of Hawai‘i at Mānoa
Jerry Goldman, Chicago-Kent College of Law, oyez.org
Joseph Picone, Temple University
Brian Carver, University of California, Berkeley
Keelan Evanini, Educational Testing Service
Bonnie Dorr, DARPA
Mary Harper, IARPA
Jan Hajic, Charles University in Prague (Czech Republic)

Registration

Online registration is now closed.

Venue and Accommodation

Venue

The Workshop will be held on the University of Pennsylvania's campus in Philadelphia, PA. Workshop sessions will take place in the Class of 1949 Auditorium in Houston Hall and coffee breaks will take place in Houston Hall's Hall of Flags or in Claudia Cohen Hall's Terrace Room. Walking directions from either hotel to the workshop venues are available. Breakfasts are provided but lunch is on your own. A list of restaurants and lunch trucks near the venue is available.

Accommodation

LDC has negotiated a group rate for a block of rooms at the University City Sheraton [3] for the nights of September 5th through 7th. Please book your accommodation before August 6, 2012 as all unreserved rooms will be released on that date. Please reference the Linguistic Data Consortium when you are booking to receive the discounted rate.

The Hilton's Inn at Penn [4] is also within walking distance of the conference venue. Both hotels are 0.3 miles, approximately a ten-minute walk, from Houston Hall.

Workshop Organizing Committee

Mark Liberman, Linguistic Data Consortium
Christopher Cieri, Linguistic Data Consortium
Denise DiPersio, Linguistic Data Consortium
Marian Reed, Linguistic Data Consortium
Marisa Lantieri, Linguistic Data Consortium