LDC 20th Anniversary Workshop: September 6-7 2012
Thursday, 6 September 2012
8.00 – 9.00: Registration and Breakfast in Claudia Cohen Hall (CCH) Terrace Room
(all workshop sessions will take place in the Class of 1949 Auditorium, Houston Hall)
9.00 – 9.05: Welcome Address - Andy Binns, Vice Provost for Education
9.05 – 9.35: LDC Introductions
9.35 – 10.35: Morning Session 1.1 - Uses and Applications of Language Resources I
9.35 – 9.55: M Keith Chen, Yale University - "Linking Linguistic Data and Economic Decision Making" Presentation in pdf
9.55 – 10.15: Keelan Evanini, Educational Testing Service - "The Future of Shared Resources for the Automated Assessment of Spoken and Written Language" Presentation in pdf
10.15 – 10.35: Lyle Ungar, Penn CIS - "Word Use, Personality and Subjective Well-being" Presentation in pdf
10.35 – 11.10: Coffee Break in CCH Terrace Room
11.10 – 12.30: Morning Session 1.2: Programs, Centers and Infrastructure I
11.10 – 11.30: Jack Godfrey, Johns Hopkins University - "No Program? No Problem! LDC, NIST, and a Golden Age of progress in Speaker ID" Presentation in pdf
11.30 – 11.50: Judith Klavans, U.S. Government - "Roots and Branches: The Origins and Growth of Treebanks" Presentation in pdf
11.50 – 12.10: Edouard Geoffrois, DGA & ANR - "The Need for Public Data Centers"
12.10 – 12.30: Joseph Mariani, LIMSI-CNRS & IMMI - "An Historical Perspective on Language Resources and Evaluation in Europe" Presentation in pdf
12.30 – 2.10: Lunch
2.10 – 3.10: Afternoon Session 1.3: Novel Approaches I
2.10 – 2.30: John Coleman, University of Oxford - "Linked Speech in Crowd/Cloud Corpus Consortia" Presentation in pdf
2.30 – 2.50: Chris Callison-Burch, Johns Hopkins University - "The Promise of Crowdsourcing" Presentation in pdf
2.50 – 3.10: Dean Foster, UPenn Statistics - "Eigenword-based language models from large corpora" Presentation in pdf
3.10 – 3.45: Closing Remarks, Directions to Banquet venue
5.30 – 9.30: Banquet
Friday, 7 September 2012
8.00 – 9.00: Registration and Breakfast in Houston Hall, Hall of Flags
9.00 – 9.30: LDC Introductions
9.30 – 10.30: Morning Session 2.1: Programs, Centers and Infrastructure II
9.30 – 9.50: Khalid Choukri, ELRA/ELDA - "Language Resources ... by the Other Data Center: 15 Years of a Fruitful Partnership" Presentation in pdf
9.50 – 10.10: Bonnie Dorr, Program Manager DARPA Information Innovation Office - "Evolution of Data Needs in DARPA Language Programs" Presentation in pdf
10.10 – 10.30: Steven Krauwer, CLARIN ERIC, Utrecht University - "Serving the Humanities: Daydreams and Nightmares" Presentation in pdf
10.30 – 11.00: Coffee Break in Hall of Flags
11.00 – 12.20: Morning Session 2.2: Novel Approaches II
11.00 – 11.20: Steven Bird, University of Melbourne and University of Pennsylvania - "Language Engineering in the Field: Preserving 100 Endangered Languages in New Guinea" Presentation in pdf
11.20 – 11.40: Salim Roukos, IBM Research - "Twenty Years Means 10,000X"
11.40 – 12.00: Katie Drager, University of Hawai'i at Mānoa - "New Directions in Sociolinguistic Methods" Presentation in pdf
12.00 – 12.20: Jiahong Yuan, University of Pennsylvania - "What Can We Learn about Mandarin Tones from Large Speech Corpora?"
12.20 – 1.50: Lunch
1.50 – 2.50: Afternoon Session 2.3: Uses and Applications of Language Resources II
1.50 – 2.10: Jerry Goldman, Chicago-Kent College of Law, oyez.org - "Beyond SCOTUS: The Mountain Range Ahead" Presentation in pdf
2.10 – 2.30: Joseph Picone, Temple University - "The Neural Engineering Data Consortium: Deja Vu All Over Again" Presentation in pdf
2.30 – 2.50: Brian Carver, University of California, Berkeley - "Collecting and Distributing Legal Corpora and their Citation Graphs" Presentation in pdf
2.50 – 3.20: Coffee Break in Hall of Flags
3.20 – 4.40: Afternoon Session 2.4: Programs, Centers and Infrastructure III
3.20 – 3.40: Jan Hajic, Charles University in Prague - "From Morphology to Semantics: the Prague Dependency Treebank Family" Presentation in pdf
3.40 – 4.00: Mary Harper, IARPA - "Data Resources to Support the Babel Program" Presentation in pdf
4.00 – 4.20: Brian MacWhinney, Carnegie Mellon University - "Data Centers, Sharing, Standards, and Linkage: Toward the New Infrastructure" Presentation in pdf
4.20 – 4.40: Open Discussion
4.40 – 5.00: Closing Remarks
Languages differ widely in the ways they partition time. In this paper I test the hypothesis that languages that do not grammatically distinguish between present and future events (what linguists call weak-FTR languages) lead their speakers to take more future-oriented actions. First, I show how this prediction arises naturally when well-documented effects of language on cognition are merged with models of decision making over time. Then, I show that consistent with this hypothesis, speakers of weak-FTR languages save more, hold more retirement wealth, smoke less, are less likely to be obese, and enjoy better long-run health. This is true in every major region of the world and holds even when comparing only demographically similar individuals born and living in the same country. The evidence does not support the most obvious forms of common causation. I discuss implications of these findings for theories of intertemporal choice.
In the presentation, I will introduce the types of language resources that are necessary for developing systems for the automated assessment of spoken and written language proficiency. Then, I will present the details of some public and privately-held corpora that have been developed for this purpose, and make the case that an increased release of privately-held corpora would benefit the field as a whole. Finally, I will talk about recent efforts in developing public shared tasks for this field, and the lessons we have learned from them for designing similar shared tasks in the future.
The words people use on social media such as Twitter, Facebook and Google provide a rich, if imperfect, source of information about their personality and psychological state. We use Facebook posts and personality test results from over 100,000 volunteers to characterize how Facebook word use varies as a function of age, sex, IQ, and personality, providing interesting insights into people's thoughts and concerns.
When the government, especially DoD, has language technology needs which require R&D, it uses sponsored programs at DARPA or the military services or agencies. Automatic Speech Recognition (ASR) is an example, perhaps the paradigm, for such efforts, and LDC has played a central role in all ASR programs since 1992. In fact, in those two decades, there have been at least six major ASR programs, all relying on "Big Data" from the LDC. Speaker and Language ID have always stood in the shadow of ASR -- there have only been two modest USG research programs in the last 20 years. Nevertheless, progress in these technologies has been truly remarkable, especially in the last decade, with great technical progress and multiple satisfied government customers. While it's not exactly the tortoise and the hare, a story can be told that modest but steady funding was unusually successful in advancing Speaker ID, mainly because of a "virtuous cycle" involving NIST, LDC, MIT-Lincoln Lab, and DoD, all playing different roles than, for example, in DARPA programs. There was competition, but also a freedom to explore new territory, not typical of big programs, all while developing useful technology. Examples are HASR, forensic SID, crosslingual SID, and PPRLM.
In the spring of 1990, in a stuffy room in the CS department at UPenn, a group of about 15 researchers working on computational syntax convened to address the prospect of coming up with a single agreed-upon annotation "standard" for an on-line repository of syntactically marked-up sentences. The goal was to arrive at an acceptable consensus across theories and approaches (dependencies or not? Lexical-functional relations or not?) for the good of the larger computational linguistics community, many of whom were simply interested in using the output of this effort to train systems, and to understand syntactic peculiarities of languages other than English.
Luckily, Ezra Black, then at IBM Research, created a publication that reflected some of this discussion, the livelier parts of which were most likely unpublishable. My goal today is to provide some retrospective perspectives on how the LDC came about with respect to the TreeBank projects, as well as to comment on impact and future directions.
Multilingualism is a specific dimension of the European Union, which nowadays comprises 27 Member States and 23 Official Languages. Language Technologies would be crucial to allow for multilingualism in Europe. Their development requires the availability of Language Resources and Language Technology Evaluation (LRE).
The European activities in this area started in the early 1980s within the NATO RSG10 Group and in the early 1990s through initiatives involving the newly created European Speech Communication Association (ESCA), such as the creation of COCOSDA, the Coordinating Committee on Speech databases and Speech I/O Systems Assessment. ELRA, the European Language Resources Association, was founded in 1995, and initiated the Language Resources and Evaluation Conference (LREC) in 1998.
LRE was especially considered in France, with the activities conducted within the FRANCIL Network in the Francophone countries starting in 1994, extended within the TechnoLangue programs managed by the French Ministry of Research, then in several specific projects supported by the French Research Agency (ANR). Nowadays, the large Quaero program is structured around the systematic evaluation of the technologies and applications developed within the program, and a project is specifically related to the production of the necessary corpus.
The European Commission supported several projects of limited duration on LRE, most recently including CLARIN, for Human and Social Sciences; FLaReNet, aimed at fostering the use of LR; and META-NET, on Machine Translation and Multilingual Technologies. The situation differs greatly among the various European languages and countries, and sharing the research effort among the European Member States and the EC would benefit European researchers and, ultimately, European citizens.
The LDC acts as both cataloguer of and repository for very large amounts of spoken language data. It is a superb sound archive. But even larger collections of recorded speech reside elsewhere, in libraries and archives, and in other repositories of linguistic data. In this presentation I will examine some of the challenges involved in finding and accessing fragments of speech recordings that are held across multiple sites.
The availability of very large speech corpora offers many exciting opportunities for doing research in phonetics and linguistics. New tools and methods are needed for analyzing large speech corpora. In this talk, I will present a corpus study of 3rd tone sandhi and voice quality in Mandarin tones. The study demonstrates the use of speech technology methods for large-scale phonetics research.
I will discuss the promise of crowdsourcing for creating speech and language data. I will detail a number of my own recent experiments using Amazon Mechanical Turk to create bilingual parallel corpora for a variety of languages. I will discuss the challenges of crowdsourcing, including quality control, conveying complex tasks to lay users, and professional vs. non-professional annotation, as well as its advantages, including scalability and access to a worldwide workforce with diverse language skills.
Using gigaword and larger corpora requires efficient computational methods. We present a new singular value decomposition (SVD)-based method that projects words to "eigenwords" lying in a low dimensional space in which distributionally similar words are close to each other.
Eigenwords capture word meaning, and can be context-sensitive, allowing for word sense ambiguity. Unlike the EM and Gibbs sampling methods used to estimate HMMs, our SVD-based eigenword method provides guaranteed fast convergence with strong sample complexity guarantees. We show that in a number of natural language processing (NLP) applications, including parsing and named entity recognition, eigenwords can be used to improve the performance of state-of-the-art prediction systems.
The talk will begin with a brief description of the rationale behind the set-up of ELRA: a European concern, modeled on the US initiative, and an instrument to boost the then-emerging cooperation. The first years of ELRA activities were devoted to the identification and cataloguing of Language Resources (LRs), shifting quickly to the negotiation of distribution rights to ensure wide availability of LRs with cleared and clean IPR. While playing this broker role, ELRA anticipated the needs of its members, moving towards the production of resources on demand, both for R&D projects and for customers. To accompany such production, ELRA dedicated considerable effort to specifying and implementing quality control and validation processes for LRs. To carry out this task, ELDA set up validation units that performed quality assessment. This quality control also led to the collection of bug reports and correction patches. In order to automate such production and offer cost-effective approaches, ELRA is involved in several collaborative projects that are designing and implementing production platforms, such as Panacea (www.panacea-lr.eu), which aims at automatic, cost-effective acquisition of Language Resources for Human Language Technologies.
Following the availability of LRs, key players expressed the need to have their technologies evaluated. ELRA devoted efforts to setting up evaluation platforms and conducting evaluation campaigns, covering over 20 technologies so far. ELRA continues to be involved in some of the major European and/or national initiatives, e.g. CLEF, MT-evals, Audio-visual evaluations, etc. ELRA has been able to capitalize on such initiatives to package resources, metrics, and methodologies into evaluation kits that are made available to those wishing to conduct similar experiments outside the official evaluation framework. In order to disseminate information on technology evaluation, ELRA has set up a portal that compiles information on past and present evaluation projects and campaigns, on existing evaluation packages, on evaluation services, etc. (www.hlt-evaluation.org).
These activities of LR identification, distribution, maintenance, etc. allowed an impressive return on investment for all EU players. While in the past similar resources were funded many times and often lost a few years after their use, ELRA has managed to maintain a lively catalogue that comprises most of the resources that were publicly funded (as well as many that were privately funded). Such resources are offered to third parties under a clear legal framework but under fair market conditions. These conditions have become a major concern and source of frustration for most academic users, given the evolution of the web and the associated business models and trends.
To consider and anticipate these trends in language resource sharing, ELRA joined several important initiatives, the most recent one being META-NET. META-NET embraces the new spirit of peer-to-peer and distributed repositories of data and tools, and has worked to set up an open, integrated, secure, and interoperable exchange infrastructure for language data and tools for the Human Language Technologies domain, known as META-SHARE (http://www.meta-share.org/).
META-SHARE is designed as a "marketplace" where language data and tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded, exchanged, discussed, with an aim to support a data economy (free and for-a-fee LRs/LTs and services). It brings together all players within the EU and their associated initiatives.
Some of the key objectives of META-SHARE are derived from community concerns about LR visibility, documentation, identification, availability, preservation, and interoperability; it is a long-term endeavor to boost research, technology and innovation through wide availability, pooling, openness and sharing.
META-SHARE has achieved three major goals: (1) the adoption of a new, rich, uniform metadata schema for LR description (the schema has been widely adopted and is very likely to be a good candidate for standardization with ISO); (2) the set-up of a single platform consisting of a network of repositories providing easy, uniform, one-step access to LRs through the aggregation of LR sources into one catalogue that facilitates search and retrieval; and (3) the establishment of a consistent and coherent legal framework in which all IPR issues are dealt with and all exchanges can re-use a set of drafted licenses that account for users' expectations and providers' demands, known as META-SHARE Commons. This new initiative is supported by ELRA, which brought in its own catalogue of over 1000 resources, including a few hundred that are made available free of charge for research purposes.
Another major action initiated by ELRA in partnership with LDC, NICT/ALAGIN and other players is a proposal for an International Standard Language Resource Number (ISLRN); it aims at the attribution of a new, unique and universal identifier to each LR to ensure that one can identify them across data centers.
After reviewing existing identification schemes, we conclude that there is a need to establish a specific identifier for LRs using a standardized nomenclature. This will ensure that LRs are correctly identified and, consequently, recognized with proper references for their usage in applications in R&D projects, in product evaluation and benchmarking, and in documents and scientific papers.
Moreover, it is a major step in the networked and shared world that Human Language Technologies has become: unique resources must be identified as such, and meta-catalogues need a common identification format to manage data correctly. Therefore, LRs should carry identical identification schemes independently of their representations, whatever their types and wherever their physical locations (on hard drives, Internet or Intranet) may be. This initiative will be described in detail and launched at the LDC 20th anniversary workshop.
In order to allow for discussion and exchange of ideas, ELRA initiated in 1998 what became the major conference in this field, the Language Resources and Evaluation Conference (LREC), which attracts over 1000 attendees from various horizons and sectors of activity (http://www.lrec-conf.org/). With a large number of satellite workshops, this event is the forum where all issues related to LRs are discussed and where projects, partnerships, and new initiatives emerge. It gave birth to initiatives such as the LRE-Map and the associated language matrices (a community-built catalogue of resources used or described in LREC papers that complements the ELRA and other catalogues) and the Language Library, an initiative to boost collaborative work between members of the community working on similar issues or similar resources. The workshop presentation will focus on these visions for the next generation of Data Centers and the instruments to boost international cooperation, building on the past 15 years of ELRA & LDC cooperation.
Tools and corpora for linguistic analysis are now moving into a new stage of increased interoperability. Advances such as the multilingual word nets, parsers, taggers, and other toolkits open up the possibility of creating systematic comparative analyses across corpora. However, the current practice of archiving corpora in discrepant original formats with complex access rights makes it impossible for the average user to apply these new tools. In order to benefit fully from these new opportunities, data centers need to move away from the model of archiving corpora to the model of creating interoperable open corpora that use formats based on syntactic, semantic, and linkage standards.
CLARIN, the Common Language Resources and Technology Infrastructure, has been created to serve the humanities research community in a very broad sense, i.e. not just computational linguists or speech scientists but also a broad spectrum of humanities scholars without any technical background or ambition, ranging from historians to philosophers and theologians. In my presentation I will briefly sketch the potential impact that the use of language resources and technology can have on these fields, but also the problems that have to be addressed in order to make this work.
Languages are falling out of use more quickly than linguists are able to document them. I will argue that the task involves engineering as much as linguistics. The engineering challenge is to collect large quantities of bilingual text (source texts and their sentence-aligned translations) from which lexicons and phrase structures can be inferred. I will describe my work on scalable methods for collecting bilingual written and spoken texts, involving dozens of endangered languages in Papua New Guinea.
I will give a brief overview of data collection activities in NLP over the past 20 years and discuss some threads that need additional focus for linguistic annotation of language data over the coming decade.
Sociolinguistics is developing in a number of different methodological directions. In this talk, I discuss two of these directions: speech perception experiments and the automated analysis of large datasets. For the former, I present different methods and the kinds of research questions they can be used to address. For the latter, I discuss LaBB-CAT – a program that allows for fast and efficient quantitative analysis of spontaneous speech – and SOLIS – a collection of spontaneous speech corpora at UH Mānoa.
This presentation describes the current and future state of The Oyez Project, covering the period from October 1955, when audio recording was introduced in the U.S. Supreme Court, through June 2012, when a sharply divided court decided the Affordable Care Act cases (more than 10,000 hours of audio, or more than 110 million words). We have added front-end functionality to clip and share; back-end editorial control for error correction; apps for iOS and Android simplifying flip-tap-listen-share; and a single-pass approach to speaker identification.
Our grand challenge is to archive and share the variably maintained resources of all federal appeals courts and all state high courts that today record their public proceedings. Issues arise both in scale and complexity. We seek solutions in speaker identification and speech-to-text applications. Other topics include the identification of emotion and the creation and exploration of linked data.
It is hard to overestimate the impact the evaluation-driven research paradigm has made on human language technology (HLT). Yet, some fields, including the bioengineering community, continue to operate in a mode where research results cannot be easily verified and performance claims are often overly optimistic. The brain machine research community, in particular, is mired in this type of disorganization at a time when mass media and entrepreneurial interests in these technologies are at an all-time high.
Hence, we are developing, in collaboration with the Linguistic Data Consortium, a center at Temple University focused on the development of resources to advance brain machine interfaces. The center will build on the best practices developed by LDC, and is expected to eventually address a wide range of data needs in the bioengineering community. Our first corpus will be the release of over 10,000 EEG recordings conducted at Temple Hospital in Philadelphia, constituting the largest publicly available corpus of its type in the world. Machine learning technology developed on this data is expected to have both clinical and engineering research impact.
Brian W. Carver is Assistant Professor at the UC Berkeley School of Information where he does research on and teaches about intellectual property law and cyberlaw. He is also passionate about the public’s access to the law. In 2009 and 2010 he advised an I School Masters student, Michael Lissner, on the creation of CourtListener.com, an alert service covering the U.S. federal appellate courts. After Michael's graduation he and Brian continued working on the site and have grown the database of opinions to include over 750,000 documents, now including the entire Supreme Court corpus from 1754 to the present. In 2011 and 2012, Brian advised I School Masters students Rowyn McDonald and Karen Rustad on the creation of a legal citator built on the CourtListener database. The site is provided at no cost to the public, all of the site's code is licensed under open source licenses, and all the documents, with citation relationships, can be downloaded in bulk through an API. He hopes researchers that need large corpora of English-language documents will take advantage of this resource and would welcome proposals for research collaborations.
The quality, quantity, and characteristics of linguistic data are key determiners of the success of DARPA's human language technology research. Often the constraints of time, money, and availability make it impractical to work with real world data, yet without a firm foothold in the characteristics of the data on which language technologies must operate, they are sure to fail. The work of the Linguistic Data Consortium has been and no doubt will continue to be instrumental in filling the data needs of DARPA programs as we move further and further in the direction of data with operational characteristics, data of low resolution/quality, data with high noise and degradation, data from informal genres, multiple modes, and multiple dialects, smaller amounts of data with a more targeted nature, and faster data acquisition and creation.
This presentation will describe the data resources that are being collected to support the Babel program, and the challenges that performers will face in the program when working with this data. The goal of the Babel Program is to rapidly develop speech recognition capability for keyword search in a previously unstudied language, working with speech recorded in a variety of conditions and with limited amounts of transcription. This effort requires the collection of speech data in a wide variety of languages to facilitate research efforts and assess progress toward Program objectives. The speech data is being recorded in the country where the speakers reside and contains variability in speaker demographics and recording conditions. The Program addresses a broad set of languages from many different language families (e.g., Afro-Asiatic, Niger-Congo, Sino-Tibetan, Austronesian, Dravidian, Altaic), and the languages selected for the Program have a variety of phonotactic, phonological, tonal, morphological, syntactic, etc. characteristics. Performers work with a different set of Development Languages in each Program Period to create new methods, and then are evaluated at the end of each period on a surprise language with increasing limitations on the amount of transcribed training speech provided and on the time allowed to create the system across the periods of the program.
The Program initially focuses on telephone speech but will add speech recorded with various other devices in order to foster research on channel robustness. Key technical challenges for the Program that are facilitated by the data include, but are not limited to, methods that are effective across languages, robustness to speech recorded in noisy conditions with channel diversity, approaches to mitigate limited amounts of transcription, limited system build time, effective keyword search algorithms for speech, and analysis of factors contributing to system performance.
The history of the annotation efforts at the Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic, which dates back to the mid-nineties of the last century, will be presented, with examples from past and present mono- and multi-lingual corpora and good and bad experiences from our annotation practice. In particular, the intertwined system of morphological, syntactic and semantic annotation of the family of the Prague Dependency Treebanks will be shown and discussed in comparison to other annotation efforts, such as the Penn Treebank(s). Recent efforts on bridging anaphora and discourse annotation will also be presented, as well as the issues related to parallel treebank construction. Future plans (and some organizational changes which the Institute is currently going through, especially in the area of language research infrastructure) will also be outlined.