NIEUW Workshop | Linguistic Data Consortium

NIEUW: Novel Incentives and Engineering Unique Workflows

Hosted by the Linguistic Data Consortium (LDC) and made possible by a National Science Foundation CISE CRI planning grant (#1629923), this workshop was held October 3-4, 2016, at the University of Pennsylvania in Philadelphia, PA.

Background

Although advances in Human Language Technology (HLT) have been significant, much of its potential remains untapped because the Linguistic Resources (LRs) that fuel development fall far short of need, even in languages with the greatest numbers of speakers. The MetaNet White Paper series demonstrates that no language, not even English, has the complete complement of resources needed to build the language technologies that we know the market craves. Furthermore, reliance on short-term approaches that develop corpora for very narrow research tasks and then stop will never adequately address the grand challenge of enabling research truly representative of the world’s languages. The research community needs sustainable infrastructure to develop high-quality LRs continuously and in a variety of languages without relying upon scarce resources such as project-based direct funding.

Historically, language resource development has relied on monetary incentives for contributions of raw data and linguistic annotations. In contrast, social media has employed a wider range of incentives to mobilize the power of the crowd to create data that is valuable to HLT (even if not intended for this purpose). These incentives include access to information and entertainment; access to services and infrastructure based on contributions; payment or discounts in real and virtual worlds; possibilities for self-expression, sharing, and publicizing intellectual or creative work; chances to vent frustrations or convey thoughts, sometimes anonymously; exercises that develop competence and may lead to new prospects; forums for socializing; competition, status, prestige, and recognition; and opportunities to contribute to a greater cause or social good. Within HLT communities, some projects have successfully employed crowdsourcing and novel incentives such as gamification to solicit data and linguistic judgments from contributors. Phrase Detectives provides entertainment, challenge, and access to interesting reading in exchange for anaphora annotation. The Great Language Game tested players’ abilities to recognize world languages while recording their performance as a function of the input and the multiple choices offered. Let’s Go mediates access to Pittsburgh Port Authority Transit bus schedules and route information while recording the interactions to improve system performance in real-world situations, especially for users such as non-native and elderly speakers. ZombiLingo uses gamification and zombie graphics to get players to identify dependency relations in French sentences.

However, outside HLT communities, novel incentives, games, and citizen science have been used to even greater effect. For example, LibriVox offers volunteer contributors the chance to create audio recordings of public domain works of literature and poetry, develop their skills as readers and voice actors, work within a community of like-minded volunteers, and enable access for the blind, the illiterate, and others. While the recordings are primarily in English, LibriVox contains recordings in over 30 languages and is quickly approaching 10,000 recordings available for download. Another success story is the citizen science portal Zooniverse, which boasts dozens of internet-based citizen science projects and, as of February 2014, over one million volunteers. Zooniverse includes projects in a range of scientific fields from astronomy to zoology, as well as linguistic exercises such as the transcription of originally handwritten bird-watching journals and artists’ diaries or of the typewritten labels of insect collections.

Workshop Motivation

While lagging behind some other disciplines in the use of novel incentives, HLT researchers have used crowdsourcing to lower the cost of data collection and annotation. By offering human contributors sustained access to appropriate opportunities, activities, and incentives, we can enhance LR development well beyond what traditional direct funding alone can produce. However, along with these new incentives and workflows come new challenges.

One goal of the workshop is to identify and address some of these new challenges in applying human computation and novel incentives to the tasks of linguistic data collection and annotation in a variety of languages, especially under-resourced languages. A second goal is to bring together researchers and professionals from different scientific fields and disciplines who are united by similar challenges. While some of the specific tasks and overall goals may differ, many of the challenges and solutions will likely be shared across varied disciplines. Additionally, exposure to the experiences, challenges, and strategies related to novel incentives in other disciplines may stimulate previously unconsidered questions and solutions.

Presentations

NIEUW: Novel Incentives and Engineering Unique Workflows
Christopher Cieri and James Fiumara, Linguistic Data Consortium
Slides in PDF

New Approaches for Distributed Non-Expert Annotation and Collection at LDC
Stephanie Strassel, Linguistic Data Consortium
Slides in PDF

Building Communities Through Public Domain Audiobook Production at LibriVox.org
Elizabeth Klett, University of Houston – Clear Lake
Slides in PDF

The Afranaph Project – Expanding a dynamic resource by offering a service
Ken Safir, Rutgers University
Alexis Dimitriadis, Utrecht

Penn Sociolinguistics Archive
William Labov, University of Pennsylvania
Slides in PDF

The Gun Violence Database
Ellie Pavlick and Chris Callison-Burch, University of Pennsylvania
Slides in PDF

ZombiLingo: defying complexity
Karën Fort, Université Paris-Sorbonne
Bruno Guillaume, LORIA
Slides in PDF

Games-with-a-purpose for corpus annotation in the DALI project
Massimo Poesio, Richard Bartle, Jon Chamberlain, Chris Madge, and Udo Kruschwitz, University of Essex
Slides in PDF

Crowd-Powered Conversational Systems
Walter Lasecki, University of Michigan

A Fun and Engaging Interface for Crowdsourcing Named Entities
Kara Greenfield, MIT Lincoln Laboratory
Kelsey Chan, MIT
Joseph P. Campbell, MIT Lincoln Laboratory
Slides in PDF

CrowdCurio: An Ecosystem for Research-Oriented Crowdsourcing
Alex Williams, University of Waterloo

Is That Right? Crowd and Consensus in Rare Manuscript Transcription for EMMO
Paul Dingman, Folger Shakespeare Library
Slides in PDF

Oral Histories: Linguistic Documentation as Social Media
Mark Liberman, University of Pennsylvania, Linguistic Data Consortium