LREC 2016 Workshop | Linguistic Data Consortium

Novel Incentives for Collecting Data and Annotation from People: types, implementation, tasking requirements, workflow and results

This first workshop on novel incentives in linguistic data collection was held in conjunction with the Tenth International Conference on Language Resources and Evaluation (LREC 2016) in Portorož, Slovenia, during the afternoon of May 28, 2016, at the Grand Hotel Bernardin Conference Center.

Background

Despite more than two decades of effort from many research groups and large data centers, the supply of language resources falls far short of need, even for the languages with the most speakers and the largest shares of the world economy. For languages with less international recognition, resources are scarce, fragmentary or absent. Recent programs such as DARPA LORELEI recognize and attempt to address this gap, but even they will provide only core resources for a few dozen languages, a small proportion of the more than 7,000 currently in use worldwide.

In Language Resource (LR) development, the most common incentives for contributors are monetary. Whether motivated by convenience or ethical beliefs, that bias limits the Human Language Technology (HLT) community’s ability to collect data and to understand how different incentives affect collection. Because linguistic innovation is effectively limitless, relying upon a limited resource, monetary compensation, to generate the data needed to document the world’s languages is certain to fall short. Instead, LR developers and users must develop and employ incentives that scale beyond the budget of a 3- or 5-year program.

A few HLT projects have employed alternate incentives. Phrase Detectives provides entertainment, challenge and access to interesting reading in exchange for anaphora annotation. Herme gave participants the unusual experience of interacting verbally with a tiny, cute robot while recording their interactions. Let’s Go mediates access to Pittsburgh Port Authority Transit bus schedules and route information while recording the interactions to improve system performance in real world situations, especially for ‘extreme’ users such as non-native and elderly speakers.

However, outside our field, collections employ varied incentives to much greater effect, creating massive data resources. LibriVox offers contributors the chance to create audio recordings of classic works of literature, develop their skills as readers and voice actors, work within a community of similarly minded volunteers and enable access for the blind, the illiterate and others. Zooniverse includes linguistic exercises such as the transcription of originally handwritten bird watching journals and artists' diaries or of the typewritten labels of insect collections. Social media has employed a wide range of incentives including:

access to information and entertainment
possibilities for self-expression, sharing and publicizing intellectual or creative work
chances to vent frustrations or convey thoughts, sometimes anonymously
forums for socializing
exercises which develop competence that may lead to new prospects
competition, status, prestige, and recognition
payment or discounts in real and virtual worlds
access to services and infrastructure based on contributions
novel experiences and improved interactions, for example in a customer service encounter
opportunities to contribute to a greater cause or good

While lagging behind in the use of novel incentives, HLT researchers have productively used crowdsourcing to lower collection and annotation costs. They have also developed techniques for customizing tasking to meet the capacity of the crowd and for fusing highly variable results into data sets that advance technology development. Similar techniques apply to the use of alternate incentives in collecting data from a non-traditional workforce.

This half-day workshop opened the discussion on incentives in data collection, describing novel approaches and comparing them with traditional monetary incentives. Related topics included: descriptions of projects that use the alternate incentives listed above or others; modifications of the data collection and annotation tasking or workflow to accommodate a new workforce, including crowdsourcing; and techniques for exploiting the results of alternate incentives and novel workflows.

Outline

1. Introduction by Workshop Chair
   a. Christopher Cieri, Novel Incentives in Language Resource Development
      Paper in PDF, Slides in PDF
2. Novel Incentives in Data Collection and Requisite Processing
   a. Nick Campbell, Herme & Beyond; the Collection of Natural Speech Data
      Paper in PDF
   b. Kensuke Mitsuzawa, Maito Tauchi, Mathieu Domoulin, Masanori Nakashima, Tomoya Mizumoto, FKC Corpus: a Japanese Corpus from New Opinion Survey Service
      Paper in PDF
3. Novel Incentives and Workflows for Annotation
   a. Kara Greenfield, Kelsey Chan, Joseph P. Campbell, A Fun and Engaging Interface for Crowdsourcing Named Entities
      Paper in PDF
   b. Massimo Poesio, Jon Chamberlain, Udo Kruschwitz and Chris Madge, Novel Incentives for Phrase Detectives
      Paper in PDF
4. Understanding and Exploiting Data from Alternative Sources
   a. Na’im Tyson, Jonathan Roberts, Jeff Allen, Matt Lipson, Evaluation of Anchor Texts for Automated Link Discovery in Semi-structured Web Documents
      Paper in PDF
   b. Maxine Eskenazi, Sungjin Lee, Tiancheng Zhao, Ting Yao Hu, Alan W Black, Unconventional Approaches to Gathering and Sharing Resources for Spoken Dialog Research
      Paper in PDF
5. The Future of Incentives, Workforces, Workflows and Data Exploitation
   a. Mark Liberman, Oral Histories: Linguistic Documentation as Social Media
      Paper in PDF

Program Committee:

Christopher Cieri (LDC)
Chris Callison-Burch (University of Pennsylvania)
Nick Campbell (Trinity College)
Maxine Eskenazi (Carnegie Mellon University)
Massimo Poesio (University of Essex, University of Trento)
Stephanie Strassel (LDC)
Jonathan Wright (LDC)

(bold denotes Organizing Committee members)