August 2012 Newsletter

Wednesday, August 15, 2012

New Corpora

English Web Treebank

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2

Spanish TimeBank 1.0


The Future of Language Resources: LDC 20th Anniversary Workshop

LDC’s 20th Anniversary Workshop is rapidly approaching! The event will take place on the University of Pennsylvania’s campus on September 6-7, 2012.
Workshop themes include: the developments in human language technologies and associated resources that have brought us to our current state; the language resources required by the technical approaches taken and the impact of these resources on HLT progress; the applications of HLT and resources to other disciplines including law, medicine, economics, the political sciences and psychology; the impact of HLTs and related technologies on linguistic analysis and novel approaches in fields as widespread as phonetics, semantics, language documentation, sociolinguistics and dialect geography; and the impact of any of these developments on the ways in which language resources are created, shared and exploited and on the specific resources required.

Fall 2012 LDC Data Scholarship Program

Applications are now being accepted through September 17, 2012, 11:59PM EST for the Fall 2012 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 20 individual students and student research groups.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use, explain how the data will benefit their research project, and describe the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than two datasets; students may apply for additional datasets during the following cycle once they have completed processing of the initial datasets and published or presented work in some juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Fall 2012 program cycle is September 17, 2012, 11:59PM EST.

Spotlight on HAVIC

As part of our 20th anniversary celebration, the coming newsletters will include features that provide an overview of the broad range of LDC’s activities. To begin, we'll examine the Heterogeneous Audio Visual Internet Collection (HAVIC), one of the many projects handled by LDC’s Collection/Annotation Group led by Senior Associate Director Stephanie Strassel.

Under the supervision of Senior Research Coordinator Amanda Morris, the HAVIC team is developing a large corpus of unconstrained multimedia data drawn from user-generated videos on the web and annotated for a variety of features. The HAVIC corpus has been designed with an eye toward providing increased challenges for both acoustic and video processing technologies, focusing on multi-dimensional variation inherent in user-generated content. Over the past three years the corpus has provided training, development and test data for the NIST TRECVID Multimedia Event Detection (MED) Evaluation Track, whose goal is to assemble core detection technologies into a system that can search multimedia recordings for user-defined events based on pre-computed metadata.

For each MED evaluation, LDC and NIST have collaborated to define many new events, such as “making a cake” or “assembling a shelter”. Each event requires an Event Kit, consisting of a textual description of the event’s properties along with a few exemplar videos depicting the event. A large team of LDC data scouts searches for videos that contain each event, along with videos that are only indirectly or superficially related to defined events and background videos that are unrelated to any defined event. After finding suitable content, data scouts label each video for a variety of features, including the presence of audio, visual or text evidence that a particular event has occurred. This work is done using LDC’s AScout framework, consisting of a browser plug-in, a database backend and processing scripts that together permit data scouts to efficiently search for videos, annotate the multimedia content, and initiate download and post-processing of the data. Collected data is converted to MPEG-4 format, with H.264 video encoding and AAC audio encoding, and the original video resolution and audio/video bitrates are retained.

To date, LDC has collected and labeled well over 100,000 videos as part of the HAVIC Project, and the corpus will ultimately comprise thousands of hours of labeled data. Look for portions of the corpus to appear among LDC’s future releases.