December 2012 Newsletter

Monday, December 17, 2012

New Corpora

GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web

Russian-English Computer Security Parallel Text

Announcements

Spring 2013 LDC Data Scholarship Program - Deadline Approaching
The deadline for the Spring 2013 LDC Data Scholarship Program is one month away!   Student applications are being accepted now through January 15, 2013, 11:59PM EST.  The LDC Data Scholarship program provides university students with access to LDC data at no cost.  This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.  

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. 

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

Two New LDC Podcasts for your Listening Pleasure
Two new podcasts continuing the 20th Anniversary series are available on the LDC blog. The first features Natalia Bragilevskaya, LDC’s Business Administrator; Ilya Ahtaridis, Membership Coordinator; and Marian Reed, Marketing Coordinator. They recall the early days of LDC and describe the growth of sponsored projects work and LDC’s interactions with its membership.

The third podcast in the series introduces the community to two LDC researchers, Yiwola Awoyale and Moussa Bamba, whose work focuses on West African languages.

Yiwola has been teaching Linguistics, Yoruba language studies and various aspects of African linguistics since 1975. At LDC, he developed the Global Yoruba Lexical Database, a set of related dictionaries based on Yoruba and its diaspora. Moussa’s work in the Manding languages of the Niger-Congo family has resulted in the release of the Mawukakan Lexicon, to be followed by similar resources for Maninkakan, Bambara, and Jula.

In their podcast, Yiwola and Moussa discuss how they came to LDC, their current research and how it benefits multiple communities.

Other podcasts will be published via the LDC blog, so stay tuned to that space.

Penn Discourse Treebank Version 2.0 Update
The developers of the Penn Discourse Treebank Version 2.0 (PDTB) have updated this release to add metadata to the Wall Street Journal (WSJ) news stories in the corpus. The goal is to make PDTB files easier to understand as texts and to help distinguish texts of different genres within the WSJ.

The metadata includes the following fields:

DD: the date the article appeared in the WSJ

AN: unique identifier for the article

HL: the column name (for regular features such as Who's News, Marketing & Media, Technology), its headline and by-line

SO: the source of the article

IN: manually-assigned codes or keywords for the article

CO: manually-assigned codes for companies or other organizations

DATELINE: normally the location where the article was filed, though the field sometimes holds unexpected content

GV: Branch of Government or Government Agency mentioned in the article

SBREAKS: the byte position of section breaks present in the file

ARTICLEBREAK: separates files that contain more than one article
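
For readers who want to work with these fields programmatically, the following is a minimal Python sketch of how they might be collected per article. It assumes a purely hypothetical layout in which each field sits on its own line as "TAG: value" and ARTICLEBREAK appears alone on a line; the actual layout of the updated PDTB files is not described here and may differ, so consult the corpus documentation before adapting it. The function name parse_metadata and the sample values are illustrative only.

```python
from typing import Dict, List

# Field tags as listed in this newsletter.
KNOWN_TAGS = {
    "DD", "AN", "HL", "SO", "IN", "CO",
    "DATELINE", "GV", "SBREAKS", "ARTICLEBREAK",
}


def parse_metadata(text: str) -> List[Dict[str, str]]:
    """Return one dict of metadata fields per article.

    ASSUMPTION: fields are written as 'TAG: value' lines and
    ARTICLEBREAK separates articles. This layout is hypothetical.
    """
    articles: List[Dict[str, str]] = [{}]
    for line in text.splitlines():
        tag, sep, value = line.partition(":")
        tag = tag.strip()
        if tag not in KNOWN_TAGS:
            continue  # body text or an unrecognized line
        if tag == "ARTICLEBREAK":
            articles.append({})  # start collecting the next article
            continue
        if sep:
            articles[-1][tag] = value.strip()
    # Drop empty entries, e.g. from a trailing ARTICLEBREAK.
    return [article for article in articles if article]


# Made-up sample input, not taken from the corpus.
sample = (
    "DD: 11/02/89\n"
    "AN: A1\n"
    "Some body text: here\n"
    "ARTICLEBREAK\n"
    "AN: A2\n"
    "DATELINE: NEW YORK\n"
)
result = parse_metadata(sample)
```

Keeping one dictionary per article means files that contain more than one article stay separable, which is the case ARTICLEBREAK is meant to cover.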

All new downloads of PDTB will contain the complete updated corpus.  Current PDTB licensees can re-download the file to obtain the updated data.

LDC to Close for Winter Break
LDC will be closed from Monday, December 24, 2012 through Tuesday, January 1, 2013 in accordance with the University of Pennsylvania Winter Break Policy.  Our offices will reopen on Wednesday, January 2, 2013.  Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

Best wishes for a happy and safe holiday season!