May 2009 Newsletter

Friday, May 22, 2009

New Corpora

2008 CoNLL Shared Task Data

English Gigaword Fourth Edition

GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2


LDC at MEDAR Conference

LDC was pleased to attend the 2nd International Conference on Arabic Language Resources and Tools recently held in Cairo, Egypt.  The conference was organized by the Mediterranean Arabic Language and Speech Technology consortium (MEDAR), a new NEMLAR initiative.  LDC researchers presented papers on their recent work in various Arabic projects, including Treebank annotation, handwriting recognition and broadcast news collection and transcription (the last in collaboration with the Evaluations and Language resources Distribution Agency (ELDA)).  LDC’s Executive Director, Chris Cieri, discussed ways to share language resources across the region within the MEDAR framework.

Cieri and other conference attendees were interviewed by Emmy Adul Alim, a staff reporter for, a MEDAR sponsor. The resulting article, “The Breakthrough of Arabic Language Technologies,” discusses the accomplishments and challenges of creating accessible Arabic human language technologies. Cieri highlighted LDC’s work with al-hakawati, the Arab Cultural Trust, to identify and digitize Arabic heritage texts. Al-hakawati makes the digitized materials immediately available to end users on its website, and LDC is developing a database of these texts that scholars can use to study language change over time and across genres.

You can view LDC papers and poster presentations, including those from the MEDAR Conference, on our Papers page.  Papers date from 1998 forward and most can be downloaded in PDF format.  Presentation slides and posters are available for several papers as well.

Early Renewing LDC Members Saved Big!

The numbers are in and LDC's early renewal discount program was a success!  Nearly 100 organizations that renewed membership or joined early received a discount on fees for Membership Year (MY) 2009. Taken together, these members saved over US$50,000!  MY 2008 members are reminded that they are still eligible for a 5% discount when renewing. This discount will apply throughout 2009, regardless of time of renewal.

By joining for MY 2009, any organization can take advantage of membership benefits including free membership year data as well as deep discounts on older LDC corpora.  Please visit our Members FAQ for further information.

Membership Mailbag:  Navigating the LDC Intranet - Part 2

LDC's Membership office responds to a few thousand emailed queries a year, and, over time, we've noticed that some questions tend to crop up with regularity.  To address the questions that you, our data users, have asked, we'd like to continue our Membership Mailbag series of newsletter articles.  Last month we focused on a few features of the LDC Intranet including establishing an account and using that account to access information about your organization's history with LDC. This month, we'll take a look into using your account to access password-protected corpora and resources.

LDC's Intranet contains the following links:

Customer Profile
LDC Online
Corpora Available for Download

This month we focus on the LDC Online and Corpora Available for Download sections.  After registering for an LDC Intranet account, users can access LDC Online through both the LDC Intranet and the LDC Online page on LDC's website.  LDC Online contains an indexed collection of Arabic, Chinese and English newswire text, millions of words of English telephone speech from the Switchboard and Fisher collections and the American English Spoken Lexicon, as well as the full text of the Brown corpus.

To download corpora that your organization has licensed, visit the Corpora Available for Download section.  This section contains all web-download corpora the organization has licensed, with the most recently invoiced requests listed first.  Any registered user of an organization can utilize the web-download service at any time to view and access the corpora that have been invoiced for delivery over the web.  This section will not contain all corpora that an organization has licensed, only those small enough for web-download.

Recently, LDC has made available for web-download some popular resources which were previously distributed only on disc.  These resources include TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), CELEX2 (LDC96L14), and Treebank-3 (LDC99T42).  If an organization has obtained a license to any of these resources, registered users can simply log in to download the data, thereby eliminating the need to locate the copy on disc or license a new copy.

Got a question?  About LDC data?  Forward it to .  The answer may appear in a future Membership Mailbag article.