June 2009 Newsletter

Thursday, June 18, 2009

New Corpora

GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1

Tagged Chinese Gigaword Version 2.0


LDC at 2009 ALA Annual Conference

We are pleased to announce that the Linguistic Data Consortium will be exhibiting at the American Library Association’s (ALA) Annual Conference in Chicago from July 11-14, 2009. In accordance with ALA’s conference policies, LDC’s members, friends and associates are eligible to receive a FREE exhibition pass for the duration of the conference (a $25 savings). ALA’s annual meeting typically attracts over 20,000 attendees and over 600 exhibitors, including publishing houses, universities, university presses and many related organizations. Please follow this link to take advantage of this special offer:


You may forward this link to any student, coworker or colleague who you think would be interested in viewing the exhibits at ALA 2009. The main conference runs from July 9-15 and covers a wide range of topics related to information science, traditional library science, digital cataloging and more. Please follow these links for additional information on the main conference:

ALA Annual Conference main page |  Current exhibitors’ list  |  Main conference registration page

LDC will be exhibiting at small press table #1143. We hope to see you there!

LDC at NAACL 2009

The North American Chapter of the Association for Computational Linguistics (NAACL) met at the University of Colorado at Boulder from May 31 to June 4, 2009. LDC was pleased to co-sponsor the entertainment at the gala dinner on June 2. NAACL featured a diverse collection of research papers, and you may access the conference program here.

As a reminder, ACL’s annual meeting will be held in Singapore from August 2-7, 2009. Please click here to learn more about this conference and the ACL community.

LDC Introduces its Standard Arabic Morphological Tagger

At a recent LDC Institute seminar, Rushin Shah, a visiting scholar at LDC, presented a new tool for corpus annotation, the Standard Arabic Morphological Tagger (SAMT). The current process of Arabic corpus annotation at LDC relies on the Standard Arabic Morphological Analyzer (SAMA) to generate candidate morphology and lemma choices, which are supplied to manual annotators who then pick the correct one. SAMA can generate dozens of choices for each word and does not provide any information about the likelihood that a particular choice is correct. SAMT addresses these problems by ranking the choices in order of probability with a high degree of accuracy, thereby reducing annotation time.
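For readers who want a concrete picture of what "ranking choices in order of their probabilities" means, here is a minimal Python sketch. It is not SAMT's implementation or the SAMA interface: the `Analysis` class, the `rank_analyses` function, the placeholder lemmas and tags, and the scores are all hypothetical, and the sketch simply assumes that some model has already assigned each candidate a non-negative score.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Analysis:
    """One candidate analysis for a word (hypothetical structure)."""
    lemma: str
    morphology: str
    score: float  # assumed non-negative score from some model


def rank_analyses(candidates: List[Analysis]) -> List[Tuple[Analysis, float]]:
    """Normalize scores to probabilities and sort, most likely first."""
    if not candidates:
        return []
    total = sum(c.score for c in candidates)
    if total <= 0:
        # No usable scores: fall back to a uniform distribution.
        probs = [1.0 / len(candidates)] * len(candidates)
    else:
        probs = [c.score / total for c in candidates]
    return sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)


# Placeholder candidates; the lemmas, tags and scores are invented.
candidates = [
    Analysis("lemma_a", "NOUN", 1.0),
    Analysis("lemma_b", "VERB", 6.0),
    Analysis("lemma_c", "ADJ", 3.0),
]

ranked = rank_analyses(candidates)
for analysis, prob in ranked:
    print(f"{analysis.lemma}\t{analysis.morphology}\t{prob:.2f}")
```

With a ranked list like this, an annotator would see the most probable choice first instead of scanning dozens of unordered candidates, which is where the time savings described above would come from.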

You can view abstracts and presentation slides of this and other presentations in LDC's seminar series on data creation on our LDC Institute project page.