June 2012 Newsletter

Friday, June 15, 2012

New Corpora

Arabic-Dialect/English Parallel Text

Prague Czech-English Dependency Treebank 2.0


LDC at LREC 2012

LDC attended the 8th Language Resources and Evaluation Conference (LREC 2012), hosted by ELRA, the European Language Resources Association. The conference was held in Istanbul, Turkey, and featured a broad range of sessions on language resources and human language technologies research. Fourteen LDC staff members presented current work on topics including handwriting recognition, word alignment, treebanks, machine translation and information retrieval, as well as initiatives for synchronizing metadata practices in sociolinguistic data collection.

The LDC Papers page now includes research papers presented at LREC 2012. Most papers are available for download in PDF format; presentation slides and posters are available for several papers as well. On the Papers page, you can read about LDC's role in resource creation to support handwriting recognition and translation technology (Song et al., 2012). LDC is developing resources to support two research programs: Multilingual Automatic Document Classification, Analysis and Translation (MADCAT) and Open Handwriting Recognition and Translation (OpenHaRT). To support these programs, LDC is collecting handwritten samples of pre-processed Arabic and Chinese data that had previously been translated into English. To date, LDC has collected and annotated over 225,000 handwriting images.

Additionally, you can learn about LDC's efforts to collect and annotate very large corpora of user-contributed content in multiple languages (Garland et al., 2012). For the Broad Operational Language Translation (BOLT) program, LDC is developing resources to support genre-independent machine translation and information retrieval systems. In the current phase of BOLT, LDC is collecting and annotating threaded posts from online discussion forums, targeting at least 500 million words in each of three languages: English, Chinese, and Egyptian Arabic. A portion of the data undergoes manual, multi-layered linguistic annotation.