June 2010 Newsletter | Linguistic Data Consortium

Announcements

LDC at LREC2010
LDC attended the 7th biennial Language Resources and Evaluation Conference (LREC2010), hosted by ELRA, the European Language Resources Association. Fourteen LDC staff members presented current research on a wide range of topics, including word alignment, treebanks, machine translation and speech studies, as well as updates on language resource development, initiatives for cataloging and distributing resources, and legal issues associated with sharing language resources. The conference was held in Valletta, Malta and featured interesting and invigorating research sessions that brought together over 1200 attendees from around the globe.

LDC was also invited to participate in the Projects Village alongside a dozen European Commission-funded projects. We would like to thank ELRA for this unique opportunity as well as the numerous conference attendees who stopped by the LDC table to say hello or learn more about the Consortium.

Mark Liberman, LDC Director, wins the 2010 Antonio Zampolli Prize
LDC is proud to announce that our founder and Director, Mark Liberman, was awarded the 2010 Antonio Zampolli prize at LREC2010. This prestigious honor is given by ELRA’s board members to recognize “outstanding contributions to the advancement of language resources and language technology evaluation within human language technologies”.

Mark’s prize talk, delivered on May 21, 2010 and entitled The Future of Computational Linguistics: or, What Would Antonio Zampolli Do?, discussed Antonio Zampolli’s far-reaching contributions to the language technology community and how his vision resonates in Mark’s research. Please join us in congratulating Mark on receiving this award.

Use of LDC Data by High School Students
Last month, LDC announced our new LDC Data Scholarship program, which will provide university students with no-cost access to LDC data. Please stay tuned for further announcements and application materials for the fall 2010 program cycle. LDC data is used for research by students at over 1000 universities and colleges across the globe. What our data users may not know is that even younger students are using LDC data in their research: our databases have been used by high school (secondary school) students for science fair projects. We’d like to highlight two intriguing examples.

Carmel High School, Carmel, CA, USA. Dylan Freedman, a student at Carmel High School in Carmel, CA, investigated the creation of an efficient English text compression algorithm based on the n-grams contained in the Web 1T 5-gram Version 1 (LDC2006T13) database. The algorithm he developed worked particularly well on small text files around 100 bytes long; for such files the average compression ratio was about 30%, whereas traditional compression algorithms were not able to compress such small files.

Dylan worked with two advisers, Craig Martell and George Dinolt, from the Naval Postgraduate School in Monterey, CA, who assisted in developing project ideas and in supplying the formal terminology for some of the natural language processing ideas he implemented. For example, although he didn’t originally know the term by name, Dylan used Katz's back-off model when querying the n-gram database to rank how often a word would occur given a few preceding context words.
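For readers curious about the idea of backing off to shorter contexts, here is a minimal Python sketch. It is not Dylan's implementation and it is not full Katz back-off, which also discounts counts and redistributes probability mass. It shows only the simpler step of ranking candidate next words by n-gram count and falling back to a shorter context when the longer one was never seen. The function names and the toy sentence are assumptions made for illustration, and the counts come from the toy sentence rather than from the Web 1T 5-gram data.

```python
from collections import Counter


def build_counts(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_next_words(counts, context, max_n=3):
    """Rank candidate next words, most frequent first.

    Simplified back-off (an assumption, not full Katz back-off):
    if the current context was never seen, drop its oldest word
    and try again, ending with plain unigram counts.
    """
    # Keep at most max_n - 1 words of context.
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    while True:
        k = len(context)
        candidates = {
            gram[-1]: c
            for gram, c in counts.items()
            if len(gram) == k + 1 and gram[:k] == context
        }
        if candidates or not context:
            break
        context = context[1:]  # back off to a shorter context
    # Sort by descending count, breaking ties alphabetically.
    return [w for w, _ in sorted(candidates.items(),
                                 key=lambda kv: (-kv[1], kv[0]))]


# Toy data standing in for a real n-gram database.
tokens = "the cat sat on the mat the cat ate the fish".split()
counts = build_counts(tokens)

print(rank_next_words(counts, ["on", "the"]))   # seen trigram context
print(rank_next_words(counts, ["dog", "the"]))  # backs off to "the"
```

A compressor built on this idea could store the position of the actual next word in the ranked list instead of the word itself, since small positions take few bits to encode. Whether Dylan's algorithm worked exactly this way is not stated here.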

His project, A Novel Approach to Text Compression Using N-Grams, won the Grand Prize at the Monterey County Science Fair, which enabled Dylan to participate in the Intel International Science and Engineering Fair (Intel ISEF) earlier this year in San Jose, CA. At Intel ISEF, Dylan won awards from the IEEE Computer Society and the Office of Naval Research.

Albuquerque Public Schools, Albuquerque, NM, USA. In 2005, students enrolled in a Computer Science class in the Career Enrichment Center at Albuquerque Public Schools, Albuquerque, NM, used the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) and The CMU Kids Corpus (LDC97S63) for a science fair project. A team of three students used a neural network to see if they could identify characteristics of a speaker; in particular, they wanted to see if they could differentiate speakers by age (above or below a certain age) and by gender. To complete the project, the students trained the neural network using the LDC speech data. The students received recognition at both the district and state levels of the New Mexico Science and Engineering Fair.

LDC would like to thank the high school students who contributed to this article and wish them continued success in their studies!