LDC was awarded a ‘2nd META Prize’ from META-NET ‘for outstanding long term commitment to the preparation and distribution of language resources and technologies’.
The META Prize is awarded by META-NET to those who provide outstanding products or services that support the European Multilingual Information Society. META-NET is a Network of Excellence dedicated to fostering the technological foundations of a multilingual European information society. Several organizations were honored at this year’s META Forum in Budapest; LDC and ELRA were both honored for supporting and developing language resources.
LDC is proud to announce that our founder and Director, Mark Liberman, was awarded the 2010 Antonio Zampolli prize at LREC2010. This prestigious honor is given by ELRA’s board members to recognize “outstanding contributions to the advancement of language resources and language technology evaluation within human language technologies”.
LDC and its research partner, Oxford University, form one of eight international research teams awarded the first Digging into Data Challenge grants for projects that promote innovative humanities and social science research using large-scale data analysis. Four leading research agencies sponsor the international competition: the Joint Information Systems Committee (JISC) from the United Kingdom, the National Endowment for the Humanities and the National Science Foundation (NSF) from the United States, and the Social Sciences and Humanities Research Council from Canada.
LDC and Oxford University (with the participation of the British Library) have been funded by NSF and JISC, respectively, for a project entitled “Mining a Year of Speech,” which will focus on creating tools to enable rapid and flexible access to more than 9,000 hours of spoken audio files. Those files contain a wide variety of speech drawn from some of the leading British and American spoken word corpora, allowing for new kinds of linguistic analysis.
Further information about the Digging into Data Challenge can be found on the project website.
The data sheets were distributed on FSC-certified 30% recycled paper and were printed using environmentally friendly toner. FSC certification means that the process that produced the paper, from seed to final sheet, complies with international laws and treaties, employs fair labor standards, and respects and conserves environmental resources.
LDC intends to expand the breadth of data sheet categories and the depth of information provided within each category. This will help to accurately represent our organization and highlight our staff’s research and development efforts.
LDC distributes a host of resources which are available for free. These resources include tools and corpora developed at LDC as well as corpora made available through LDC's strong network of data providers.
Since LDC's founding, we have distributed over 1300 copies of corpora at no cost including:
- over 700 non-member downloads of Buckwalter Arabic Morphological Analyzer 1.0
- 400 copies of TalkBank-sponsored data including popular releases such as the American National Corpus and the Santa Barbara Corpora of Spoken American English
- nearly 200 copies of Web 1T 5-gram Version 1, sponsored by Google Inc.
- over 30 copies of TimeBank 1.2
- over a dozen copies of the corpora developed for the Unified Linguistic Annotation (ULA) project
LDC is pleased to announce that the LDC Corpus Catalog has been awarded a five-star quality rating, the highest rating available, by the Open Language Archives Community (OLAC). OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. LDC supports OLAC and is among the 37 participating archives that have contributed over 36,000 records to the combined catalog of language resources. OLAC seeks to refine the quality of the metadata in catalog records in order to improve the quality of the searches that users can run over that catalog. When resources are described following the best practice guidelines established by OLAC, it is more likely that all the resources returned by a query are relevant (precision) and that all relevant resources are returned (recall).
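To make the two search-quality measures concrete, here is a minimal sketch in Python. The function name `precision_recall` and the record IDs are hypothetical and chosen only for illustration; they are not part of OLAC or the LDC catalog.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query, given collections of record IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # records that are both returned and relevant
    # Precision: share of returned records that are relevant.
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: share of relevant records that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical record IDs, for illustration only.
returned = {"rec1", "rec2", "rec3", "rec4"}
relevant = {"rec2", "rec3", "rec5"}
p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

In this toy query, two of the four returned records are relevant (precision 0.50), and two of the three relevant records were found (recall 0.67).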
Metadata in several fields of the LDC catalog was missing, inaccurate or non-compliant with OLAC standards. Over a period of a few months, a team at LDC took several steps to make that metadata OLAC-compliant. Most significantly, the language name and the language ID for over 400 corpora were reviewed and, where required, changed to conform to the new standard for language identification, ISO 639-3. Additional efforts focused on providing author information for all corpora and fixing dead links. Finally, the team added a new metadata field to consistently document the "type" of each resource, using a standard vocabulary from the digital libraries community called DCMI-Type, which reliably distinguishes text and sound resources. These revisions improve LDC's management of resources in the catalog and help LDC users quickly identify all corpora relevant to their research.
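The kinds of checks described above could be sketched roughly as follows. This is an illustration under stated assumptions, not LDC's actual tooling: the record layout and the field names `language_id`, `creator` and `type` are hypothetical, and the language check only tests that an ID has the three-lowercase-letter shape of an ISO 639-3 code, not that the code is actually registered.

```python
import re

# The twelve terms of the DCMI Type Vocabulary.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}

# ISO 639-3 identifiers are three lowercase letters. This tests the shape only.
ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")


def metadata_problems(record):
    """List the problems found in one catalog record (hypothetical field names)."""
    problems = []
    if not ISO_639_3_SHAPE.fullmatch(record.get("language_id", "")):
        problems.append("language_id is not a three-letter code")
    if not record.get("creator"):
        problems.append("missing creator")
    if record.get("type") not in DCMI_TYPES:
        problems.append("type is not a DCMI Type term")
    return problems


# Hypothetical records, for illustration only.
good = {"language_id": "eng", "creator": "Example Author", "type": "Text"}
bad = {"language_id": "en", "creator": "", "type": "audio"}
print(metadata_problems(good))  # []
print(len(metadata_problems(bad)))  # 3
```

A fuller check would look each language ID up in the published ISO 639-3 code tables rather than testing its shape alone.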
The numbers are in and LDC's early renewal discount program was a success! Nearly 100 organizations who renewed membership or joined early received a discount on fees for Membership Year (MY) 2009. Taken together, these members saved over US$50,000! MY 2008 members are reminded that they are still eligible for a 5% discount when renewing. This discount will apply throughout 2009, regardless of time of renewal.
By joining for MY 2009, any organization can take advantage of membership benefits including free membership year data as well as deep discounts on older LDC corpora. Please visit our Members FAQ for further information.
LDC is pleased to announce that the U.S. Department of Education, International Education Programs Service, has funded a collaboration between LDC and Georgetown University Press (GUP) to create up-to-date lexical databases, with translations to and from English, for three dialects of colloquial Arabic. The databases will be used for interactive computer access and for new print publications of dictionaries in Iraqi, Syrian/Levantine and Moroccan dialects.
The databases will be based on three GUP source dictionaries: A Dictionary of Iraqi Arabic, English-Arabic, Arabic-English (Clarity, et al., 2003), A Dictionary of Syrian Arabic, English-Arabic (Stowasser and Ani, 2004) and A Dictionary of Moroccan Arabic, Arabic-English, English-Arabic (Harrell and Sobelman, 2004). Drawing on contemporary principles of computational linguistics and current pedagogical requirements, the work will reflect current vocabulary and usage, provide a standardized system of transcription and use the Arabic script, both vocalized and unvocalized, to show vowel pronunciation as well as standard orthography. A searchable version on CD-ROM will accompany each print reference. The project has been funded for three years. Work will commence in Year 1 with the Iraqi Arabic dictionary, proceed to the Syrian/Levantine dictionary and conclude with the Moroccan Arabic dictionary.
The proposed dictionaries and databases aim to provide U.S. students and teachers of Arabic with current dialectal Arabic lexical information to enable them to communicate orally with native and non-native Arabic speakers. The scholarship used to create a modernized transcription system and to provide existing and new terms in Arabic script (including diacritics) may also help integrate instruction in dialect and Modern Standard Arabic by providing tools for curriculum developers.
The ACL Anthology is a digital archive of 12,500 research papers in computational linguistics, stretching back to 1965. All papers are available for free download. Steven Bird established the anthology in 2001, while he was associate director at the LDC. The initial digitization of 50,000 pages of articles was possible through the generous support of institutional and individual sponsors. For the next six years, the anthology was hosted on the LDC website, and it came to play a central role in the day-to-day work of computational linguists the world over. Today, conference proceedings are added to the Anthology at the time of each conference, providing immediate free access to the latest research findings. In 2007, the digitization of legacy materials was completed and the anthology was migrated to the website of the Association for Computational Linguistics. Steven passed on the editorship to Min-Yen Kan. Ongoing activities with the anthology include citation linking and extraction of raw text. The LDC is pleased to have contributed to the development of the anthology and wishes the current editor continued success in providing this valuable resource. Visit the ACL website for further information on ACL conferences, membership, and publications.