

Recent Announcements from the LDC

LDC to Close for Thanksgiving Day

LDC will be closed on Thursday, November 26, 2009 and Friday, November 27, 2009 for the Thanksgiving Day holiday. Our offices will reopen on Monday, November 30, 2009.


LDC Incentives: Early Renewal Discounts for Membership Year (MY) 2010

LDC appreciates the important contribution LDC members make through their continued support of the consortium. We would like to invite all current and previous members of LDC to renew, as well as new members to join, for Membership Year (MY) 2010. For MY2010, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase. Additionally, in October's newsletter, we announced an LDC Incentives Package which will include a host of incentives to help lower the cost of LDC membership and data licensing fees. As part of this package, LDC will extend discounts to members who keep their membership current and who join early in the year.

The details of our Early Renewal Discounts for MY2010 are as follows:

  • Organizations that joined for MY2009 will receive a 5% discount when renewing. This discount will apply throughout 2010, regardless of time of renewal. MY2009 members renewing before March 1, 2010 will receive an additional 5% discount, for a total 10% discount off the membership fee.

  • New members, as well as organizations that did not join for MY2009 but held membership in any previous Membership Year (1993-2008), will also be eligible for a 5% discount, provided that they join or renew before March 1, 2010.

The Membership Fee Table provides exact pricing information:

                   MY2010 Fee    MY2010 Fee with 5% Discount    MY2010 Fee with 10% Discount
Not-for-Profit
  Standard         US$2400       US$2280                        US$2160
  Subscription     US$3850       US$3657.50                     US$3465
For-Profit
  Standard         US$24000      US$22800                       US$21600
  Subscription     US$27500      US$26125                       US$24750
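
The discount rules and fee table above can be checked with a short calculation. The following is a minimal sketch in Python, not an official LDC tool: the function name `my2010_fee`, its parameters, and the category labels are illustrative assumptions, while the base fees, percentages, and the March 1, 2010 cutoff come from the announcement.

```python
from datetime import date
from decimal import Decimal

# Base MY2010 fees in US$, copied from the Membership Fee Table.
BASE_FEES = {
    ("not-for-profit", "standard"): Decimal("2400"),
    ("not-for-profit", "subscription"): Decimal("3850"),
    ("for-profit", "standard"): Decimal("24000"),
    ("for-profit", "subscription"): Decimal("27500"),
}

# The early discount applies to organizations joining/renewing *before* this date.
EARLY_DEADLINE = date(2010, 3, 1)


def my2010_fee(org_type, membership, was_my2009_member, join_date):
    """Return the MY2010 fee after Early Renewal Discounts (hypothetical helper)."""
    base = BASE_FEES[(org_type, membership)]
    early = join_date < EARLY_DEADLINE
    if was_my2009_member:
        # 5% throughout 2010, plus an additional 5% when renewing early.
        percent = 10 if early else 5
    else:
        # New and returning (pre-MY2009) members: 5% only when joining early.
        percent = 5 if early else 0
    return base * (100 - percent) / 100


if __name__ == "__main__":
    fee = my2010_fee("not-for-profit", "subscription", True, date(2010, 2, 1))
    print(f"US${fee}")
```

Using `Decimal` rather than floating point keeps amounts such as US$3657.50 exact.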

Publications for MY2010 are still being planned but it will be another productive year with a broad selection of publications. The working titles of data sets we intend to provide include:

  • Arabic Treebank: Part 2 v 4.0
  • Chinese Treebank 7.0
  • Chinese Web N-gram Version 1.0
  • Fisher Spanish
  • LCTL Bengali
  • NPS Chat Corpus

In addition to receiving new publications, current year members of the LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.

This past year, nearly 100 organizations who renewed membership or joined early received a discount on membership fees for MY2009. Taken together, these members saved over US$50,000! All LDC Members have been sent an invitation to join letter and renewal invoice for MY2010. Renew early for MY2010 and save today!


LDC at NWAV 38

LDC exhibited at NWAV for the third straight year. We were delighted to interact with so many talented sociolinguistic researchers and to introduce numerous attendees to LDC and our data catalog. LDC distributed free copies of both the SLX Corpus of Classic Sociolinguistic Interviews, as per the terms of the Talkbank grant, and the 2008 LDC Spoken Language Sampler, which is available for download here. We also distributed many of our newly minted data sheets, including one featuring the speech annotation tool XTrans. This tool is also freely available from our website in Linux and Windows formats.

LDC’s Executive Director, Chris Cieri, and Senior Associate Director, Stephanie Strassel, presented papers on the following topics:

Thanks again to everyone who stopped by our display and we look forward to seeing you again next year!


Release of XTrans

At Interspeech 2009, LDC introduced XTrans, a new tool for manual transcription and annotation of audio recordings.  XTrans is a next generation transcription tool that is designed to support transcription tasks in multiple languages on multiple platforms.   XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics.  Its versatile and powerful waveform display/playback component can load multiple audio files of different file formats and sampling rates at the same time. LDC and its partners have used XTrans to generate over 3500 hours of time-aligned verbatim transcripts in a variety of genres and languages.

With an intuitive interface, user configurability and embedded QC functions, XTrans is optimized for high-quality, high-volume transcription tasks involving real world data. XTrans successfully addresses the challenges of real world data including transcribing multiple speakers in a single channel through Virtual Speaker Channel, which enables an unlimited number of distinct speakers to be associated with the same audio channel.  Furthermore, XTrans allows transcribers to open an effectively unlimited number of audio files for simultaneous transcription. Transcribers can switch focus between one, two or multiple speakers as needed.  XTrans also provides strong multilingual support, with bidirectional text input for languages like Arabic, Farsi, Urdu, and Hebrew.

Realtime transcription rates have improved dramatically in LDC projects using XTrans, with rates for some tasks cut by as much as half.   XTrans also brings key quality control functions directly into the interface, giving transcribers the power to improve the quality of their own work.  XTrans components are written in Python and C++, utilizing LDC's QWave waveform display module. Even with very large files or multiple recordings, XTrans provides users with fast display and playback capabilities.  A range of audio formats is supported, including .sph, .wav, .aiff, .flac, and .ogg. Transcripts are output in a Tab Delimited Format (TDF), which is easily converted to other common formats and is readily usable by downstream manual and automatic annotation tasks.

Availability:

XTrans for Linux and Windows platforms is available from the LDC at no cost under GPLv3 and can be downloaded here.


LDC at Interspeech 2009

LDC is pleased to announce its participation at Interspeech 2009 in Brighton, UK, September 6-10, 2009. LDC researchers will present papers on the following topics (conveniently in the same session):

  • XTrans: A Speech Annotation and Transcription Tool
    Thursday 10 September 2009, Session 2-O4, 13.30 (paper #3)
  • The Broadcast Narrow Band Speech Corpus: A New Resource Type for Large Scale Language Recognition
    Thursday 10 September 2009, Session 2-O4, 13.30 (paper #6)

Two papers co-authored by LDC's director, Mark Liberman, will also be presented:

  • Automatic Formant Extraction for Sociolinguistic Analysis of Large Corpora (co-authors Keelan Evanini, Stephen Isard)
    Wednesday 9 September 2009, Session 1-P1 10:00 (paper #3)
  • Investigating /l/ Variation in English through Forced Alignment (co-author Jiahong Yuan)
    Wednesday 9 September 2009, Session 3-O2 16:00 (paper #5)

Visit our display in the exhibition hall at the Brighton Centre on Kings’ Road for a special giveaway or just to say hello.

Follow the link for more information on Interspeech 2009.


LDC at ALA 2009 - Update

LDC is happy to report that our exhibit at ALA 2009 went off without a hitch! The American Library Association’s (ALA) annual conference was located in Chicago, IL and attracted over 20,000 exhibitors and attendees. We met our members face to face, connected with new associates in a dynamically-hosted conference setting and learned more about the needs of the library community. We hope to see you all next year!

Follow the link for additional information on the ALA Conference.


LDC at ALA 2009 and NAACL 2009

LDC at 2009 ALA Annual Conference

We are pleased to announce that LDC will be exhibiting at the American Library Association’s (ALA) Annual Conference in Chicago from July 11-14, 2009. The main conference lasts from July 9-15 and covers a wide range of topics related to information science, traditional library science, digital cataloging and more. Please follow these links for additional information on the main conference:

ALA Annual Conference main page  |  Current exhibitors’ list  |  Main conference registration page

LDC will be exhibiting at small press table #1143. We hope to see you there!

LDC at NAACL 2009

The North American Chapter of the ACL (Association for Computational Linguistics), NAACL, met at the University of Colorado at Boulder from May 31 - June 4. LDC is happy to report that we co-sponsored the entertainment at the festive gala dinner on June 2nd. NAACL featured a diverse collection of research papers and you may access the conference program here.

ACL’s annual meeting will be held in Singapore from August 2-7, 2009. Please click here to learn more about this conference and the ACL community.


LDC at MEDAR Conference

LDC was pleased to attend the 2nd International Conference on Arabic Language Resources and Tools recently held in Cairo, Egypt. The conference was organized by the Mediterranean Arabic Language and Speech Technology consortium (MEDAR), a new NEMLAR initiative. LDC researchers presented papers on their recent work in various Arabic projects including Treebank annotation, handwriting recognition and broadcast news collection and transcription (the latter in collaboration with the Evaluations and Language resources Distribution Agency (ELDA)). LDC’s Executive Director, Chris Cieri, discussed ways to share language resources across the region within the MEDAR framework.

Cieri and other conference attendees were interviewed by Emmy Adul Alim, a staff reporter for IslamOnline.net, a MEDAR sponsor. The resulting article, “The Breakthrough of Arabic Language Technologies,” discusses the accomplishments and challenges of creating accessible Arabic human language technologies. Cieri highlighted LDC’s work with al-hakawati, the Arab Cultural Trust, to identify and digitize Arabic heritage texts. Al-hakawati makes the digitized materials immediately available on its website to end users, and LDC is developing a database of these texts that scholars can study for language change over time and across genres.

You can view LDC papers and poster presentations, including those from the MEDAR Conference, on our Papers page. Papers date from 1998 forward and most can be downloaded in PDF format. Presentation slides and posters are available for several papers as well.


Early Renewing LDC Members Saved Big!

The numbers are in and LDC's early renewal discount program was a success! Nearly 100 organizations who renewed membership or joined early received a discount on fees for Membership Year (MY) 2009. Taken together, these members saved over US$50,000! MY 2008 members are reminded that they are still eligible for a 5% discount when renewing. This discount will apply throughout 2009, regardless of time of renewal.

By joining for MY 2009, any organization can take advantage of membership benefits including free membership year data as well as deep discounts on older LDC corpora. Please visit our Members FAQ for further information.


LDC's Corpus Catalog Receives Top OLAC Rating

LDC is pleased to announce that The LDC Corpus Catalog has been awarded a five-star quality rating, the highest rating available, by the Open Language Archives Community (OLAC). OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. LDC supports OLAC and is among the 37 participating archives who have contributed over 36,000 records to the combined catalog of language resources. OLAC seeks to refine the quality of the metadata in catalog records in order to improve the quality of searching that users can do over that catalog. When resources are described following the best practice guidelines established by OLAC, it increases the likelihood that all the resources returned by a query are relevant (precision) and that all relevant resources are returned (recall).

Certain metadata in the LDC catalog was missing, inaccurate and/or non-compliant with OLAC standards for several fields. Over a period of a few months, a team at LDC took several steps to make that metadata OLAC-compliant. Most significantly, the language name and the language ID for over 400 corpora were reviewed and changed when required to conform to the new standard for language identification, ISO 639-3. Additional efforts focused on providing author information for all corpora and fixing dead links. Finally, the team added a new metadata field to consistently document the "type" of each resource, using a standard vocabulary from the digital libraries community called DCMI-Type, reliably distinguishing text and sound resources. The benefits of these revisions include improving LDC's management of resources in the catalog as well as assisting LDC users to quickly identify all corpora which are relevant to their research.


LDC Spoken Language Sampler

The LDC Spoken Language Sampler provides a range of speech, transcript and lexicon samples and is designed to illustrate the variety and breadth of the resources available from LDC’s Catalog. Created for distribution at NWAV 37 and geared towards sociolinguists, the sampler is a good introduction to data available from the LDC. The sampler includes excerpts from telephone conversations in Arabic (Gulf, Iraqi, and Levantine dialects), Farsi, Japanese, Korean, Spanish, and Tamil; dictionary resources for Mawukakan and Tamil; transcribed meeting speech; utterances in Russian from native and non-native speakers; and speech samples which represent regional accents and dialects of the United States. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.

The sampler can be downloaded for free from the catalog page for the LDC Spoken Language Sampler. Please scroll down to 'How to Obtain' for a download link.


Collaboration between LDC and Georgetown University Press

LDC is pleased to announce that the U.S. Department of Education, International Education Programs Service, has funded a collaboration between LDC and Georgetown University Press (GUP) to create up-to-date lexical databases, with translations to and from English, for three dialects of colloquial Arabic. The databases will be used for interactive computer access and for new print publications of dictionaries in Iraqi, Syrian/Levantine and Moroccan dialects.

The databases will be based on three GUP source dictionaries: A Dictionary of Iraqi Arabic, English-Arabic, Arabic-English (Clarity, et al., 2003), A Dictionary of Syrian Arabic, English-Arabic (Stowasser and Ani, 2004) and A Dictionary of Moroccan Arabic, Arabic-English, English-Arabic (Harrell and Sobelman, 2004). Drawing on contemporary principles of computational linguistics and current pedagogical requirements to reflect current vocabulary and usage, the work will provide a standardized system of transcription and use the Arabic script, both vocalized and unvocalized, to show vowel pronunciation as well as standard orthography. A searchable version on CD-ROM will accompany each print reference. The project has been funded for three years. Work will commence in Year 1 with the Iraqi Arabic dictionary, proceed to the Syrian/Levantine dictionary and conclude with the Moroccan Arabic dictionary.

The proposed dictionaries and databases aim to provide U.S. students and teachers of Arabic with current dialectal Arabic lexical information to enable them to communicate orally with native and non-native Arabic speakers. The scholarship used to create a modernized transcription system and to provide existing and new terms in Arabic script (including diacritics) may also help integrate instruction in dialect and Modern Standard Arabic by providing tools for curriculum developers.



2007 Member Survey Responses

Please click here to access a summary of the responses to Questions 1-15 of the 2007 Member Survey. These questions were sent to all survey recipients.

We also received many suggestions for future releases, among them:

  • More African language publications
  • Gigaword corpora in additional languages
  • More annotated data for a greater variety of uses
  • More parallel text corpora
  • Web blogs and chat room data

Some of those requests represent data in our 2008 publications pipeline.

The winner of the blind drawing for the $500 benefit for survey responses received by January 14, 2008 is Richard Rose of McGill University. Congratulations!

To all survey respondents: As promised, a more detailed analysis of the survey will be arriving within the next few weeks. Stay tuned!



50,000th LDC Corpus Distributed!

Last year marked the LDC's 15th Anniversary Year and it proved to be an exciting one for the LDC. We commemorated this anniversary with a Fidelity Celebration which rewarded our loyal members who continually support the consortium through membership. Additionally, we provided our list serve readers with a glimpse into the research activities at the LDC through each of our monthly Spotlights.

At the very end of our anniversary year, the LDC observed another significant milestone: the distribution of our 50,000th publication! This corpus was licensed by Helsinki University of Technology, Adaptive Informatics Research Centre (AIRC). AIRC's research includes basic algorithmic analysis, multimodal interfaces (speech, vision and language), bioinformatics, neuroinformatics and computational cognitive systems. In appreciation, the LDC is offering Helsinki University of Technology a US$2000 benefit to be used towards membership or data licensing fees.

We would like to thank both members and nonmembers for helping the LDC reach this landmark distribution. Your persistent demand for LDC data supports our mission to develop and share resources for research in human language technologies.



15th Anniversary Fidelity Celebration

As promised in our June 2007 newsletter, the LDC is holding a Fidelity Celebration in honor of our 15th Anniversary. We want to thank our members for their commitment in establishing and supporting our consortium and we have decided that rewarding your loyalty is the best way to do it!

Eligibility - Organizations that have been consecutive members of the LDC since at least 2005 (through 2007 inclusive) are eligible for benefits that can be used for corpora purchases, reduced license fees, extra copy fees or membership discounts; it is entirely up to you!

Here’s how it works:

  • Any organization that has been a member for 3-4 consecutive years (2004, 2005 through 2007) is eligible to receive a $250 benefit

  • Non-Profit organizations that have been members for 5-9 consecutive years (1999 - 2003 through 2007) are eligible to receive a $500 benefit while For-Profit organizations are eligible to receive a $1500 benefit

  • Non-Profit organizations that have been members for 10-15 consecutive years (1993 – 1998 through 2007) are eligible to receive a $3500 benefit while For-Profit organizations that have been members for 10-15 consecutive years (1993 – 1998 through 2007) are eligible to receive a $7500 benefit

Notification and Terms – The primary contacts at each qualifying organization were notified on June 20, 2007 via email. One benefit award will be made for every 20 organizations in each group. Therefore, if there are 23 members who have been consecutive members for 3-4 years, 1 prize will be awarded. If 49 members are eligible, 2 prizes will be awarded, etc. The blind drawing will be held on July 2, 2007, and winners will be immediately notified. The benefits are awarded to the member organization as a whole and must be used during calendar year 2007 on purchases made after notification. All benefits expire on December 31, 2007.
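
The "one award for every 20 organizations" rule amounts to integer division, as the two examples in the terms suggest. A minimal sketch in Python follows; the function name `prizes_awarded` is illustrative, and the behavior for groups with fewer than 20 eligible members is an assumption, since the terms do not address that case.

```python
def prizes_awarded(eligible_members, group_size=20):
    """Number of benefit awards for one eligibility group (hypothetical helper).

    One award is made per complete block of `group_size` eligible organizations.
    Assumption: a group smaller than `group_size` receives no award; the
    announcement does not say how such groups are handled.
    """
    if eligible_members < 0:
        raise ValueError("eligible_members must be non-negative")
    return eligible_members // group_size


if __name__ == "__main__":
    # The two cases given in the terms: 23 and 49 eligible members.
    for count in (23, 49):
        print(count, "eligible members:", prizes_awarded(count), "award(s)")
```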

Redemption - In order to redeem your benefit, please notify the Membership Coordinator, Ilya Ahtaridis, at the time of your order at ldc@ldc.upenn.edu. All other Fidelity Celebration concerns may be directed to Marian Reed at mreed@ldc.upenn.edu.

Thank you for your continued support from all of us at the LDC!


Celebrating 15 Years of Supporting the Language Technology Community

April 15, 2007 marks the start of the LDC's 15th Anniversary year! We have many milestones to celebrate, including the growth of our staff to over 40 full-time employees and an online catalog that includes over 350 linguistic databases. Since 1992, no fewer than 2,300 organizations from over 80 different nations have licensed LDC data.

Numbers aside, it is essential to note how greatly the LDC has evolved while still adhering to our goal to share language-technology resources. Our mission has grown to include linguistic data collection and annotation for an increasing number of areas of language research and engineering, as well as the development of language-related standards and tools. By collecting and creating data that we distribute, the LDC remains responsive to the changing needs of the research community that it has supported for fifteen years.

In each of our monthly newsletters, we will highlight one aspect of the LDC - from our work in human subject collections, to our progress in Arabic treebanking, to the technical challenges of collecting and storing high volumes of broadcast news.

As we celebrate throughout the year, look for new membership offerings and announcements. And be sure to join us as we count down to the much anticipated distribution of our 50,000th publication.


LDC Membership Options



Contact ldc@ldc.upenn.edu
Last modified: Monday, 23-Nov-2009 10:06:31 EST
© 1992-2009 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.