October 2009 Newsletter

Friday, September 12, 2014

New Corpora

2007 NIST Language Recognition Evaluation Test Set 

OntoNotes 3.0

Web 1T 5-gram, 10 European Languages Version 1



Things We Heard at Interspeech. . .
Thanks to those who visited our booth at Interspeech 2009 in Brighton! We learned a lot from the conference and were happy to make new friends and to reconnect with old members, friends and associates. We heard a few questions more than once and thought we should share our responses here.

1. Does LDC need to exhibit at Interspeech? Everyone knows who you are.

LDC’s mission is to make data easily available and this requires presenting our work to the community. At Interspeech, via our exhibition table, we introduced XTrans, our latest transcription tool, and a new kind of language resource, Broadcast Narrow Band Speech (BNBS). We also met a number of new researchers in the community and showed some old friends new and unfamiliar resources.

2. Don’t you ever give anything away for free?

Yes, we do! As we mentioned in last month’s newsletter, we have given away over 1300 copies of corpora, including releases sponsored by Talkbank, the Unified Linguistic Annotation Text Collection, TimeBank 1.2, FactBank 1.0 and Buckwalter Morphological Analyzer 1.0. Furthermore, we distribute tools at no cost, and many are open source. These include XTrans, AGTK, Champollion and SPHERE converter tools. See our What’s New! What’s Free! page for more information.

3. Aren’t you afraid that having a display makes LDC seem corporate?

Not at all! We felt it could be beneficial to host an Interspeech display in order to introduce some important new resources and provide an opportunity to meet directly with our members, understand their data requests and determine if we can meet them. LDC members are researchers and technology developers in the nonprofit, government and commercial sectors. Our goal is to meet the widest possible range of their needs.

LDC Data Sheets Now Available Online
In early 2009, LDC crafted data sheets to describe in concise form current and past projects, daily operations and our technical capabilities. Print versions of these documents debuted at Interspeech 2009 and have received positive feedback for both their content and design.

The data sheets were distributed on FSC-certified 30% recycled paper and were printed using environmentally friendly toner. FSC certification means that the process that developed the paper, from seed to final sheet, is in compliance with international laws and treaties so that it employs fair labor standards and respects and conserves environmental resources.

LDC intends to expand the breadth of data sheet categories and the depth of information provided within each category. This will help to accurately represent our organization and highlight our staff’s research and development efforts.

65,000th Corpus Distribution Countdown

The countdown to the distribution of our 65,000th corpus has officially begun!  Almost two years after observing our 50,000th distribution, LDC is nearing another milestone.  Which organization will be the recipient of our 65,000th publication? Continue to send in your requests for data!

Membership Mailbag: Is LDC Membership a Good Deal?

LDC's Membership Office responds to thousands of emailed queries a year, and, over time, we've noticed that some questions tend to crop up with regularity.  To address the questions that you, our data users, have asked, we'd like to continue our Membership Mailbag series of newsletter articles.  This month, we'll consider the reasons why LDC membership remains a smart use of funding.

LDC's extensive catalog is unmatched. It spans 17 years and includes over 400 multilingual speech, text, and video resources. LDC membership is an economical way to acquire multiple datasets from our catalog. In 2008, membership fees for university and research organizations were less than 9% of the total non-member cost of acquiring each 2008 publication separately. So, even if an organization only needs a few datasets from a given membership year, membership may be the most cost-effective way to obtain these corpora. Additionally, the generous discounts that member organizations receive on older corpora reduce the cost of acquiring such datasets.

LDC data has unlimited cross-departmental use within a university or organization.  Since there is no difference in cost between a departmental membership and one that is organization-wide, departments can combine resources and establish one LDC membership for use by the entire organization.  This helps university departments with smaller research budgets find more data within their reach.

An organization's license to LDC data is perpetual. Data acquired by LDC members during their membership years can be used anytime thereafter. If a university researcher established a membership in 1999, a student at that university in 2009 can use the same membership data without additional cost. If a member's research focus or staff changes, that member can still take advantage of free data from the particular membership year(s).

Got a question?  About LDC data?  Forward it to .  The answer may appear in a future Membership Mailbag article.

LDC Incentives Package in the Works

LDC encourages organizations that use the data we distribute to become members of the consortium.  For the reasons detailed above, membership is the most cost-effective and easiest way to obtain LDC data.  In the coming months, LDC will be rolling out a host of additional incentives for both members and non-members.  These incentives will help lower the costs to license some LDC data as well as sweeten the deal for LDC members.  Stay tuned...