

Recent Announcements from the LDC

2007 Member Survey Responses

Please click here to access a summary of the responses to Questions 1-15 of the 2007 Member Survey. These questions were sent to all survey recipients.

We also received many suggestions for future releases, among them:

  • More African language publications
  • Gigaword corpora in additional languages
  • More annotated data for a greater variety of uses
  • More parallel text corpora
  • Web blogs and chat room data

Some of those requests represent data in our 2008 publications pipeline.

The winner of the blind drawing for the $500 benefit for survey responses received by January 14, 2008 is Richard Rose of McGill University. Congratulations!

To all survey respondents: As promised, a more detailed analysis of the survey will be arriving within the next few weeks. Stay tuned!



2008 Publications Pipeline

Membership Year (MY) 2008 is shaping up to be another productive one for the LDC. We anticipate releasing a balanced and exciting selection of publications. Here is a glimpse of what is in the pipeline for MY2008. (Disclaimer: unforeseen circumstances may lead to modifications of our plans. Please regard this list as tentative.)


• BLLIP 1994-1997 News Text Release 1 - automatic parses for the North American News Text Corpus - NANT (LDC95T21). The parses were generated by the Charniak and Johnson Reranking Parser, which was trained on Wall Street Journal (WSJ) data from Treebank 3 (LDC99T42). Each file is a sequence of n-best lists containing the top n parses of each sentence with the corresponding parser probability and reranker score. The parses may be used in systems that are trained on labeled parse trees but require more data than is found in WSJ. Two versions will be released: a complete 'Members-Only' version which contains parses for the entire NANT Corpus and a 'Non Member' version for general licensing which includes all news text except data from the Wall Street Journal.


• Chinese Proposition Bank - the goal of this project is to create a corpus of text annotated with information about basic semantic propositions. Predicate-argument relations are being added to the syntactic trees of the Chinese Treebank Data. This release contains the predicate-argument annotation of 81,009 verb instances (11,171 unique verbs) and 14,525 noun instances (1,421 unique nouns). The annotation of nouns is limited to nominalizations that have a corresponding verb.


• GALE Phase 1 Arabic Newsgroup Parallel Text - contains a total of 178K words (264 files) of Arabic newsgroup text selected from 35 sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. Manual sentence units/segments (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Files were translated according to LDC's GALE Translation guidelines.


• GALE Phase 1 Chinese Newsgroup Parallel Text - contains a total of 240K characters (112 files) of Chinese newsgroup text selected from 25 sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. Manual sentence units/segments (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Files were translated according to the LDC's GALE Translation guidelines.


• Hindi WordNet - first wordnet for an Indian language. Similar in design to the Princeton Wordnet for English, it incorporates additional semantic relations to capture the complexities of Hindi. The WordNet contains 28604 synsets and 63436 unique words. Created by the NLP group at Indian Institute of Technology Bombay, it is inspiring construction of wordnets for many other Indian languages, notably Marathi.


• LCTL Bengali Language Pack - a set of linguistic resources to support technological improvement and development of new technology for the Bengali language, created in the Less Commonly Taught Languages (LCTL) project, which covered a total of 13 languages. Package components are: 2.6 million tokens of monolingual text, 500,000 tokens of parallel text, a bilingual lexicon with 48,000 entries, sentence and word segmenting tools, an encoding converter, a part of speech tagger, a morphological analyzer, a named entity tagger and 136,000 tokens of named entity tagged text, a Bengali-to-English name transliterator, and a descriptive grammar created by a PhD research linguist. About 30,000 tokens of the parallel text are English-to-LCTL translations of a "Common Subset" corpus, which will be included in all additional LCTL Language Packs.


• North American News Text Corpus (NANT) Reissue - as a companion to BLLIP 1994-1997 News Text Release 1, LDC will reissue the North American News Text Corpus (LDC95T21). Data includes news text articles from several sources (L.A.Times/Washington Post, Reuters General News, Reuters Financial News, Wall Street Journal, New York Times) that have been formatted with TIPSTER-style SGML tags to indicate article boundaries and organization of information within each article. Two versions will be released: a complete 'Members-Only' version which contains all previously released NANT articles and a 'Non Member' version for general licensing which includes all news text except data from the Wall Street Journal.


As a reminder, MY2007 will remain open for joining through December 31, 2008 and MY2008 through December 31, 2009. Take note that some of our current discounts on Membership Fees will no longer be effective after March 1, 2008.



50,000th LDC Corpus Distributed!

Last year marked the LDC's 15th Anniversary Year and it proved to be an exciting one for the LDC. We commemorated this anniversary with a Fidelity Celebration which rewarded our loyal members who continually support the consortium through membership. Additionally, we provided our listserv readers with a glimpse into the research activities at the LDC through each of our monthly Spotlights.

At the very end of our anniversary year, the LDC observed another significant milestone: the distribution of our 50,000th publication! This corpus was licensed by Helsinki University of Technology, Adaptive Informatics Research Centre (AIRC). AIRC's research includes basic algorithmic analysis, multimodal interfaces (speech, vision and language), bioinformatics, neuroinformatics and computational cognitive systems. In appreciation, the LDC is offering Helsinki University of Technology a US$2000 benefit to be used towards membership or data licensing fees.

We would like to thank both members and nonmembers for helping the LDC reach this landmark distribution. Your persistent demand for LDC data supports our mission to develop and share resources for research in human language technologies.



Membership Fee Increases and Discounts

Effective January 1, 2008, the LDC will be raising membership fees for the first time in fifteen years. Note that the new fee structure rewards those members who keep their membership current and members who join early in the year through discounts on membership. The details are as follows:

  • Organizations that joined for Membership Year (MY) 2007 will receive a 5% discount when renewing. This discount will apply throughout 2008, regardless of time of renewal. MY 2007 members renewing before March 1, 2008 will receive a 10% discount when renewing.

  • New members as well as members who did not renew for MY 2007, but who held membership in any of the previous MY's (1993-2006), will also be eligible for a 5% discount provided that they join/renew before March 1, 2008.

Please consult the Membership Fee Table for exact pricing information.


                  New 2008 Fee    New Fee with 5% Discount    New Fee with 10% Discount
Not-for-Profit
  Standard        US$2400         US$2280                     US$2160
  Subscription    US$3850         US$3657.50                  US$3465
For-Profit
  Standard        US$24000        US$22800                    US$21600
  Subscription    US$27500        US$26125                    US$24750
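
For readers who want to check which fee applies to them, the discount rules and fee table above can be sketched in a few lines of Python. This is only an illustration, not an official LDC calculator: the function and variable names are hypothetical, the join or renewal date is assumed to fall within 2008, and "before March 1, 2008" is read as strictly earlier than that date.

```python
from datetime import date
from decimal import Decimal

# Base MY2008 fees in US dollars, copied from the fee table above.
BASE_FEES = {
    ("not-for-profit", "standard"): Decimal("2400"),
    ("not-for-profit", "subscription"): Decimal("3850"),
    ("for-profit", "standard"): Decimal("24000"),
    ("for-profit", "subscription"): Decimal("27500"),
}

# Assumption: "before March 1, 2008" means strictly earlier than this date.
EARLY_DEADLINE = date(2008, 3, 1)


def discount_rate(was_my2007_member: bool, join_date: date) -> Decimal:
    """Return the discount rate as a fraction (hypothetical helper).

    MY2007 members: 10% before the deadline, 5% for the rest of 2008.
    New or lapsed members: 5% before the deadline, otherwise none.
    """
    early = join_date < EARLY_DEADLINE
    if was_my2007_member:
        return Decimal("0.10") if early else Decimal("0.05")
    return Decimal("0.05") if early else Decimal("0")


def discounted_fee(org_type: str, membership: str,
                   was_my2007_member: bool, join_date: date) -> Decimal:
    """Apply the discount rate to the base fee for the given category."""
    base = BASE_FEES[(org_type, membership)]
    return base * (1 - discount_rate(was_my2007_member, join_date))


if __name__ == "__main__":
    # A MY2007 not-for-profit member renewing a standard membership in February.
    print(discounted_fee("not-for-profit", "standard", True, date(2008, 2, 1)))
```

The Membership Fee Table remains the authoritative source for pricing; the sketch only reproduces its arithmetic.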

Additional points:

  • LDC Online Membership fees will remain the same.

  • MY 2006 is now closed for joining; MY 2007 will remain open through December 31, 2008.

  • Organizations may join for MY's in advance.

Should you require any additional information, please contact our Membership group by e-mail at ldc@ldc.upenn.edu or by phone at +1 (215) 573-1275.



15th Anniversary Fidelity Celebration

As promised in our June 2007 newsletter, the LDC is holding a Fidelity Celebration in honor of our 15th Anniversary. We want to thank our members for their commitment in establishing and supporting our consortium and we have decided that rewarding your loyalty is the best way to do it!

Eligibility - Organizations that have been consecutive members of the LDC since at least 2005 (through 2007 inclusive) are eligible for benefits that can be used for corpora purchases, reduced license fees, extra copy fees or membership discounts; it is entirely up to you!

Here’s how it works:

* Any organization that has been a member for 3-4 consecutive years (2004, 2005 through 2007) is eligible to receive a $250 benefit

* Non-Profit organizations that have been members for 5-9 consecutive years (1999 - 2003 through 2007) are eligible to receive a $500 benefit while For-Profit organizations are eligible to receive a $1500 benefit

* Non-Profit organizations that have been members for 10-15 consecutive years (1993 – 1998 through 2007) are eligible to receive a $3500 benefit while For-Profit organizations that have been members for 10-15 consecutive years (1993 – 1998 through 2007) are eligible to receive a $7500 benefit

Notification and Terms – The primary contacts at each qualifying organization were notified on June 20, 2007 via email. One benefit award will be made for every 20 organizations in each group. Therefore, if there are 23 members who have been consecutive members for 3-4 years, 1 prize will be awarded. If 49 members are eligible, 2 prizes will be awarded, etc. The blind drawing will be held on July 2, 2007, and winners will be immediately notified. The benefits are awarded to the member organization as a whole and must be used during calendar year 2007 on purchases made after notification. All benefits expire on December 31, 2007.
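
The prize count described above amounts to whole-number division by 20. A minimal sketch follows, with a hypothetical function name and the assumption (not stated above) that a group with fewer than 20 eligible members receives no award:

```python
def prizes_awarded(eligible_members: int, group_size: int = 20) -> int:
    """Number of benefit awards for one eligibility group.

    One award per full block of `group_size` eligible organizations.
    Assumption: a partial block (including a group smaller than 20)
    earns no additional award.
    """
    if eligible_members < 0:
        raise ValueError("eligible_members must be non-negative")
    return eligible_members // group_size


if __name__ == "__main__":
    # The two cases given in the terms above: 23 and 49 eligible members.
    for count in (23, 49):
        print(count, prizes_awarded(count))
```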

Redemption - In order to redeem your benefit, please notify the Membership Coordinator, Ilya Ahtaridis, at the time of your order at ldc@ldc.upenn.edu. All other Fidelity Celebration concerns may be directed to Marian Reed at mreed@ldc.upenn.edu.

Thank you for your continued support from all of us at the LDC!


Celebrating 15 Years of Supporting the Language Technology Community

April 15, 2007 marks the start of the LDC's 15th Anniversary year! We have many milestones to celebrate, including the growth of our staff to over 40 full-time employees and an online catalog that includes over 350 linguistic databases. Since 1992, no fewer than 2,300 organizations from over 80 different nations have licensed LDC data.

Numbers aside, it is essential to note how greatly the LDC has evolved while still adhering to our goal to share language-technology resources. Our mission has grown to include linguistic data collection and annotation for an increasing number of areas of language research and engineering, as well as the development of language-related standards and tools. By collecting and creating data that we distribute, the LDC remains responsive to the changing needs of the research community that it has supported for fifteen years.

In each of our monthly newsletters, we will highlight one aspect of the LDC - from our work in human subject collections, to our progress in Arabic treebanking, to the technical challenges of collecting and storing high volumes of broadcast news.

As we celebrate throughout the year, look for new membership offerings and announcements. And be sure to join us as we count down to the much anticipated distribution of our 50,000th publication.


LDC Membership Options



Contact ldc@ldc.upenn.edu
Last modified: Tuesday, 22-Apr-2008 12:19:52 EDT
© 1992-2007 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.