
Linguistic Resources  
Membership Mailbag Archive

LDC's membership office responds to thousands of emailed queries a year, and, over time, we've noticed that some questions tend to crop up with regularity. To address the questions that our data users have asked, we introduced our Membership Mailbag series of newsletter articles in our May 2008 newsletter. This periodic series answers frequently asked questions about LDC data, the LDC Intranet, and the benefits of an LDC membership.


Using LDC Data - March 17, 2010


This month, we'll review commonly asked questions about using LDC data, with an emphasis on handling audio files.

The LDC distributes corpora in two ways: on CD- or DVD-ROMs shipped to users, and as gzip-compressed tar files (tgz) made available through our Intranet. Generally, corpora smaller than 250 MB are distributed via web download. These are typically text-only, such as transcriptions or lexicons. Larger text and speech corpora are distributed on CD- or DVD-ROM. Data users can consult the Using page for further information about dealing with GNU tar files and other compressed data. That page also provides some basic information about the formatting of our text corpora. Since formatting can vary greatly among text corpora, each LDC text corpus includes detailed documentation of its text format.
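For users who prefer to script the unpacking of a web-download corpus, the following is a minimal sketch using Python's standard tarfile module. The function name extract_corpus and the file names in the usage note are hypothetical, chosen for illustration; the Using page remains the reference for handling LDC archives.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir):
    """Unpack a gzip-compressed tar file and return the extracted file paths.

    Hypothetical helper for illustration; it is not an LDC-provided tool.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Refuse members that would be written outside the destination.
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        # Newer Python versions offer a stricter extraction filter; use it
        # when present and fall back to plain extraction otherwise.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    # List regular files relative to the destination, in sorted order.
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

A call such as extract_corpus("corpus.tgz", "corpus_data") would unpack the archive into corpus_data and list what was extracted, which makes it easy to locate the corpus documentation.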

Nearly all LDC speech corpora are published with the speech files in NIST SPHERE format: a simple, flexible and self-describing file header followed by the sample data. The header provides important information, in human-readable text, about the speech data in the file, such as the number of samples, the sampling rate, the number of channels, and the kind of sample encoding, as well as whether the speech data are compressed. SPHERE files can be manipulated on most UNIX systems using software provided by NIST. For users of other operating systems, LDC provides two programs that will convert SPHERE files to other formats:

  • sph_convert_2.1: For converting many files at once. Suitable for Windows systems. Makes batch conversion at corpus level simpler, but provides less flexibility and control.

  • sph2pipe_v2.5: Converts one file per invocation. Provides more flexibility and control and is suitable for use on all operating systems (Windows, Linux, MacOS X, etc.). Its simple command-line interface supports a wide range of options and lends itself to scripted batch processing and program control of file conversions.
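Because sph2pipe converts one file per invocation, batch conversion is usually scripted around it. The following is a rough sketch in Python, assuming that sph2pipe is installed and on the PATH, that the speech files use a .sph extension (some corpora use other extensions), and that WAV is the desired output format. The helper names sph2pipe_command and convert_tree are hypothetical.

```python
import subprocess
from pathlib import Path


def sph2pipe_command(sph_path, wav_path, executable="sph2pipe"):
    """Build the command line that asks sph2pipe for WAV output."""
    return [executable, "-f", "wav", str(sph_path), str(wav_path)]


def convert_tree(src_dir, dst_dir, executable="sph2pipe", dry_run=False):
    """Convert every .sph file under src_dir, mirroring the layout in dst_dir.

    With dry_run=True nothing is executed; the planned commands are returned
    so they can be inspected first.
    """
    src_root, dst_root = Path(src_dir), Path(dst_dir)
    commands = []
    for sph in sorted(src_root.rglob("*.sph")):
        wav = dst_root / sph.relative_to(src_root).with_suffix(".wav")
        cmd = sph2pipe_command(sph, wav, executable)
        commands.append(cmd)
        if not dry_run:
            wav.parent.mkdir(parents=True, exist_ok=True)
            # check=True raises CalledProcessError if a conversion fails.
            subprocess.run(cmd, check=True)
    return commands
```

Running convert_tree with dry_run=True first is a cheap way to confirm that the expected files are found before any conversion starts.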

Another, more powerful tool for waveform file conversion is the SoX utility maintained at SourceForge.net. SoX is a cross-platform command-line utility that can convert audio files between various formats, as well as check the sampling rate and sample format of the audio content.
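Since the SPHERE header described above is human-readable text, it can also be inspected directly before choosing a conversion tool. The sketch below assumes the usual NIST layout as we understand it: a first line reading NIST_1A, a second line giving the header size in bytes, then one "name -type value" line per field until an end_head line. The function name read_sphere_header is hypothetical, and the NIST documentation is the authority on the format.

```python
def read_sphere_header(path):
    """Return (header_size, fields) parsed from a NIST SPHERE file header."""
    with open(path, "rb") as f:
        magic = f.readline().strip()
        if magic != b"NIST_1A":
            raise ValueError("not a NIST SPHERE file")
        # The second line holds the total header size in bytes.
        header_size = int(f.readline().decode("ascii").strip())
        f.seek(0)
        header = f.read(header_size).decode("ascii", errors="replace")

    fields = {}
    # Skip the two fixed lines, then read fields until end_head.
    for line in header.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # blank or malformed line
        name, type_code, value = parts
        if type_code == "-i":        # integer field
            fields[name] = int(value)
        elif type_code == "-r":      # real-valued field
            fields[name] = float(value)
        else:                        # string field, e.g. -s3
            fields[name] = value
    return header_size, fields
```

The returned dictionary would typically include entries such as sample_count, sample_rate, channel_count and sample_coding, which correspond to the items listed above; the sample data begin at byte offset header_size.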


 

Is LDC Membership a Good Deal? - October 20, 2009


This month, we'll consider the reasons why LDC membership remains a smart use of funding.

LDC's extensive catalog is unmatched. It spans 17 years and includes over 400 multilingual speech, text, and video resources. LDC membership is an economical way to acquire multiple datasets from our catalog. In 2008, membership fees for university and research organizations were less than 9% of the total non-member cost of acquiring each 2008 publication separately. So, even if an organization only needs a few datasets from a given membership year, membership may be the most cost-effective way to obtain these corpora. Additionally, the generous discounts that member organizations receive on older corpora reduce the cost of acquiring such datasets.

LDC data has unlimited cross-departmental use within a university or organization. Since there is no difference in cost between a departmental membership and one that is organization-wide, departments can combine resources and establish one LDC membership for use by the entire organization. This helps university departments with smaller research budgets find more data within their reach.

An organization's license to LDC data is perpetual. Data acquired by LDC members during their membership years can be used anytime thereafter. If a university researcher had established a membership in 1999, a student at that university in 2009 can use the same membership data without additional cost. If a member's research focus or staff changes, that member can still take advantage of free data from the particular membership year(s).


Navigating the LDC Intranet Part 2 - May 22, 2009

Last month we focused on a few features of the LDC Intranet including establishing an account and using that account to access information about your organization's history with LDC. This month, we'll take a look into using your account to access password-protected corpora and resources.

LDC's Intranet contains the following links:

  • User
  • Customer Profile
  • LDC Online
  • Corpora Available for Download

The LDC Online and Corpora Available for Download sections. After registering for an LDC Intranet account, users can access LDC Online both through the LDC Intranet and through the LDC Online page on LDC's website. LDC Online contains an indexed collection of Arabic, Chinese and English newswire text, millions of words of English telephone speech from the Switchboard and Fisher collections and the American English Spoken Lexicon, as well as the full text of the Brown corpus.

To download corpora that your organization has licensed, visit the Corpora Available for Download section. This section contains all web-download corpora the organization has licensed, with the most recently invoiced requests listed first. Any registered user of an organization can utilize the web-download service at any time to view and access the corpora that have been invoiced for delivery over the web. This section will not contain all corpora that an organization has licensed, only those small enough for web-download.

Recently, LDC has made available for web-download some popular resources which were previously distributed only on disc. These resources include TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), CELEX2 (LDC96L14), and Treebank-3 (LDC99T42). If an organization has obtained a license to any of these resources, registered users can simply log in to download the data, thereby eliminating the need to locate the copy on disc or license a new copy.


Navigating the LDC Intranet Part 1 - April 16, 2009

This month we will focus on a few features of the LDC Intranet including establishing an account and using that account to access information about your organization's history with LDC. Next month, we'll take a look into using your account to access password-protected corpora and resources.

LDC's Intranet contains the following links:

  • User
  • Customer Profile
  • LDC Online
  • Corpora Available for Download

The User and Customer Profile sections. Anyone can sign up for a 'guest account' to the LDC Intranet either through the Member Resources page or through the LDC Online page on LDC's website. When signing up for an account, you'll be asked to select your organization affiliation from a list of over 2700 organizations that have licensed data from LDC. If your organization doesn't appear on the list, you can register under the organization 'Guest'. Once your account is established under an organization name, the organization administrator for that account receives an automated email which requests that the administrator verify your organization affiliation and change your account permission from 'guest' to 'org_user'.

As an 'org_user', you can access more information about your organization and, generally, more data. If you are receiving this email, then you already have an account to the LDC Intranet. Don't recall signing up for an account? If you have licensed data from LDC in the past, then an account was automatically created for you. Account holders should use the User link to update their contact and log-in information.

After your account is established and verified, you can next view information about your organization through the Customer Profile link. The Customer Profile shows the 'Primary Contact' at each organization - for organizations which are LDC members, that contact is for membership and data inquiries; for non-member organizations the 'Primary Contact' is the first person to have licensed data under that organization name. If your organization is an LDC member, the Customer Profile will list which years your organization has held membership under the 'Membership Year(s)' section. The Customer Profile also shows which data your organization has licensed under the 'Catalog Information' section. For member organizations, the profile will list all corpora which are included in their Membership Year(s). If a corpus has been requested, then the profile will indicate who requested the corpus and when.

In the next newsletter, we'll look at using an LDC Intranet account to access password-protected corpora and resources, as we review the LDC Online and Corpora Available for Download sections.

Commercial Use of LDC Data - March 17, 2009

This month we will focus on commercial rights to LDC data, with an emphasis on the LDC For-Profit membership.

To help clarify commercial use of LDC data, let's look at a few examples in which a commercial organization licenses LDC data. In the first scenario, a company, TryFirst JoinLater LLC., licenses data as a non-member. At this point, the company is not an LDC member and cannot use LDC data for any commercial purpose. Some years later, TryFirst JoinLater decides to join LDC as a For-Profit member. Do they now have commercial rights to the data licensed as a non-member? Yes, by joining the LDC, TryFirst JoinLater gains commercial rights to any data already licensed, unless those rights are otherwise restricted by a corpus-specific user license. In short, a commercial organization can first license data as a non-member for research purposes and then join LDC to gain commercial rights to that data.

Second scenario. Another company, Join Only Once, Ltd., decides to join LDC as a For-Profit Member for Membership Year 2009. What data will this company be able to use for commercial purposes? As a 2009 member, Join Only Once will gain commercial rights to data from the year that they have joined, that is, Membership Year 2009, unless otherwise restricted by a corpus-specific user license. Furthermore, while a member of the current year, Join Only Once can license data for commercial use from the closed Membership Years (1993-2007) at the Reduced Licensing Fee. Join Only Once, Ltd. retains ongoing commercial rights to data it licenses as a For-Profit member. Fast forward a few years - Join Only Once has not renewed their LDC membership but they would like to obtain some additional data not from their Membership Year. If Join Only Once does not renew their LDC membership, they will not have a commercial license to any new data obtained after their Membership Year has ended.

Which leads us to our final scenario. A third company, Best LDC Member Ever! Corporation, has been a For-Profit LDC member since our inception in 1992. Does this company have commercial rights to all LDC data? No, there are a few caveats to note. All members are reminded to consult corpus-specific license agreements for limitations, including commercial restrictions, on the use of certain corpora. In the case of a small group of corpora that includes American National Corpus (ANC) Second Release (LDC2005T35), Buckwalter Arabic Morphological Analyzer Version 2.0 (LDC2004L02) and all CSLU corpora, commercial licenses must be obtained separately from the owners of the data. A full list of corpus-specific user licenses can be found on our License Agreements page.


LDC Corpus Catalog Naming Conventions - December 19, 2008

This month we will focus on LDC corpus catalog naming conventions, specifically on distinguishing among corpora with similar names.

The LDC Corpus Catalog contains corpora which have been used in research projects, in some cases including benchmark tests carried out under government sponsorship. These range from data from older projects such as TIPSTER and HUB4 to data from current programs like ACE and GALE. Additionally, our catalog contains data donated to the LDC by a sizable group of corpus authors from organizations around the globe. This varied origin of LDC data can make for a potentially confusing array of corpus catalog names. However, a few general rules are observed to differentiate among corpora with similar names.

The corpus names for data from one collection or research effort that contains unique data will include terms such as 'part', 'phase', 'set', and 'volume'. For example, Arabic Treebank: Part 1 v 3.0 contains different source data than Arabic Treebank: Part 4 v 1.0. Likewise, Fisher English Training Speech Part 1 Speech contains different data than Fisher English Training Part 2, Speech. Occasionally, the catalog will contain a database such as Levantine Arabic QT Training Data Set 5, Speech which has a corresponding Set 4 and Set 3, but not a Set 2 or Set 1. In such cases, the earlier data may have only been released for evaluation purposes or may have been incorporated into later publications.

The corpus names of enhanced data sets or those with previously unreleased data will include 'version', 'release', 'edition', or simply a number such as '2.0'. For example, English Gigaword Third Edition contains all of the data from both English Gigaword Second Edition and English Gigaword. Likewise, Chinese Treebank 6.0 should be considered an update of all previous Chinese Treebanks - additional data has been included and known errors corrected. Not all newer corpora will have a corresponding earlier corpus in the LDC catalog, an example being Switchboard-1 Release 2. Such earlier data sets may not have been available for general licensing through the LDC or may have been completely superseded by a later corpus.
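The two rules above can be approximated with a simple keyword check. The following Python sketch is only a heuristic illustration of those rules, not a tool used by LDC; the function name classify_corpus_name and the labels 'distinct-data' and 'revision' are invented for this example. A name can match both rules, as Arabic Treebank: Part 1 v 3.0 does.

```python
import re

# Terms that signal unique data within one collection or research effort.
DISTINCT_DATA_TERMS = {"part", "phase", "set", "volume"}
# Terms that signal an enhanced or updated data set.
REVISION_TERMS = {"version", "release", "edition"}


def classify_corpus_name(name):
    """Return the labels suggested by the keywords found in a corpus name."""
    words = set(re.findall(r"[a-z]+", name.lower()))
    labels = []
    if words & DISTINCT_DATA_TERMS:
        labels.append("distinct-data")
    # A bare number such as '2.0' also marks a revision.
    if words & REVISION_TERMS or re.search(r"\b\d+\.\d+\b", name):
        labels.append("revision")
    return labels
```

For instance, classify_corpus_name("Fisher English Training Part 2, Speech") returns ['distinct-data'], while classify_corpus_name("Chinese Treebank 6.0") returns ['revision']. A name with no matching term, such as English Gigaword, returns an empty list, so the catalog entry itself should always be consulted.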


Early Broadcast News Evaluation Efforts - October 17, 2008

This month, we'll focus on early HUB4 broadcast news evaluation corpora and their continued relevance to current speech recognition research.

The HUB4 evaluations, administered by the Defense Advanced Research Projects Agency (DARPA) in the mid-1990s, were part of a research program for continuous speech recognition which focused on automatic transcription of broadcast news. The 1996 evaluation represented the first attempt to utilize 'found' speech as opposed to 'elicited' speech. Earlier evaluations relied largely on recordings of human subjects reading journalistic text or supplying a new report based on the text - in other words, speech created specifically for purposes of the evaluation. The 1996 HUB4 studies employed speech that occurred naturally in daily use by focusing on recordings from broadcast news sources such as ABC, CNN, and C-SPAN. These recordings made the test data representative of actual 'real-world' conditions.

The 1996 HUB4 data in the LDC catalog consists of 3 hours of development data and 2.5 hours of evaluation data distributed as LDC97S66, and approximately 100 hours of training data distributed as LDC97S44. LDC97T22 contains the corresponding transcripts for both publications. The 1997 HUB4 data includes an additional 97 hours of training data distributed as LDC98S71 with corresponding transcripts, LDC98T28. Researchers interested in benchmarking against the evaluations should compare their results to the 1997 evaluation, since sites participating in the 1996 evaluation were given access only to a 50-hour subset of training data. Anonymized results can be obtained by contacting the NIST Speech Group.

The LDC has released additional, larger broadcast news collections as part of the Topic Detection and Tracking (TDT) project. However, because the focus of the TDT project was information mining and retrieval, the earlier HUB4 collections remain ideal for continuous speech recognition research in the broadcast domain. HUB4 reference transcripts include detailed information on recording conditions and speaker characteristics; they are manually timestamped at the utterance level and are generally more accurate than those used for TDT.

You can view all of our HUB4 data collections, including resources in Spanish and Mandarin Chinese, by searching for 'HUB4' on the LDC catalog search page.  Further information is also available at the Broadcast News Recognition Evaluation website from the NIST Speech Group.


What is a Reduced Licensing Fee? - June 16, 2008

This month, we'll elaborate on the 'What is a Reduced Licensing Fee?' question on our Members FAQ.

One of the benefits of being a member in the current Membership Year (MY 2008) is the ability to license LDC data from closed MYs at a discount, termed the 'Reduced Licensing Fee'. This fee is listed on each catalog page and is generally 50% of the Non-member Fee. The Reduced Licensing Fee does not apply to corpora from the previous MY (MY 2007), since that year is still open for joining. One common misinterpretation is that the Reduced Licensing Fee is discounted pricing for non-commercial organizations. However, non-members, including educational and not-for-profit organizations, must pay the full Non-member Fee for any data they license.

A few of our corpora are only available to members of the LDC. Organizations that have joined MY 2008 can still license such 'Members Only' corpora even if they did not join the MY(s) that apply to the corpora. Since Members Only corpora do not have corresponding Non-member Fees, their Reduced Licensing Fees are calculated based on corpus size: US$200 per DVD-ROM, US$150 per CD-ROM, and US$100 per web download. This pricing information is listed at the bottom of each Members Only catalog entry.
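The fee rules above amount to simple arithmetic, sketched here in Python. This is an illustration only: the function name reduced_licensing_fee and its parameters are hypothetical, the 50% figure is the general rule rather than a guarantee, and the fee listed on each catalog page is always authoritative.

```python
# Per-unit fees for Members Only corpora, in US dollars, as stated above.
MEMBERS_ONLY_FEES_USD = {"dvd": 200, "cd": 150, "web": 100}


def reduced_licensing_fee(non_member_fee=None, media=None):
    """Estimate a Reduced Licensing Fee in US dollars.

    non_member_fee: the listed Non-member Fee, for ordinary corpora.
    media: for Members Only corpora, a mapping from 'dvd', 'cd' or 'web'
           to the number of discs or downloads in the corpus.
    """
    if non_member_fee is not None:
        # General rule: half of the Non-member Fee.
        return 0.5 * non_member_fee
    if not media:
        raise ValueError("provide a Non-member Fee or a media breakdown")
    # Members Only corpora have no Non-member Fee; price by corpus size.
    return sum(MEMBERS_ONLY_FEES_USD[kind] * count for kind, count in media.items())
```

Under these assumptions, a corpus with a Non-member Fee of US$1000 would come to US$500, and a Members Only corpus shipped on two DVD-ROMs and one CD-ROM would come to US$550.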



'Penn' Treebanks and Recent Directions in English Treebanking - May 19, 2008

 

This month we will look into the differences between the 'Penn' Treebanks and review recent directions in English treebanking.

Treebank-2 and Treebank-3 both contain 1 million words of Wall Street Journal (WSJ) text and a small sample of ATIS-3 data that have been annotated using the Treebank II annotation-style, plus a part-of-speech tagged version of the Brown corpus. Treebank-3 is considered a superset of Treebank-2. That is, if you are undecided between Treebank-2 and -3, in most instances the best choice would be Treebank-3. Treebank-3 corrects known technical errors in Treebank-2; it also contains Switchboard data which has been tagged and dysfluency-annotated, and a small portion of the Brown corpus which has been parsed in the Treebank II annotation-style.

Note, however, that a few items found in Treebank-2 are missing from Treebank-3. Treebank-2 contains the complete parsed Brown corpus, done in the older Treebank I annotation-style; Treebank-3 does not. Also, Treebank-3 does not include the tgrep software for extracting data, but tgrep and a newer version, tgrep2, are freely available online. Finally, Treebank-3 does not contain the raw Wall Street Journal (WSJ) text, but organizations can obtain this by request.

Much recent treebanking has focused on languages other than English, but English treebanking efforts did not come to an end with the release of Treebank-3. Ongoing work uses an updated Treebank II annotation-style and consists of two types of annotation: straight treebanking, and treebanking in combination with another kind of annotation. Straight treebank annotation can be found in corpora such as English Chinese Translation Treebank v 1.0 and English-Arabic Treebank v 1.0. In these corpora, the Chinese or Arabic source texts have been translated into English, then POS-tagged and treebanked, thus making them suitable for machine translation work as well. Additional translation treebanks are planned for release and will feature cleaner translation and contain substantially more data.

Corpora which combine treebanking with another type of annotation include the English Conversational Telephone Speech Treebank with Structural Metadata, to be released later this year. This treebank is annotated for structural metadata, including fillers, disfluencies and sentence/semantic units, and is also tagged for syntactic structure, so it supports evaluating the impact of metadata extraction (MDE) on parsing. While these newer releases are smaller than the Penn Treebanks, the improved Treebank II annotation-style has a very high rate of inter-annotator agreement. Additionally, the source texts are more varied in both domain and style than the WSJ texts that constitute the bulk of the Penn Treebank.

 




Contact ldc@ldc.upenn.edu
Last modified: Wednesday, 17-Mar-2010 14:17:14 EDT
© 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.