March 2010 Newsletter

Wednesday, March 17, 2010

New Corpora

NPS Internet Chatroom Conversation, Release 1.0



Membership Mailbag: Using LDC data

LDC's Membership office responds to thousands of emailed queries a year, and, over time, we've noticed that some questions tend to crop up with regularity.  To address the questions that you, our data users, have asked, we'd like to continue our Membership Mailbag series of newsletter articles.  This month, we'll review commonly asked questions about using LDC data, with an emphasis on handling audio files.

LDC distributes corpora in two ways: on CD-ROMs or DVD-ROMs shipped to users, and as gzip-compressed tar files (tgz) made available through our Intranet. Generally, corpora smaller than 250 MB are distributed via web download. These are typically text-only, such as transcriptions or lexicons. Larger text and speech corpora are distributed on CD-ROM or DVD-ROM. Data users can consult the Using page for further information about dealing with GNU tar files and other compressed data. This page also provides some basic information about the formatting of our text corpora. Since formatting can vary greatly among text corpora, each LDC text corpus includes detailed documentation about its text format.
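For users who prefer to script the unpacking step, a downloaded tgz file can also be opened with Python's standard tarfile module. The following is only a minimal sketch: the function name unpack_corpus and the paths in the usage note are illustrative assumptions, not part of any LDC tool.

```python
import tarfile
from pathlib import Path


def unpack_corpus(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names.

    Hypothetical helper for illustration; not an LDC-provided utility.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar file (.tgz / .tar.gz) for reading.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()  # list the contents before extracting
        tar.extractall(dest)
    return names
```

A call such as unpack_corpus("corpus.tgz", "corpus_data") would unpack the archive into corpus_data and return the list of files it contained; both names here are placeholders.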

Nearly all LDC speech corpora are published with the speech files in NIST SPHERE format; this involves a simple, flexible and self-describing file header followed by the raw sample data. The header provides important information, in human-readable text, about the speech data in the file, such as the number of samples, the sampling rate, the number of channels, and the kind of sample encoding, as well as whether the speech data are compressed or not. SPHERE files can be manipulated on most UNIX systems using software provided by NIST.  For users of other operating systems, LDC provides two programs that will convert SPHERE files to other formats:

    sph_convert_2.1: For converting lots of files at once. Suitable for Windows systems. Makes batch conversion at corpus level simpler, but provides less flexibility and control.

    sph2pipe_v2.5: For converting one file at a time. Provides more flexibility and control and is suitable for use on all operating systems (Windows, Linux, MacOS X, etc.). Its simple command-line interface efficiently supports a wide range of options for batch processing and program control of file conversions.
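Before converting anything, it can be useful to look at the SPHERE header itself, since it is plain text. The sketch below assumes the usual header layout: a first line reading NIST_1A, a second line giving the header size in bytes, then one "name -type value" line per field, ending with end_head. The function name parse_sphere_header is our own illustration, and only the integer (-i), real (-r) and string (-sN) field types are handled.

```python
def parse_sphere_header(data):
    """Parse the text header at the start of a NIST SPHERE file.

    `data` holds the leading bytes of the file. Returns a tuple
    (header_size, fields), where fields maps field names to values.
    """
    lines = data.split(b"\n", 2)
    if len(lines) < 3 or lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())  # total header length in bytes
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip anything that is not "name -type value"
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # string types such as -s3 or -s4
    return header_size, fields
```

Headers are commonly 1024 bytes, so passing the first few kilobytes of a file is usually enough. With the fields in hand, dividing sample_count by sample_rate gives the duration in seconds, and sample_coding shows whether the data are stored as PCM, u-law, or a compressed variant.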

Another, more powerful tool for waveform file conversion is SoX, a cross-platform command-line utility that can convert audio files from one format to another and can also check the sampling rate and sample format of the audio content.
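As a rough illustration of batch conversion under program control, the sketch below builds one conversion command per file for either sph2pipe or SoX. Several things here are assumptions: the helper names are hypothetical, the input files are assumed to end in .sph, the sph2pipe options (-p for 16-bit PCM output, -f wav for a WAV header) are given as we understand them from that tool's documentation, and SoX can read SPHERE files only if the installed build includes that format. The commands are only constructed, not run, so they can be inspected first.

```python
from pathlib import Path


def sph2pipe_command(src, dst, exe="sph2pipe"):
    # Assumed options: -p forces 16-bit PCM, -f wav selects WAV output.
    return [exe, "-p", "-f", "wav", str(src), str(dst)]


def sox_command(src, dst, rate=None, exe="sox"):
    # SoX infers formats from the file extensions; an option placed
    # before the output file name (here -r) applies to the output.
    cmd = [exe, str(src)]
    if rate is not None:
        cmd += ["-r", str(rate)]
    cmd.append(str(dst))
    return cmd


def batch_commands(corpus_dir, out_dir, make_command=sph2pipe_command):
    """Return one conversion command per .sph file under corpus_dir,
    mirroring the corpus directory layout under out_dir."""
    corpus = Path(corpus_dir)
    out = Path(out_dir)
    commands = []
    for src in sorted(corpus.rglob("*.sph")):
        dst = out / src.relative_to(corpus).with_suffix(".wav")
        commands.append(make_command(src, dst))
    return commands


# To actually run the conversions (paths are placeholders):
#   import subprocess
#   for cmd in batch_commands("corpus", "wav_out"):
#       Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
#       subprocess.run(cmd, check=True)
```

Passing make_command=sox_command switches the same loop over to SoX. Running "sox --i" on a file prints its sampling rate and sample format without converting it.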

Got a question about LDC data? Forward it to LDC's Membership office. The answer may appear in a future Membership Mailbag article.