
Linguistic Resources
Using LDC Data

The LDC distributes corpora in two ways: CD- or DVD-ROMs shipped to users, and UNIX TAR files made available to users via web download through the LDC Intranet. CD-ROMs and DVD-ROMs contain text and/or speech data, presented in a machine-independent directory structure (ISO 9660); these discs are the only practical media for data collections involving hundreds or thousands of megabytes (MB).

TAR files via web download are only practical for relatively small data sets (250 MB or less); these are typically text-only, such as transcriptions or lexicons. (Sometimes, we produce CD-ROMs that contain TAR files.)

Whether the distribution involves speech or text, and whether it arrives on CD-ROM or via web download, we often apply data compression to make the package smaller and easier to distribute. This page describes tools and procedures helpful for handling the data distributed by the LDC.

Extraction from TAR Files

Software to uncompress and unpack a UNIX TAR file is available for the most common computer systems (Microsoft DOS/Windows and Macintosh). If you have a non-UNIX machine and you don't yet have a program to handle TAR files, search the web (www.gnu.org, www.simtel.net, www.aladdinsys.com are fairly stable sources). Some sources provide free downloads, others provide "shareware" (try now, pay later), and a few try to do a regular retail business with modest purchase prices.

GNU Software for MS-Windows and MS-DOS on CD-ROM is now also available upon request from the LDC, and we highly recommend this package for simplicity and ease of use. Another good (free) source that provides a fairly complete UNIX/Linux emulation package for MS-Windows is Cygwin: www.cygwin.com

The instructions to follow on non-UNIX machines will depend on which "brand" of software is being used -- read the manual. (Don't worry, Mac and Wintel applications usually try very hard to make the process as simple as possible.)

UNIX users will tend to find the GNU version of "tar" to be most convenient (this is what comes standard on Linux systems); with this version, any compressed TAR file can be unpacked as follows:

	tar  xzf  LDC_corpus.tgz

The string "xzf" is a command-line argument that tells the tar program:

  • x -- extract data
  • z -- apply uncompression in the process
  • f -- use the next string on the command line as the name of the file to read/extract from
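
To preview an archive's contents without extracting anything, replace "x" with "t" (list the table of contents):

	tar  tzf  LDC_corpus.tgz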

UNIX users who are stuck with some other version of tar would need to complicate the process slightly (the Wintel version of GNU tar must also use this approach):

	gunzip  -c  LDC_corpus.tgz  |  tar  xf  -

Note the essential dash (" - ") that follows the "tar xf ". (Sorry, but you will need to install the GNU compression utilities gzip and gunzip, if you don't have them already. Might as well get GNU tar while you're at it.)

Text Files

SGML Formatting

Most of our very large text collections (TIPSTER, AQUAINT, TDT, Newswire and Gigaword Corpora) consist of streams of "documents" with SGML (Standard Generalized Markup Language) tags -- similar in nature to HTML and XML. The tags serve to mark document boundaries (e.g. individual news stories), as well as important components or divisions within each document (unique document identifiers, headlines, paragraph boundaries, etc). The names and organization of tags will vary from one corpus to the next, or from one data source to another within a single corpus, but a couple of fundamental rules for markup are consistently observed across all corpora of this sort: the <DOC> tag always marks document boundaries, and the <TEXT> tag is always found within each document to mark the beginning and end of actual text content (separating this from "meta-data" that often accompanies the text, such as headlines and bylines in newswire stories).
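
As a rough sketch (the exact tag inventory varies; DOCNO and HEADLINE here are merely illustrative placeholders), a document stream looks something like this:

	<DOC>
	<DOCNO> XYZ19980101.0001 </DOCNO>
	<HEADLINE> ... </HEADLINE>
	<TEXT>
	<P> First paragraph of the story. </P>
	<P> Second paragraph. </P>
	</TEXT>
	</DOC>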

Each large text corpus includes documentation that explains the SGML markup format in detail, and a Document Type Definition (DTD) file that can be used to process the data with an SGML parsing utility, such as "nsgmls" (available as part of James Clark's SP Package, widely accepted as "the SGML parser of (p)reference" in research and industry).

In any case, another general rule is that our markup formatting is kept relatively simple and shallow, to make it easy for an automatic process to use, ignore or remove the SGML tagging, and/or filter the data to retain or discard text content according to particular research needs. An SGML parser can be useful, but is not essential; the data can readily be adapted to many uses by means of basic UNIX-style command line utilities (grep, awk, sed, tr), simple operations with any high-level scripting language (Perl, Python, Ruby, etc), or a range of existing software applications that can handle plain-text data in large quantity (concordancers, etc).
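
For instance, in corpora where these tags appear on lines by themselves, a one-line sed command is enough to pull out just the text content (the file name is hypothetical, and the <TEXT> lines themselves are kept in the output):

	sed  -n  '/<TEXT>/,/<\/TEXT>/p'  LDC_file.sgml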

Some of the smaller text corpora are not in SGML format, particularly the lexicons and the collections of transcripts from telephone or other miscellaneous speech corpora. The "Hub-4" Broadcast News transcript collections use an SGML format that is significantly different from that of the larger TIPSTER and newswire collections. Again, each corpus includes documentation about the text format being provided.

Text Compression

We often publish large text corpora on CD-ROM with the text files in compressed form. As with TAR files, there are programs available for the most common systems that will handle the uncompression, and they are easy to find and use. UNIX users can simply use "gunzip" on compressed text files.

Depending on your system, your tools and your goals in using the data, the uncompressed output may either be stored on a hard disk for further use, or "piped" from gunzip directly to some other process for analysis or conditioning. If you decide to store the uncompressed data on disk, make sure you have enough free space to hold the results: in general, you should expect an uncompressed text file to consume roughly three times the disk space of the corresponding compressed file.
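
For example, assuming each document begins with a <DOC ...> tag at the start of a line (true of most of these corpora), you can count the documents in a compressed file without ever storing the uncompressed text:

	gunzip  -c  LDC_file.gz  |  grep  -c  '^<DOC[ >]'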

Speech (Digital Audio) Files

Nearly all LDC speech corpora are published with the speech files in NIST SPHERE format; this involves a simple, flexible and self-describing file header, typically 1024 bytes long, followed by the raw (binary) sample data. The header provides important information (in plain ASCII text) about the speech data in the file, such as the number of samples, the sampling rate, the number of channels, and the kind of sample encoding, as well as whether the speech data are compressed or not (note that the header information itself is never compressed, and so is always readable as plain text).
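
Because the header is plain text, you can inspect it with ordinary tools; for example (the file name is hypothetical):

	head  -c  1024  LDC_file.sph

This prints the header fields, such as sample_count, sample_rate, channel_count and sample_coding.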

In many of our very large audio corpora, we use Tony Robinson's "shorten" package to compress the data. Shorten typically provides lossless compression ratios of up to 2:1 on 16-bit audio files; when uncompressed, the resulting data file is bit-for-bit identical to the original data prior to compression (just as with the text compression described above).

There is software available from NIST for most UNIX systems, providing utilities to manipulate SPHERE files (compress or uncompress using shorten, read or modify header contents, remove the header, extract portions from waveform files, de-multiplex two-channel files, etc).
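
For instance, in recent releases of the NIST package, "w_decode" is the utility that uncompresses a shortened SPHERE file; a typical invocation looks something like this (check the package documentation for the exact options; the file names here are hypothetical):

	w_decode  -o pcm  compressed.sph  uncompressed.sph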

Of course, people who don't use a UNIX system (or who need to do things that the NIST SPHERE utilities don't support) will tend to want some other format for the speech files.

The LDC now provides two simple, stand-alone programs that will convert SPHERE speech files to other formats, particularly MS "WAV" (aka RIFF) or headerless (raw) format. These tools automatically determine if a given SPHERE input file has been compressed with "shorten", and uncompress the data for output.

We provide two different file conversion programs, because there are two different strategies that users typically apply when working with large data collections:

  • Convert lots of files at once, as easily and quickly as possible; in this approach, you want to say where to find the original data, where to put the converted data, and what needs to be done in the conversion process. The tool for this approach is called "sph_convert", and is available for Wintel and Macintosh systems.

  • Convert one file at a time, in a simple and consistent manner; in this approach, there is more flexibility about where the output goes (redirect it to a chosen file, or feed it directly, via a pipeline command, to some other process), and there are more options for controlling the conversion process. The tool for this approach is called "sph2pipe", and is available for Wintel and UNIX systems (see the sketch just below).
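
As a sketch of that second approach (the file names are illustrative; see the sph2pipe readme below for the full option list), converting one channel of a two-channel SPHERE file to a RIFF/WAV file might look like this:

	sph2pipe  -f wav  -c 1  LDC_file.sph  >  LDC_file.wav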

Note that both tools have been designed to handle all types of SPHERE audio data automatically: "shorten" compressed or not, single or dual channel, 16-bit PCM or 8-bit mu-law, any sampling rate. The tools determine the properties of each file by reading the SPHERE header, so all the user needs to do is specify how the output data should be structured, and where it should go. Options for controlling output include converting mu-law data to PCM and selecting one channel from a two-channel input file. Follow the links below to see the "readme" files for these programs, and to download the appropriate version (these links all point to files at ftp://ftp.ldc.upenn.edu/pub/misc_sw/):

sph_convert_v2_0.README_1ST

sph_convert_v2_1.zip (for Wintel)

sph_convert_v2_0.sit (for Mac OS9 or "Classic MacOS" users)

sph2pipe_v2.5.README_1ST

sph2pipe_v2.5.tar.gz (for UNIX, MacOSX and Wintel)

Note that the LDC tools (and the NIST and other utilities mentioned above) do not support sample-rate conversion. For that, a more powerful tool for waveform file conversion is the "SoX" utility maintained at SourceForge.net:

http://sox.sourceforge.net/

The current SoX package includes both source code (suitable for UNIX or Wintel) and a compiled executable (for Wintel). The current version knows how to read SPHERE files as input -- but it cannot handle files that are shorten-compressed. In addition, it provides a wide range of possible output options, including many alternative file formats (AU, AIFF, and more), changes in sampling rate, and even some nifty sound-effects (if you happen to want that sort of thing).
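
For instance, a SoX command to resample a WAV file to 16 kHz might look like this (the file names are hypothetical, and older SoX releases may require naming the resampling effect explicitly):

	sox  LDC_file.wav  -r 16000  LDC_file_16k.wav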

In terms of other software packages for handling the LDC's compressed speech data, the official source for the most current, best-supported version of shorten for Wintel systems is http://www.etree.org/shncom.html. This version can compress your own speech data as well as uncompress data from the LDC. For users of MacOSX and Linux systems, http://www.hornig.net/shorten.html should provide equivalent functionality.
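
With the command-line builds of shorten, decompression is typically a single command of the form (file names hypothetical):

	shorten  -x  LDC_file.shn  LDC_file.wav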

