Using LDC Data

LDC distributes corpora in two ways: on physical media (e.g., DVDs, hard drives) or as web-downloadable compressed files (.tgz, .tar.gz or .zip extensions). The distribution method depends mainly on corpus size. Web downloads always consist of one or more compressed files; corpora delivered on media may also be compressed, in part or in full.

Extraction of Compressed Files

There are a variety of options for unpacking compressed files. Unix-like systems generally include a pre-installed tar [1] program. The recommended way to unpack a file with that program is the following command entered in a terminal or command line window:

tar xzf LDC_corpus.tgz

The string “xzf” is a command-line argument that tells the tar program to:

x – extract data
z – decompress (gunzip) the archive in the process
f – read the archive from the file named next on the command line
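
As a minimal sketch, assuming a tar implementation that supports the t (list) mode and the -C (change directory) option, as GNU tar and bsdtar do, the helper below lists an archive's contents and then extracts it into a chosen directory. The function name extract_corpus and the destination argument are illustrative and not part of any LDC tooling.

```shell
# extract_corpus ARCHIVE DEST_DIR
# Lists the members of a gzip-compressed tar archive, then unpacks them
# under DEST_DIR. (Hypothetical helper name, for illustration only.)
extract_corpus() (
    archive=$1
    dest=$2
    # t = list members instead of extracting; stops here if the archive is unreadable
    tar tzf "$archive" || exit 1
    mkdir -p "$dest" || exit 1
    # -C changes to DEST_DIR before extracting
    tar xzf "$archive" -C "$dest"
)
```

For example, extract_corpus LDC_corpus.tgz ./corpora would print the member list and unpack the archive under ./corpora. Listing first is a cheap way to catch a truncated download before any files are written.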

Very large corpora may be available as a multi-part zip download. In this case, all files must be downloaded to access the corpus. Their extensions will be *.zip.XXX where XXX is the part number. The command line for using 7-zip to extract multi-part archives is as follows:

7z x LDC_corpus.zip.001

x – extract files with their full paths

Note that both on the command line and with GUI programs, it is only necessary to open the *.zip.001 file. The compression program will automatically extract the entire corpus from all parts. There are a variety of programs to decompress tar and multi-part zip files. For handling multi-part zip files, LDC recommends 7-zip [2] on Windows, Keka [3] on macOS, and p7zip [4] on Linux.
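
Because extraction needs every part, it can help to confirm that the downloaded parts are numbered without gaps before starting. The following is a sketch under stated assumptions: the function name check_parts is hypothetical, parts are assumed to use three-digit suffixes starting at .001, and the check cannot detect a missing final part, so the reported count should still be compared with the number of parts listed for the download.

```shell
# check_parts BASENAME
# Succeeds only if BASENAME.001 .. BASENAME.NNN exist with no gaps in the
# numbering. (Hypothetical helper; assumes three-digit part suffixes.)
check_parts() (
    base=$1
    # Expand the glob into the positional parameters; if nothing matches,
    # the pattern stays literal and the -e test below fails.
    set -- "$base".[0-9][0-9][0-9]
    [ -e "$1" ] || { echo "no parts found for $base" >&2; exit 1; }
    count=$#
    i=1
    while [ "$i" -le "$count" ]; do
        part=$(printf '%s.%03d' "$base" "$i")
        [ -f "$part" ] || { echo "missing part: $part" >&2; exit 1; }
        i=$((i + 1))
    done
    echo "$count parts present"
)
```

A possible use is check_parts LDC_corpus.zip && 7z x LDC_corpus.zip.001, which only starts extraction when the numbering is contiguous.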

Data

LDC makes every effort to maintain a uniform basic file structure for all corpora. The preferred structure is as follows:

A root folder, generally given a shortened version of the corpus title and a disc number if necessary, contains the following sub-directories:

data: The majority of the data provided in the corpus. Sub-directories are used to split the data into sections such as speech and transcripts.
docs: Readmes, academic papers, a file list and any other documentation helpful to the user.
dtd: If the data includes files (e.g., XML, SGML) that reference a particular DTD, that DTD is contained in this folder.
tools: Any software, scripts or other tools distributed with the corpus.

All corpora contain an index.html file in the root folder that summarizes the data, its directory structure and other pertinent information. File.tbl contains a tab-delimited list of all files with their size, last access date and MD5 checksum. In older corpora, this file may contain only a file list and checksums, or merely a file list.
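
The checksums in the file table can be used to confirm that a corpus was copied or downloaded intact. The sketch below is illustrative only: the function name verify_table is hypothetical, GNU md5sum is assumed to be installed, paths are assumed to be relative to the corpus root, and the column positions (path in column 1, checksum in column 4) are assumptions that should be adjusted to the actual layout of the table.

```shell
# verify_table TABLE ROOT [PATH_COL] [MD5_COL]
# Checks every file listed in a tab-delimited table against its MD5 checksum.
# ASSUMPTIONS: paths are relative to ROOT; the path is in column 1 and the
# checksum in column 4 unless overridden; GNU md5sum is available.
verify_table() (
    table=$1
    root=$2
    path_col=${3:-1}
    md5_col=${4:-4}
    # Rewrite each row as "<md5>  <path>", the format md5sum -c expects,
    # then verify from inside ROOT so the relative paths resolve.
    awk -F '\t' -v p="$path_col" -v m="$md5_col" \
        'NF >= p && NF >= m { print $m "  " $p }' "$table" |
        ( cd "$root" && md5sum -c --quiet - )
)
```

With --quiet, md5sum prints only the files that fail, and the function returns non-zero if any checksum does not match.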

Note that any updates to a corpus will be indicated on the associated catalog entry under an “Updates” section. New distributions of a corpus will always be fully up to date. Contact LDC for any questions about previous versions. Major updates will be given a new catalog ID and released as a new version.

The catalog metadata field “Related Works” is a set of controlled terms for relations between corpora. All LDC publications have been catalogued for related works, though not all corpora have relations. The schema [5] is based on a taxonomy developed by META-SHARE and LDC, with modifications relevant to LDC’s language resources.

Text Data

Most text data is released as plain text or in XML format. LDC prefers simple and shallow formatting to make it easy for an automatic process to use, ignore or remove the markup tagging. This also aids users in filtering data to retain or discard text content for particular research needs. The DTD explains the markup format in detail and is also used to validate the corpus using either onsgmls [6] for SGML or xmllint [7] for XML.
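
Users can run the same kind of validation on their own copy of the data. As a minimal sketch, assuming xmllint is installed and the corpus ships a DTD in its dtd folder, the hypothetical helper validate_xml below validates one or more XML files against a DTD using xmllint's --noout and --dtdvalid options. It illustrates the approach and is not LDC's own validation procedure.

```shell
# validate_xml DTD FILE...
# Validates each XML file against DTD, prints the names of failing files,
# and returns non-zero if any file fails. (Hypothetical helper name.)
validate_xml() (
    dtd=$1
    shift
    status=0
    for f in "$@"; do
        # --noout suppresses the parsed document;
        # --dtdvalid validates against the given DTD file
        if ! xmllint --noout --dtdvalid "$dtd" "$f" 2>/dev/null; then
            echo "INVALID: $f"
            status=1
        fi
    done
    exit "$status"
)
```

A call such as validate_xml dtd/corpus.dtd data/*.xml (file names here are placeholders) checks a whole directory in one pass. Dropping the 2>/dev/null redirection shows xmllint's detailed error messages.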

Speech (Audio) Data

Speech data is released in NIST SPHERE, FLAC, MS WAV or MP3 format. Data in very large SPHERE corpora is compressed using shorten. All audio files are checked to make sure they contain valid headers.

NIST provides software [8] to manipulate SPHERE audio files. Users who do not have a UNIX system, or who need to do things that the NIST utilities cannot do, may want SPHERE files in another format. LDC provides two simple, stand-alone programs that convert SPHERE speech files to other formats, particularly MS WAV or header-less (raw) format. These tools also work with “shortened” SPHERE files. More information about those programs can be found on the Tools page [9].

A powerful tool for un-shortened SPHERE files is the SoX [10] utility. To use SoX on “shortened” SPHERE files, one must first convert them using the tools mentioned above.
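
The following sketch shows one way to tell the two cases apart before calling SoX. It rests on assumptions: that the SPHERE header is plain text within the first 1024 bytes of the file, that shorten compression is recorded in the header's sample_coding field (for example "pcm,embedded-shorten-v2.00"), and that the installed SoX was built with SPHERE support. The function names is_shortened and sph_to_wav are hypothetical.

```shell
# is_shortened FILE
# Succeeds if the SPHERE header of FILE mentions shorten compression.
# ASSUMPTION: the header fits in the first 1024 bytes and records
# compression in its sample_coding field.
is_shortened() {
    head -c 1024 "$1" | grep -a 'sample_coding' | grep -q 'shorten'
}

# sph_to_wav IN.sph OUT.wav
# Converts an uncompressed SPHERE file to WAV with SoX, which infers the
# formats from the file extensions; refuses shortened input.
sph_to_wav() {
    if is_shortened "$1"; then
        echo "$1 is shorten-compressed; convert it first" >&2
        return 1
    fi
    sox "$1" "$2"
}
```

Files whose header has no sample_coding field are treated as uncompressed and passed straight to SoX.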

Video Data

Video data is published in the following formats: AVI, DV, MPEG-PS (VOB), MPEG-TS, MOV and MP4. All video data is checked using ffprobe [11] (a tool distributed with FFmpeg) to ensure validity. FFmpeg [12] is also a good utility for users wishing to manipulate video data.
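
Users can run a similar check on their own copies. The sketch below assumes ffprobe is installed; the function name check_video and the particular fields requested are illustrative choices, not the checks LDC itself performs.

```shell
# check_video FILE
# Succeeds if ffprobe can open FILE and read its container and stream
# metadata; prints the container format, duration and per-stream codecs.
# (Hypothetical helper name.)
check_video() {
    # -v error: print only error messages
    # -show_entries: limit the report to the selected format/stream fields
    # -of default=noprint_wrappers=1: plain key=value output
    ffprobe -v error \
        -show_entries format=format_name,duration:stream=codec_type,codec_name \
        -of default=noprint_wrappers=1 "$1"
}
```

A non-zero exit status indicates that ffprobe could not read the file as a media container, which is usually a sign of a damaged or incomplete copy.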

More Resources

Visit the Tools [9] page for LDC-developed software packages that may facilitate data use. Users should also consult corpus documentation for information about a particular data set. For additional assistance with data questions, contact LDC [13].