Using LDC Data

LDC distributes corpora in two ways: on physical media (e.g., DVDs, hard drives) or as web-downloadable compressed tar files (.tgz or .tar.gz extensions). The distribution method depends mainly on corpus size. While web downloads are always a single compressed file, corpora delivered on media may also be compressed, either in part or in full.

Extraction of Compressed Files

There are a variety of options for unpacking compressed files. Unix-like systems generally include a pre-installed tar program. The recommended way to unpack a file with that program is the following command entered in a terminal or command line window:

tar xzf LDC_corpus.tgz

The string “xzf” is a command-line argument that tells the tar program to:

  • x – extract data
  • z – decompress the gzip-compressed archive in the process
  • f – read the archive from the file named next on the command line

Windows systems typically do not include a natively installed program to handle tar files. Windows users should use third-party software such as 7-Zip for that purpose. Alternatively, a Unix-like environment such as Cygwin gives Windows users access to UNIX-based programs such as tar.
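For users who have Python available, the same unpacking step can be sketched with the standard tarfile module, which behaves the same on Unix-like and Windows systems. This is only an illustration, not an LDC-provided tool; the function name extract_corpus and the safety check on member paths are assumptions added for the example.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir="."):
    """Unpack a gzip-compressed tar archive, roughly like `tar xzf`."""
    dest = Path(dest_dir).resolve()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Refuse entries that would land outside the destination folder.
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return dest
```

Calling extract_corpus("LDC_corpus.tgz") unpacks the archive into the current directory; a second argument selects a different destination folder.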

Data

LDC makes every effort to maintain a uniform basic file structure for all corpora. The preferred structure is as follows:

A root folder, generally named with a shortened version of the corpus title and a disc number if necessary, contains the following sub-directories:

  • data: The majority of the data provided in the corpus. Sub-directories are used to split the data into sections such as speech and transcripts.
  • docs: Readmes, academic papers, a file list and any other documentation helpful to the user.
  • dtd: If the data includes files (e.g., XML, SGML) that reference a particular DTD, that DTD is stored in this folder.
  • tools: Any software, scripts or other tools distributed with the corpus.

All corpora contain an index.html file in the root folder that summarizes the data, its directory structure and other pertinent information. File.tbl contains a tab-delimited list of all files with their sizes, last access dates and MD5 checksums. In older corpora, this file may contain only a file list with checksums, or merely a file list.
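A file table of this kind can be used to confirm that a corpus was copied or unpacked intact. The sketch below is a rough illustration only. It assumes that each line has exactly the column order path, size, date, MD5 checksum, and that paths are relative to the corpus root; the actual layout may differ between corpora and should be checked first. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(table_path, root_dir):
    """List (path, reason) pairs for files that disagree with the table.

    ASSUMPTION: each line is "path<TAB>size<TAB>date<TAB>md5", with
    paths relative to root_dir. Verify this against the real table.
    """
    problems = []
    with open(table_path, encoding="utf-8") as table:
        for line in table:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4 or not fields[1].strip().isdigit():
                continue  # skip blank, header or differently shaped lines
            rel_path, size, _date, checksum = fields[:4]
            target = Path(root_dir) / rel_path
            if not target.is_file():
                problems.append((rel_path, "missing"))
            elif target.stat().st_size != int(size):
                problems.append((rel_path, "size"))
            elif md5_of(target) != checksum.strip().lower():
                problems.append((rel_path, "md5"))
    return problems
```

An empty result means every listed file was found with the expected size and checksum. Older tables that lack the size or checksum columns would need a simpler variant.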

Text Data

Most text data is released in SGML or XML format. LDC prefers simple and shallow formatting to make it easy for an automatic process to use, ignore or remove the markup tagging. This also aids users in filtering data to retain or discard text content for particular research needs. The DTD explains the markup format in detail and is also used to validate the corpus, using onsgmls for SGML or xmllint for XML.
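Because the markup is shallow, removing it can often be done with a very small script. The following is a minimal sketch, assuming that tags never contain a literal ">" inside attribute values or comments and that only standard character entities are used; corpora that define their own SGML entities would need additional handling. The sample document in the usage lines is invented for illustration.

```python
import html
import re

# Matches anything that looks like an SGML/XML tag: <DOC>, </P>, <TEXT id="1">.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Remove tags and decode standard character entities."""
    without_tags = TAG_PATTERN.sub(" ", text)
    decoded = html.unescape(without_tags)
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(decoded.split())


sample = '<DOC id="x1">\n<TEXT>\n<P>Hello &amp; welcome.</P>\n</TEXT>\n</DOC>'
print(strip_markup(sample))  # prints: Hello & welcome.
```

For anything beyond quick filtering, a real XML or SGML parser driven by the corpus DTD is the safer choice.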

Speech (Audio) Data

Speech data is released in NIST SPHERE, FLAC, MS WAV or MP3 format. Data in very large SPHERE corpora is compressed using shorten. All audio files are checked to make sure they contain valid headers.

NIST provides software to manipulate SPHERE audio files. Users who do not have a UNIX system, or who wish to do things that the NIST utilities cannot do, may want SPHERE files in another format. LDC provides two simple, stand-alone programs that convert SPHERE speech files to other formats, particularly MS WAV or header-less (raw) format. These tools also work with “shortened” SPHERE files. More information about those programs can be found on the Tools page.

A powerful tool for un-shortened SPHERE files is the SoX utility. To use SoX on “shortened” SPHERE files, one must first convert them using the tools mentioned above.
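To decide whether a SPHERE file needs to be converted before SoX can read it, one can inspect its plain-text header. The sketch below assumes the usual SPHERE layout: a first line reading NIST_1A, a second line giving the header size in bytes, then "name type value" fields terminated by end_head. It also assumes that shortened files mention "shorten" in their sample_coding field. It is an illustration, not a replacement for the NIST or LDC tools, and the function names are hypothetical.

```python
def read_sphere_header(path):
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    with open(path, "rb") as handle:
        if handle.readline().strip() != b"NIST_1A":
            raise ValueError("not a NIST SPHERE file")
        header_size = int(handle.readline().strip())
        handle.seek(0)
        raw = handle.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in raw.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # blank or comment line
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN": a string of N characters
            fields[name] = value
    return fields


def is_shortened(fields):
    """ASSUMPTION: shortened files name "shorten" in sample_coding."""
    return "shorten" in str(fields.get("sample_coding", ""))
```

A file for which is_shortened returns True would first go through the conversion tools mentioned above; others can be passed to SoX directly.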

Video Data

Video data is published in the following formats: AVI, DV, MPEG-PS (VOB), MPEG-TS, MOV and MP4. All video data is checked using ffprobe (a companion tool distributed with FFmpeg) to ensure validity. FFmpeg is also a good utility for users wishing to manipulate video data.
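Users can run a similar check on their own copies. The following sketch calls ffprobe from Python and returns its report; it assumes ffprobe is installed and on the PATH, and it is not a description of the exact checks LDC performs. The function names are hypothetical.

```python
import json
import shutil
import subprocess


def ffprobe_command(path):
    """Build an ffprobe call that reports container and stream info as JSON."""
    return [
        "ffprobe",
        "-v", "error",      # print only errors on stderr
        "-show_format",     # container-level metadata
        "-show_streams",    # per-stream metadata (codec, duration, ...)
        "-of", "json",      # machine-readable output
        path,
    ]


def probe_video(path):
    """Return ffprobe's JSON report as a dict, or None if probing fails."""
    if shutil.which("ffprobe") is None:
        return None  # ffprobe is not installed or not on PATH
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True
    )
    if result.returncode != 0:
        return None  # ffprobe could not read the file
    return json.loads(result.stdout)
```

A None result means either that ffprobe is unavailable or that it could not parse the file; otherwise the "streams" entry of the returned dictionary lists the video and audio streams it found.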

More Resources

Visit the Tools page for LDC-developed software packages that may facilitate data use. Users should also consult corpus documentation for information about a particular data set. For additional assistance with data questions, contact LDC.