Using LDC Data
The LDC distributes corpora in two ways: CD- or DVD-ROM's shipped to
users, and UNIX TAR files made available to users via web download through the LDC Intranet. CD-ROM's and DVD-ROM's contain text and/or
speech data, presented in a machine-independent directory structure
(ISO 9660); these discs are the only practical media for data
collections involving hundreds or thousands of megabytes (MB).
TAR files via web download are only practical for relatively small
data sets (250 MB or less);
these are typically text-only, such as transcriptions or lexicons.
(Sometimes, we produce CD-ROM's that contain TAR files.)
Whether the distribution involves speech or text, via CD-ROM or web download,
we often apply data compression to make the package smaller and easier
to distribute. This page describes tools and procedures helpful for
handling the data distributed by the LDC.
Extraction from TAR Files
Software to uncompress and unpack a UNIX TAR file is available for the
most common computer systems (Microsoft DOS/Windows and Macintosh).
If you have a non-UNIX machine and you don't yet have a program to
handle TAR files, search the web (www.gnu.org, www.simtel.net, www.aladdinsys.com are fairly
stable sources). Some sources provide free downloads, others provide
"shareware" (try now, pay later), and a few try to do a regular retail
business with modest purchase prices.
GNU Software for MS-Windows and MS-DOS on CD-ROM is now also available
upon request from the LDC, and we highly recommend this package for
simplicity and ease of use. Another good (free) source that provides
a fairly complete UNIX/Linux emulation package for MS-Windows is Cygwin: www.cygwin.com
The instructions to follow on non-UNIX machines will depend on which
"brand" of software is being used -- read the manual. (Don't worry,
Mac and Wintel applications usually try very hard to make the process
as simple as possible.)
UNIX users will tend to find the GNU version of "tar" to be most
convenient (this is what comes standard on Linux systems); with this
version, any compressed TAR file can be unpacked as follows:
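For example, with a hypothetical downloaded file named LDC_corpus.tgz (substitute the actual file name):

  tar xzf LDC_corpus.tgz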
The string "xzf" is a command-line argument that tells the tar
program:
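  x -- extract files from the archive
  z -- filter the archive through gunzip to uncompress it
  f -- read the archive from the file named in the next argument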
UNIX users who are stuck with some other version of tar would need to
complicate the process slightly (the Wintel version of GNU tar must
also use this approach):
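For example, again with a hypothetical file named LDC_corpus.tgz:

  gunzip -c LDC_corpus.tgz | tar xf -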
Note the essential dash ("-") that follows "tar xf"; it tells tar to read the archive from standard input. (Sorry,
but you will need to install the GNU compression utilities gzip and
gunzip, if you don't have them already. Might as well get GNU tar
while you're at it.)
Text Files
Most of our very large text collections (TIPSTER, AQUAINT, TDT,
Newswire and Gigaword Corpora) consist of streams of "documents" with
SGML (Standard Generalized Markup Language) tags -- similar in nature
to HTML and XML. The tags serve to mark document boundaries (e.g.
individual news stories), as well as important components or divisions
within each document (unique document identifiers, headlines,
paragraph boundaries, etc). The names and organization of tags will
vary from one corpus to the next, or from one data source to another
within a single corpus, but a couple of fundamental rules for markup
are consistently observed across all corpora of this sort: the <DOC> tag
always marks document boundaries, and the <TEXT> tag is always
found within each document to mark the beginning and end of actual
text content (separating this from "meta-data" that often accompanies
the text, such as headlines and bylines in newswire stories).
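As a rough illustration (the tag names other than <DOC> and <TEXT>, and the identifier shown, are made up for this example; actual formats vary by corpus), a single document in such a stream might look like this:

  <DOC>
  <DOCNO> sample-19950101.0001 </DOCNO>
  <HEADLINE> A sample headline </HEADLINE>
  <TEXT>
  <P> First paragraph of the story text ... </P>
  <P> Second paragraph ... </P>
  </TEXT>
  </DOC>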
Each large text corpus includes documentation that explains the SGML
markup format in detail, and a Document Type Definition (DTD) file
that can be used to process the data with an SGML parsing utility,
such as "nsgmls" (available as part of James Clark's SP Package, widely accepted as
"the SGML parser of (p)reference" in research and industry).
In any case, another general rule is that our markup formatting is
kept relatively simple and shallow, to make it easy for an automatic
process to use, ignore or remove the SGML tagging, and/or filter the
data to retain or discard text content according to particular
research needs. An SGML parser can be useful, but is not essential;
the data can readily be adapted to many uses by means of basic
UNIX-style command line utilities (grep, awk, sed, tr), simple
operations with any high-level scripting language (Perl, Python, Ruby,
etc), or a range of existing software applications that can handle
plain-text data in large quantity (concordancers, etc).
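For instance, a one-line filter like the following (a sketch, assuming a file named sample.sgml) pulls out just the material between each <TEXT> and </TEXT> pair, tags included; stripping the remaining tags is an equally simple next step with sed, grep -v, or a short Perl script:

  sed -n '/<TEXT>/,/<\/TEXT>/p' sample.sgml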
Some of the smaller text corpora are not in SGML format, particularly
the lexicons and the collections of transcripts from telephone or
other miscellaneous speech corpora. The "Hub-4" Broadcast News
transcript collections use an SGML format that is significantly
different from that of the larger TIPSTER and newswire collections.
Again, each corpus includes documentation about the text format being
provided.
We often publish large text corpora on CD-ROM with the text files in
compressed form. As with TAR files, there are programs available for
the most common systems that will handle the uncompression, and they
are easy to find and use. UNIX users can simply use "gunzip" on
compressed text files.
Depending on your system, your tools and your goals in using the data,
the uncompressed output may either be stored on a hard disk for
further use, or may be "piped" from gunzip directly to some other
process for analysis or conditioning. If you decide to store the
uncompressed data on disk, make sure you have enough free space to
hold the results: in general, you should expect an uncompressed text
file to consume up to 3 times more disk space than the corresponding
compressed file.
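For example, with a hypothetical compressed file named nyt199501.gz, one could either write the uncompressed text to disk or work with it on the fly (the grep pattern may need adjusting to the markup of a particular corpus):

  # write the uncompressed file to disk
  gunzip nyt199501.gz
  # or: count documents on the fly, leaving the compressed file in place
  gunzip -c nyt199501.gz | grep -c '^<DOC>'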
Speech (Digital Audio) Files
Nearly all LDC speech corpora are published with the speech files in
NIST SPHERE format; this involves a simple, flexible and
self-describing file header, typically 1024 bytes long, followed by
the raw (binary) sample data. The header provides important
information (in plain ASCII text) about the speech data in the file,
such as the number of samples, the sampling rate, the number of
channels, and the kind of sample encoding, as well as whether the
speech data are compressed or not (note that the header information
itself is never compressed, and so is always readable as plain text).
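Since the header is plain ASCII, it can be inspected with ordinary tools; for example (a sketch with a made-up file name and illustrative field values -- the exact set of fields varies from file to file):

  head -c 1024 sw2001.sph

might display something like:

  NIST_1A
     1024
  sample_rate -i 8000
  channel_count -i 2
  sample_n_bytes -i 1
  sample_coding -s4 ulaw
  sample_count -i 480000
  end_head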
In many of our very large audio corpora, we use Tony Robinson's
"shorten" package to compress the data. Shorten typically provides
lossless compression ratios of up to 2:1 on 16-bit audio files; when
uncompressed, the resulting data file is identical, byte for byte, to the
original data prior to compression (just as with the compressed text
files described above).
There is software
available from NIST for most UNIX systems, providing utilities to
manipulate SPHERE files (compress or uncompress using shorten, read or
modify header contents, remove the header, extract portions from
waveform files, de-multiplex two-channel files, etc).
Of course, people who don't use a UNIX system (or who have UNIX
software to do things that NIST SPHERE utilities don't do) will tend
to want some other format for the speech files.
The LDC now provides two simple, stand-alone programs that will
convert SPHERE speech files to other formats, particularly MS "WAV"
(aka RIFF) or headerless (raw) format. These tools automatically
determine if a given SPHERE input file has been compressed with
"shorten", and uncompress the data for output.
We provide two different file conversion programs, because there are
two different strategies that users typically apply when working with
large data collections: converting all of the audio files to the
preferred format in advance (the approach sph_convert is intended for),
or converting each file on demand as it is needed, often piping the
output directly to another program (the approach sph2pipe is intended for).
Note that both tools have been designed to handle all types of SPHERE
audio data automatically: "shorten" compressed or not, single or dual
channel, 16-bit PCM or 8-bit mu-law, any sampling rate. The tools
determine the properties of each file by reading the SPHERE header, so
all the user needs to do is specify how the output data should be
structured, and where it should go. Options for controlling output
include converting mu-law data to PCM and selecting a single channel
from a two-channel input file (a brief usage sketch follows the
download links below). Follow the links below to see the "readme"
files for these programs, and to download the appropriate version
(these links all point to files at ftp://ftp.ldc.upenn.edu/pub/misc_sw/):
sph_convert_v2_1.zip (for Wintel)
sph_convert_v2_0.sit (for Mac OS9 or "Classic MacOS" users)
sph2pipe_v2.5.tar.gz (for UNIX, MacOSX and Wintel)
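As a rough usage sketch for sph2pipe (the file names here are made up, and the program's own readme is the authoritative reference for its options):

  # convert a SPHERE file, shorten-compressed or not, to MS WAV format
  sph2pipe -f wav sw2001.sph sw2001.wav
  # same, but keep only channel 1 of a two-channel file
  sph2pipe -f wav -c 1 sw2001.sph sw2001_ch1.wav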
Note that the LDC tools (and the NIST and other utilities mentioned
above) do not support sample-rate conversion. Another, more powerful
tool for waveform file conversion that is useful for this is the "SoX"
utility maintained at SourceForge.net (sox.sourceforge.net).
The current SoX package includes both source code (suitable for UNIX or Wintel)
and a compiled executable (for Wintel). The current version knows how
to read SPHERE files as input -- but it cannot handle files that are
shorten-compressed. In addition, it provides a wide range of possible
output options, including many alternative file formats (AU, AIFF, and
more), changes in sampling rate, and even some nifty sound-effects (if
you happen to want that sort of thing).
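As an illustration (file names made up; very old SoX releases may require an explicit "resample" or "rate" effect at the end of the command), converting an uncompressed SPHERE file to a 16 kHz WAV file can be as simple as:

  sox sw2001.sph -r 16000 sw2001_16k.wav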
In terms of other software packages for handling the LDC's compressed
speech data, the official source for the most current, best-supported
version of shorten for Wintel systems is
http://www.etree.org/shncom.html. This version provides the
ability to compress your own speech data, as well as uncompressing
data from the LDC. For users of MacOSX and Linux systems,
http://www.hornig.net/shorten.html should provide equivalent
functionality.
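As a rough sketch (file names made up; consult the documentation of whichever shorten build you install), compressing and uncompressing a waveform file looks like this:

  # compress
  shorten mydata.wav mydata.shn
  # uncompress
  shorten -x mydata.shn mydata_copy.wav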