This two-CD-ROM set contains data from transcribed news broadcasts,
designated for use in the baseline language model (LM) for the 1996
CSR HUB4 Evaluation.
The LDC obtained the bulk of the data from broadcast news CD-ROMs
produced by Primary Source Media, Inc. This portion covers the
period from January 1992 to April 1996 and contains approximately one
gigabyte of uncompressed data. This release also includes about 36
megabytes of material received on floppy disks covering the period
from late May through June 1996, in a somewhat different format from
the bulk of the data.
The text data are presented in two forms: (1) a relatively unprocessed
("raw" or "sentence-tagged") form and (2) a fully processed
("conditioned," "verbalized-punctuation") form. The "raw" form
includes the header and footer information accompanying the articles,
such as network, show name, headline, copyright, credits and so
forth; the text and ancillary data are presented in a fairly
consistent (though simple) SGML format. The "processed" form contains
only the text content of the articles, together with SGML tags to mark
the boundaries of articles, paragraphs and sentences; the text content
has been modified by replacing numeric strings (dates, dollar amounts,
quantities) with orthographic strings (e.g. "nineteen ninety six"),
replacing abbreviations ("Inc.," "Ltd.," "Corp.," etc.) with
corresponding full-word forms and replacing punctuation characters
with corresponding word tokens (e.g. "," becomes "COMMA"). This
release also includes an archive of the tools used to create the
"processed" form of the data.
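
As a rough illustration of the kind of conditioning described above, the
following Python sketch expands a few abbreviations, verbalizes four-digit
years and replaces punctuation with word tokens. It is not the tool archive
shipped with this release. The function names, the abbreviation table, the
"PERIOD" token and the year-only number handling are all assumptions made
for the example; only the "," to "COMMA" mapping and the "nineteen ninety
six" rendering come from the description above. Dollar amounts and other
quantities are not handled.

```python
import re

# Hypothetical, deliberately tiny tables; real conditioning covers far more.
ABBREVIATIONS = {"Inc.": "Incorporated", "Ltd.": "Limited", "Corp.": "Corporation"}
# "COMMA" is taken from the description; "PERIOD" is an assumed token name.
PUNCTUATION = {",": "COMMA", ".": "PERIOD"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Only years 1100-1999 are treated as dates in this sketch.
YEAR = re.compile(r"\b1[1-9]\d\d\b")


def two_digits(n):
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def verbalize_year(match):
    """Read a four-digit year as two pairs, e.g. 1996 as nineteen ninety six."""
    year = match.group(0)
    high, low = int(year[:2]), int(year[2:])
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def condition(text):
    """Apply abbreviation, year and punctuation rewriting to one sentence."""
    # Expand abbreviations first so their trailing period is not later
    # mistaken for sentence punctuation.
    for abbreviation, full_form in ABBREVIATIONS.items():
        text = re.sub(r"\b" + re.escape(abbreviation), full_form, text)
    text = YEAR.sub(verbalize_year, text)
    # Replace each punctuation character with a space-delimited word token.
    text = re.sub(r"[,.]", lambda m: f" {PUNCTUATION[m.group(0)]} ", text)
    # Collapse the extra whitespace introduced above.
    return " ".join(text.split())


if __name__ == "__main__":
    print(condition("Acme Corp., founded in 1996, grew."))
```

The order of the steps matters in this sketch: abbreviations are expanded
before punctuation is tokenized, so the period in "Corp." does not turn into
a spurious "PERIOD" token.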
There are no updates at this time.
The Reduced Licensing Fee for this corpus is US$200.