| The ACL Data Collection Initiative disc contains text from: Wall
Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones,
Inc.; the Collins English Dictionary, Copyright 1979, William Collins
Sons Co., Ltd.; scientific abstracts provided by the U.S.
Department of Energy; and a variety of grammatically tagged and parsed
materials from the Treebank project at the University of Pennsylvania,
copyright 1990, 1991, University of Pennsylvania. The total amount of
uncompressed text is 620 Mbytes.
The many formats in which the originals of these texts came have all,
to one extent or another, been mapped into a markup language
consistent with the SGML standard (ISO 8879).
The format of the material from the Wall Street Journal uses a
labelled bracketing, expressed in the style of SGML, although no
formal SGML DTD is provided. The tag set has been modified by turning
the Dow Jones header categories into tags and by creating ad hoc tags
such as "". The original datelines are presented as
separate text units; the text is divided and tagged into paragraphs
and sentences with each sentence presented on a single line. Nothing
has been done to modify the typographical methods used to subdivide
headlines and stories into sections, nor are any of the text features
within sentences (quotes, ellipsis, etc.) normalized.
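As an illustration of reading SGML-style text that has no DTD, the following minimal Python sketch collects sentences into paragraphs from a one-sentence-per-line layout. The tag names (doc, hl, p, s) and the sample text are hypothetical, invented for the example; the tags actually used on the disc should be checked against the files themselves.

```python
import re

# Hypothetical sample: these tag names and sentences are invented for
# illustration and are not taken from the disc.
SAMPLE = """<doc>
<hl> Example headline </hl>
<p>
<s> First sentence of the story. </s>
<s> Second sentence, with "quotes" left as-is... </s>
</p>
<p>
<s> A new paragraph. </s>
</p>
</doc>
"""

# A line holding only an opening or closing tag, e.g. <p> or </p>.
TAG_LINE = re.compile(r"^<(/?)(\w+)>\s*$")
# A line holding one complete sentence element.
SENTENCE = re.compile(r"^<s>\s*(.*?)\s*</s>\s*$")


def paragraphs(text):
    """Return a list of paragraphs, each a list of sentence strings."""
    result = []
    current = None  # sentences of the paragraph being read, if any
    for line in text.splitlines():
        sentence = SENTENCE.match(line)
        if sentence and current is not None:
            current.append(sentence.group(1))
            continue
        tag = TAG_LINE.match(line)
        if tag and tag.group(2) == "p":
            if tag.group(1):  # closing </p>
                if current is not None:
                    result.append(current)
                current = None
            else:  # opening <p>
                current = []
    return result


if __name__ == "__main__":
    for number, sentences in enumerate(paragraphs(SAMPLE), start=1):
        print(number, sentences)
```

Because the text features within sentences are not normalized, the sketch leaves the sentence strings untouched and ignores every line that is not a paragraph or sentence element.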
The Collins English Dictionary is present in two forms. One form was
approximately parsed into fielded records as an exercise in learning a
language called "FIT", by a student working under the direction of
Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990.
The digital image of the typographer's tape from which the database
version was prepared had serious flaws that were not detected and
corrected until later; the corrected version, a clean typographer's
tape, is presented in a separate directory. A properly analyzed
database version will be provided in the future. The documentation
includes notes developed during the new attempt to analyze the tape
from scratch.
The Department of Energy abstracts reside in files that are
approximately one megabyte each. The original 950 separators have
been replaced with newlines, and the space padding between articles has
been removed. An acronym dictionary that was extracted from the database
as an indication of the material's topic areas has been included in a
separate directory.
Provisional material from the Penn Treebank project is divided into
two subdirectories on this disc. The subdirectory "postext" contains
text with part-of-speech annotations; "parstext" contains text with
syntactic bracketing.
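The exact notation of the provisional Treebank files is not described here. As a hedged sketch only, the Python below assumes two conventions that are common for this kind of material: part-of-speech text written as word/TAG tokens, and syntactic bracketing written with nested parentheses. Both conventions, the function names, and the sample strings are assumptions made for illustration and should be verified against the files in "postext" and "parstext".

```python
def split_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    Assumes the tag follows the last slash; tokens without a slash are skipped.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
    return pairs


def parse_brackets(text):
    """Parse a parenthesized bracketing into nested Python lists.

    Assumes '(' opens a constituent and ')' closes it; every other
    whitespace-separated token is kept as a string.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level constituents
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced closing bracket")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced opening bracket")
    return stack[0]


if __name__ == "__main__":
    # Invented sample strings, not taken from the disc.
    print(split_tagged("The/DT dog/NN barks/VBZ ./."))
    print(parse_brackets("(S (NP The dog) (VP barks))"))
```

Nested lists keep the sketch free of dependencies; a real reader would probably want a small tree class and handling for whatever comments or headers the files contain.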
Content Copyright
Portions © 1987-1989 Dow Jones & Company, Inc., © 1979 William Collins Sons Co., Ltd., © 1990, 1991, 1993 Trustees of the University of Pennsylvania |