| Original release was:
LDC Catalog No.: LDC94T4B-3.1
NIST Catalog No.: NA
LDC Release date: 4/94 (MY94)
original treebank release
This CD-ROM contains over 1.6 million words of hand-parsed material
from the Dow Jones News Service, plus an additional one million words
tagged for part-of-speech. This material is a subset of the language
model corpus for the DARPA CSR large-vocabulary speech recognition
project.
It also contains the first fully parsed version of the Brown Corpus,
which has been completely retagged using the Penn Treebank (PTB) tag
set. Tagged and parsed data from Department of Energy abstracts, IBM
computer manuals, MUC-3, and ATIS are included as well.
In addition, the CD-ROM includes source code for programs that were
used by the PTB project in creating portions of the data.
Source code is also included for "tgrep," a program that permits
the user to search for specific constituents in tree structures. All
software is provided "as is." (We have learned since publication that
the tgrep source code provided on the CD-ROM is not readily portable,
and that compiling tgrep requires modification of the source files. The
CD-ROM does include a pre-compiled program file for tgrep, built for
use on Sun SPARC systems.)
Release 2
The PTB Project Release 2 CD-ROM features the new PTB-2 bracketing
style, which is designed to allow the extraction of simple
predicate/argument structure. Over one million words of text are
provided with this bracketing applied, along with a complete style
manual explaining the bracketing and new versions of tools for
searching and processing bracketed data.
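As a rough illustration of what searching bracketed data involves, the
following Python sketch reads one parenthesized tree and collects the
words under every constituent with a given label. It is not tgrep and
is not part of the release. The function names (parse_tree, find,
words) are hypothetical, and the sample sentence is made up for this
example rather than taken from the corpus. It assumes trees are written
as nested parentheses with a label after each opening parenthesis, and
that empty elements carry the -NONE- tag.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening parenthesis"
        pos += 1
        # An unlabeled wrapper bracket gets an empty label.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        else:
            label = ""
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing parenthesis
        return (label, children)

    return read()


def find(tree, label):
    """Yield every subtree whose label equals `label`."""
    node_label, children = tree
    if node_label == label:
        yield tree
    for child in children:
        if isinstance(child, tuple):
            yield from find(child, label)


def words(tree):
    """Return the leaf words of a subtree, skipping -NONE- elements."""
    label, children = tree
    if label == "-NONE-":
        return []
    out = []
    for child in children:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


# Made-up sample sentence, for illustration only.
sample = ("(S (NP-SBJ (DT The) (NN company)) "
          "(VP (VBD sold) (NP (DT the) (NN unit))) (. .))")
tree = parse_tree(sample)
subjects = [" ".join(words(t)) for t in find(tree, "NP-SBJ")]
print(subjects)
```

Matching on the exact label keeps the sketch short. A real search tool
would also need patterns over dominance and precedence relations.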
This CD-ROM also contains all the annotated text material from the
earlier Treebank Preliminary Release, including the Brown Corpus.
While these materials have not all been converted to the newer
bracketing style, they have been cleaned up to remove problems that
had appeared in the earlier release.
The contents of Treebank Release 2 are as follows:
- One million words of 1989 Wall Street Journal material annotated in
Treebank-2 style.
- A small sample of ATIS-3 material annotated in Treebank-2 style.
- 300-page style manual for Treebank-2 bracketing, as well as the
part-of-speech tagging guidelines.
- The contents of the previous Treebank CD-ROM (Version 0.5), with
cleaner versions of the WSJ, Brown Corpus, and ATIS material
(annotated in Treebank-1 style).
- Tools for processing Treebank data, including "tgrep," a
tree-searching and manipulation package (note that usability of this
release of tgrep is limited: users of Sun SPARC systems should have no
problem, but others may find the software difficult or impossible to
port).
In addition, the PTB Project has provided some updates,
announcements and a discussion forum for users. A file of updates and
further information is available via anonymous FTP from
ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2.
The PTB project selected 2,499 stories from a three-year Wall
Street Journal (WSJ) collection of 98,732 stories for syntactic
annotation. These 2,499 stories have been distributed in both the
Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB.
Treebank-2 includes the raw text for each story. Three "map" files
are available in a compressed file via FTP and provide the relation between
the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER.
Detailed questions about the corpus may be sent to
treebank@ldc.upenn.edu, while questions and requests for obtaining
Treebank Release 2 should be sent to member-service@ldc.upenn.edu. |