Introduction
Brown Laboratory for Linguistic Information Processing
(BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style parsing
of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI
corpus, approximately 30 million words. The parsing and
part-of-speech (POS) tagging for the entire archive were done
using statistically-based methods developed by Eugene Charniak,
Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.
This corpus both overlaps and supplements the million-word Penn
Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
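
As an illustration of what a "Treebank-style" parse looks like to a program, the following is a minimal sketch of reading one bracketed tree and recovering its word/POS pairs. It assumes the usual Penn Treebank bracketing convention; the sample sentence and the helper names parse_bracketed and tagged_words are hypothetical and are not taken from the corpus or its tools, and the corpus's actual file layout is not described here.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumes Penn Treebank-style bracketing, where literal parentheses in
    the text are escaped (e.g. -LRB-, -RRB-), so "(" and ")" only mark structure.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Some Treebank files wrap each tree in an unlabeled outer bracket.
        if tokens[pos] == "(":
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def tagged_words(node):
    """Return (word, POS tag) pairs in left-to-right order."""
    label, children = node
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
        else:
            pairs.append((child, label))
    return pairs


# Hypothetical sample sentence, not drawn from the corpus.
sample = "(S (NP (DT The) (NN market)) (VP (VBD rose)) (. .))"
tree = parse_bracketed(sample)
print(tagged_words(tree))
```

The nested-tuple representation is chosen only to keep the sketch dependency-free; an established tree library would be a reasonable alternative for real work.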
Data
The PTB project selected 2,499 stories from a three-year WSJ
collection of 98,732 stories for syntactic annotation. These 2,499
stories have been distributed in both Treebank-2 (LDC95T7) and
Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the
raw text for each story. Three "map" files are available in a
compressed file via FTP and
provide the relation between the 2,499 PTB filenames and the
corresponding WSJ DOCNO strings in TIPSTER.
Updates
There are no updates at this time.
Copyright
Portions © 1987-1989 Dow Jones & Company, Inc., © 2000 Trustees of the University of Pennsylvania