Brown Laboratory for Linguistic Information Processing
(BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style parsing
of the three-year Wall Street Journal (WSJ) collection from the ACL/DCI
corpus, approximately 30 million words. The parsing and
part-of-speech (POS) tagging for the entire archive were done
using statistically based methods developed by Eugene Charniak,
Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson of BLLIP.
This corpus both overlaps and supplements the million-word Penn
Treebank (PTB) collection of parsed and POS-tagged WSJ texts.
The PTB project selected 2,499 stories from a three-year WSJ
collection of 98,732 stories for syntactic annotation. These 2,499
stories have been distributed in both Treebank-2 (LDC95T7) and
Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the
raw text for each story. Three "map" files are available in a
compressed file via FTP and
provide the relation between the 2,499 PTB filenames and the
corresponding WSJ DOCNO strings in TIPSTER.
There are no updates at this time.
Portions © 1987-1989 Dow Jones & Company, Inc., © 2000 Trustees of the University of Pennsylvania