Introduction
Brown Laboratory for Linguistic Information Processing
(BLLIP) North American News Text, Complete, LDC2008T13, ISBN 1-58563-482-4,
contains a Penn Treebank-style parsing of approximately 24 million sentences
from the North American News Text Corpus (LDC95T21). The North American News Text Corpus
consists of English news text from the Los Angeles Times-Washington Post (1994-1997),
the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall
Street Journal (1994-1996).
BLLIP North American News Text is released in two versions: BLLIP North American
News Text, Complete (LDC2008T13), a members-only corpus that contains
sentences from all sources in The North American News Text Corpus; and BLLIP
North American News Text, General Release (LDC2008T14), a corpus available to
nonmembers that does not include the Wall Street Journal data from The North
American News Text Corpus.
To complement the Complete and General Release versions of BLLIP North American
News Text, LDC is re-releasing The North American News Text Corpus in two versions.
North American News Text, Complete (LDC2008T15), the members-only original version,
is now available as a 2008 Membership Year corpus. North American News Text, General
Release (LDC2008T16), which does not include news text from the Wall Street
Journal, is available to nonmembers for the first time. The directory structure
of each of these publications has been reorganized to be identical to the directory
structure of the BLLIP releases.
Methodology
A key problem in natural language processing is syntactic ambiguity resulting
from uncertain relationships between words and their connections to sentence
clauses. Sentences that can be constructed with correct syntax in more than
one way are ambiguous, and such sentences generate multiple parse trees when
they are separated into clauses by parts of speech.
Traditional parsing techniques, such as part-of-speech (POS) tagging, typically
achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving
ambiguous sentences requires a probabilistic approach. Using the relative frequencies
of grammar rules, statistical processing techniques assign a probability to
each clause. These probabilities are then combined over each complete sentence
parse (in practice, by summing log probabilities), and a probability is assigned
to that sentence parse. In that way, the most likely parse can be determined.
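As a rough illustration of this idea, the sketch below scores two competing parses of a prepositional-phrase attachment ambiguity by summing the log probabilities of the grammar rules each parse uses. The rule probabilities, the rule inventory, and the names `RULE_PROBS`, `parse_log_prob`, and `most_likely` are all hypothetical and invented for the example; they are not taken from the parser used for this corpus.

```python
import math

# Hypothetical rule probabilities, standing in for relative frequencies
# estimated from a treebank. The numbers are made up for illustration.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("VBD", "NP", "PP")): 0.3,
    ("VP", ("VBD", "NP")): 0.7,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("DT", "NN")): 0.8,
}

# Two candidate parses of the same sentence, each listed as the rules it uses.
# Rules shared by both parses (such as the inside of the PP) are omitted.
VERB_ATTACH = [
    ("S", ("NP", "VP")),
    ("VP", ("VBD", "NP", "PP")),
    ("NP", ("DT", "NN")),
]
NOUN_ATTACH = [
    ("S", ("NP", "VP")),
    ("VP", ("VBD", "NP")),
    ("NP", ("NP", "PP")),
    ("NP", ("DT", "NN")),
]


def parse_log_prob(rules):
    """Sum the log probabilities of the rules used in one parse."""
    return sum(math.log(RULE_PROBS[rule]) for rule in rules)


def most_likely(parses):
    """Return the candidate parse with the highest log probability."""
    return max(parses, key=parse_log_prob)
```

Working in log space keeps the values in a manageable range for long sentences, which is consistent with the negative log probabilities reported in the data files.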
The data in this release was parsed into Penn Treebank-style parse trees using
a re-ranking parser developed by Eugene Charniak and Mark Johnson. The Charniak
and Johnson parser is statistically-based and uses a generative first stage
followed by a discriminative second stage. Both stages were trained on the Wall
Street Journal data in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42). BLLIP
1987-1989 WSJ Corpus Release 1 (LDC2000T43) contains a complete Treebank-style
parsing of that Wall Street Journal material.
In order to produce BLLIP North American News Text, the Charniak-Johnson parser
used a simplified context-free grammar in the first stage to generate a set
of n-best parses. Those parses were then pruned by eliminating the
parses at the edges of the distribution. In the second stage, a maximum entropy-based
parser using a complete grammar was applied. The output trees are ranked in
order of probability.
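The second-stage ranking can be pictured as rescoring each first-stage candidate with a weighted sum of features and sorting by that score. The sketch below is only a schematic of that idea: the `Candidate` type, the feature names, and the weights are assumptions made for illustration and do not reflect the actual feature set or training procedure of the Charniak-Johnson parser.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    tree: str                   # bracketed parse from the first stage
    features: Dict[str, float]  # hypothetical feature values for this parse


def rerank(candidates: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    """Sort candidates by a linear score over their features, highest first."""
    def score(candidate: Candidate) -> float:
        # Features without a weight contribute nothing to the score.
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate.features.items())

    return sorted(candidates, key=score, reverse=True)
```

In this sketch the first-stage log probability can simply be included as one feature among others, so the second stage is free to overrule the first-stage ordering.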
Data
The parses in BLLIP North American News Text include constituency and POS tagging
information for each of the 50-best parses of each sentence.
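Both kinds of information can be read directly off a bracketed tree. The following is a minimal sketch, assuming trees in the usual Penn Treebank bracketing where each leaf has the form `(TAG word)` and literal parentheses in the text are escaped (for example as `-LRB-` and `-RRB-`); the function names are hypothetical.

```python
import re


def pos_tags(tree: str):
    """Return (tag, word) pairs for the leaves of a bracketed parse tree."""
    # A leaf is "(TAG word)": a label, whitespace, then a token with no brackets.
    return re.findall(r"\(([^\s()]+)\s+([^\s()]+)\)", tree)


def constituent_labels(tree: str):
    """Return the labels of all non-leaf nodes, in left-to-right order."""
    # A non-leaf node is a label followed by another opening bracket.
    return re.findall(r"\(([^\s()]+)\s+(?=\()", tree)
```

Because the patterns accept any whitespace between a label and what follows, they also work on trees that have been wrapped across several lines.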
Each file contains a sequence of n-best lists. An n-best list is a list of
the top n parses of a sentence, each with its corresponding parser probability
and reranker score. The following is an example of a simple n-best list:
50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP
(RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS
institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government)
(CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP
(RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the)
(NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN
government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP
(RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS
institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government)
(CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP
(ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS
institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government)
(CC and) (NN parliament))))))))))) (. .)))
In the above example, the first number ("50") indicates the number
of parses. The next token is the article id from the North American News Text
Corpus ("reute9406_007.0356"), followed by an underscore, followed
by the number of the sentence in the article ("13"). The parses follow; for
brevity, only four parses out of the fifty are presented here.
Each parse consists of a reranker score (4.9244 for the first parse) and parser
log probability (-147.337 for the first parse), a new line, and then the parse
tree itself. Parse trees are given in Penn Treebank format. Note that the n-best
list is sorted by decreasing reranker scores.
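The format described above can be read with a short parser. The sketch below handles a single n-best list supplied as a string; it assumes the header and score lines each hold exactly two whitespace-separated fields, and that a tree may be wrapped across several lines, so it joins lines until the brackets balance. The names `Parse`, `NBestList`, and `parse_nbest` are hypothetical, and how consecutive lists are delimited within a file is not covered here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Parse:
    reranker_score: float
    log_prob: float
    tree: str


@dataclass
class NBestList:
    declared_count: int   # the first number on the header line
    article_id: str
    sentence_number: int
    parses: List[Parse]


def parse_nbest(text: str) -> NBestList:
    """Parse one n-best list: a header line followed by score/tree entries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    count, sentence_id = lines[0].split()
    # The sentence number follows the last underscore of the identifier.
    article_id, _, sentence_number = sentence_id.rpartition("_")

    parses = []
    i = 1
    while i < len(lines):
        score, log_prob = (float(x) for x in lines[i].split())
        i += 1
        tree_lines, depth = [], 0
        # Collect tree lines until every opening bracket has been closed.
        while i < len(lines):
            tree_lines.append(lines[i])
            depth += lines[i].count("(") - lines[i].count(")")
            i += 1
            if depth == 0:
                break
        parses.append(Parse(score, log_prob, " ".join(tree_lines)))

    return NBestList(int(count), article_id, int(sentence_number), parses)
```

Note that `declared_count` is kept separate from `len(parses)`, since an excerpt such as the one shown here contains fewer parses than the header declares.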
Source material is as follows:
| Source | Dates | Approx. # Words (millions) |
| Los Angeles Times & Washington Post | 1994-1997 | 52 |
| New York Times | 1994-1996 | 173 |
| Reuters (General and Financial) | 1994-1996 | 85 |
| Wall Street Journal (Not included in General Release) | 1994-1996 | 40 |
Pricing
The Reduced Licensing Fee for this corpus is US$800.
Content Copyright
Portions © 1994-1996 Dow Jones & Company, Inc., © 1994-1997 Los Angeles Times-Washington Post News Service, Inc.,
© 1994-1996 New York Times, © 1994-1996 Reuters America, Inc.,
© 1995-1997, 2008 Trustees of the University of Pennsylvania