Introduction
North American News Text, Complete, Linguistic Data Consortium (LDC) catalog
number LDC2008T15 and ISBN 1-58563-483-2, is a collection of English news text
from the Los Angeles Times, Washington Post, New York Times, Reuters and the
Wall Street Journal. This corpus was originally released in 1995 as the North
American News Text Corpus (LDC95T21) and is reissued to complement the
release of the Brown Laboratory for Linguistic Information Processing (BLLIP)
North American News Text sets (LDC2008T13, LDC2008T14), which consist of Penn
Treebank-style parses of that news text.
North American News Text is reissued in two versions: North American News Text,
Complete (LDC2008T15), the members-only original version, now available as a 2008
Membership Year corpus; and North American News Text, General Release (LDC2008T16),
which does not include text from the Wall Street Journal and is available
to nonmembers for the first time. The directory structure of each publication
has been reorganized to match the directory structure of the BLLIP
releases.
Data
The table below contains a breakdown of the sources, epochs and word counts
for the data in the North American News Text releases:
| Source | Dates | # Words (millions) |
|---|---|---|
| Los Angeles Times & Washington Post | May 1994 - August 1997 | 52 |
| New York Times News & Syndicate | July 1994 - December 1996 | 173 |
| Reuters News Service (General and Financial) | April 1994 - December 1996 | 85 |
| Wall Street Journal (not in General Release) | July 1994 - December 1996 | 40 |
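As a quick check on overall corpus size, the per-source figures in the table can be summed. The sketch below simply copies the table's word counts (in millions) into a dictionary; the variable names are illustrative and not part of the corpus.

```python
# Word counts in millions, copied from the table above.
WORDS_MILLIONS = {
    "Los Angeles Times & Washington Post": 52,
    "New York Times News & Syndicate": 173,
    "Reuters News Service": 85,
    "Wall Street Journal": 40,
}

# Complete release: all four source groups.
complete_total = sum(WORDS_MILLIONS.values())

# The General Release omits the Wall Street Journal text.
general_total = complete_total - WORDS_MILLIONS["Wall Street Journal"]

print(complete_total)  # 350
print(general_total)   # 310
```

On these figures, the Complete release holds roughly 350 million words and the General Release roughly 310 million.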
The New York Times and the Los Angeles Times/Washington Post services include
a range of other newspaper sources in their syndicated newswires. The Los Angeles
Times/Washington Post material in this corpus includes some news text from the
following sources:
- Newsday
- The Baltimore Sun
- The Hartford Courant
The New York Times material in this corpus contains some data from the following
sources, although New York Times articles predominate:
- Bloomberg Business News
- The Boston Globe
- Los Angeles Daily News
- Fort Worth Star-Telegram
- Newsweek
- Cox News Service
- The Arizona Republic
- Seattle Post-Intelligencer
- San Francisco Examiner
- Houston Chronicle
- San Francisco Chronicle
- Economist Newspaper Ltd.
- Hearst Newspapers
The text content of each data file (after decompression with the GNU gunzip
utility) consists of plain ASCII character data with SGML tags to indicate article
boundaries and the organization of information within each article.
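A data file can also be read without decompressing it on disk first. The following is a minimal sketch using Python's standard `gzip` module; the function name `read_corpus_file` is hypothetical, and the `errors="replace"` setting is a defensive assumption rather than something the corpus documentation calls for.

```python
import gzip


def read_corpus_file(path):
    """Return the decompressed contents of one gzip-compressed data file."""
    # The text is described as plain ASCII; errors="replace" is an assumption
    # so that a stray non-ASCII byte does not abort the read.
    with gzip.open(path, mode="rt", encoding="ascii", errors="replace") as handle:
        return handle.read()
```

The `path` argument is whichever data file the reader has on hand; no particular file name is assumed here.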
There are differences among the five primary newswire sources in terms of the
number and types of SGML tags used in the text, but the following tag structure
is common to all data sets:
<DOC>    -- start of a new article
...      -- some variety of "header" tags appears here
<TEXT>   -- start of the text content of the article
<p>      -- all paragraph boundaries are marked by this tag
...      -- text data as it is provided by the newswire service
</TEXT>  -- end of text content of the article
...      -- some variety of "trailer" tags appears here
</DOC>   -- end of article
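The common tag structure lends itself to a simple article splitter. The sketch below assumes the article, text and paragraph tags are `<DOC>`, `<TEXT>` and `<p>`, as in other LDC newswire releases; the function name `iter_article_paragraphs` is hypothetical. It uses regular expressions rather than a full SGML parser, which is a simplification and ignores the source-specific header and trailer tags.

```python
import re

# Assumed tag names: <DOC> for an article, <TEXT> for its body, <p> for paragraphs.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
PARA_RE = re.compile(r"<p>", re.IGNORECASE)


def iter_article_paragraphs(sgml):
    """Yield one list of paragraph strings per article found in `sgml`."""
    for doc in DOC_RE.finditer(sgml):
        body = TEXT_RE.search(doc.group(1))
        if body is None:
            continue  # article without a text region; skip it
        chunks = PARA_RE.split(body.group(1))
        # Collapse internal line breaks and drop empty chunks.
        yield [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
```

Each yielded list holds the paragraphs of one article in order, with header and trailer material discarded.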
In general, the differences in format among the various newswire sources will
be found in the SGML tags that appear between <DOC> and <TEXT>,
and those that appear between </TEXT> and </DOC>. The actual text
content of articles (the region between <TEXT> and </TEXT>) is consistent
in format across sources, except for some uses of the SGML "&..;"
notation to represent special characters in the data. For example, "&MD;"
is used in the "latwp" material to represent the "em-dash"
character, which is typically used to separate the "dateline" from
the opening sentence in the first paragraph of each article. There may also
be differences in how quotation marks are rendered.
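Entity references such as "&MD;" can be normalized with a small lookup table. The sketch below maps only the one entity documented here; the names `ENTITY_MAP` and `replace_entities` are hypothetical, and any further entries would need to be confirmed against the data itself.

```python
import re

# Only "&MD;" is documented in the text; extend this mapping as other
# entities are encountered in the data.
ENTITY_MAP = {"MD": "\u2014"}  # U+2014 is the em-dash character

ENTITY_RE = re.compile(r"&([A-Za-z0-9]+);")


def replace_entities(text, mapping=ENTITY_MAP):
    """Replace known "&NAME;" entities; leave unknown ones untouched."""
    def substitute(match):
        return mapping.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(substitute, text)
```

Leaving unknown entities in place, rather than deleting them, keeps the output inspectable so that unmapped names can be found and added later.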
As this re-release is intended to complement the BLLIP North American News
Text releases, the directory structure of this corpus is identical to that of
the BLLIP publications.
Pricing
The Reduced Licensing Fee for this corpus is US$200.
Content Copyright
Portions © 1994-1996 Dow Jones & Company, Inc., © 1994-1997 Los
Angeles Times-Washington Post News Service, Inc., © 1994-1996 New York
Times, © 1994-1996 Reuters America, Inc., © 1995-1997, 2008 Trustees
of the University of Pennsylvania