Introduction
The New York Times Annotated Corpus contains over 1.8 million articles written and published
by the New York Times between January 1, 1987 and June 19, 2007, with article
metadata provided by the New York Times Newsroom, the New York Times Indexing
Service and the online production staff at nytimes.com.
The corpus includes:
- Over 1.8 million articles (excluding wire service articles that appeared
during the covered period).
- Over 650,000 article summaries written by library scientists.
- Over 1,500,000 articles manually tagged by library scientists with tags
drawn from a normalized indexing vocabulary of people, organizations, locations
and topic descriptors.
- Over 275,000 algorithmically-tagged articles that have been hand verified
by the online production staff at nytimes.com.
- Java tools for parsing corpus documents from XML into a memory-resident
object.
As part of the New York Times' indexing procedures, most articles are manually
summarized and tagged by a staff of library scientists. This collection contains
over 650,000 article-summary pairs, which may prove useful in the development
and evaluation of algorithms for automated document summarization. In addition, over
1.5 million documents have at least one tag. Articles are tagged for persons,
places, organizations, titles and topics using a controlled vocabulary that
is applied consistently across articles. For instance, if one article mentions
"Bill Clinton" and another refers to "President William Jefferson
Clinton", both articles will be tagged with "CLINTON, BILL".
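The benefit of a controlled vocabulary can be illustrated with a minimal Python sketch. The record layout, the article ids and the build_tag_index helper below are hypothetical and are not part of the corpus or its Java tools; only the "CLINTON, BILL" tag and the two mentions come from the example above.

```python
from collections import defaultdict

# Hypothetical records: an article id, a surface mention from the text,
# and the normalized person tags assigned to the article.
articles = [
    {"id": "a1", "mention": "Bill Clinton",
     "people": ["CLINTON, BILL"]},
    {"id": "a2", "mention": "President William Jefferson Clinton",
     "people": ["CLINTON, BILL"]},
    {"id": "a3", "mention": "Jane Doe",
     "people": ["DOE, JANE"]},
]


def build_tag_index(records):
    """Map each normalized tag to the ids of the articles that carry it."""
    index = defaultdict(list)
    for record in records:
        for tag in record["people"]:
            index[tag].append(record["id"])
    return dict(index)


tag_index = build_tag_index(articles)

# Both Clinton articles are found under one tag, even though the
# wording of the mention differs between them.
print(tag_index["CLINTON, BILL"])
```

Because the lookup key is the normalized tag rather than the text of the mention, no string matching or alias resolution is needed at query time.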
The New York Times has established a community website for researchers working on the data set at http://groups.google.com/group/nytnlp and encourages feedback and discussion about the corpus.
Data
The text in this corpus is formatted in the News Industry Text Format (NITF),
developed by the International Press Telecommunications
Council, an independent association of news agencies and publishers. NITF is
an XML specification that provides a standardized representation for the content
and structure of discrete news articles. NITF encompasses structural markup
such as bylines, headlines and paragraphs. The format also provides management
attributes for categorizing articles into topics, summarization, usage restrictions
and revision histories. The goals of NITF are to answer the essential questions
inherent in news articles: Who, What, When, Where and Why.
- Who: Who owns the copyright, who has rights to republish
the article and who the article is about.
- What: The subjects reported, the named entities inside
the article and the events it describes.
- When: When the article was written, when it was issued
and when it was revised.
- Where: Where the article was written, where the events
took place and where it was delivered.
- Why: The metadata describing the newsworthiness of the
article.
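As a rough illustration of the structural markup described above, the following Python sketch reads a headline, byline and paragraphs from a small NITF-style document using the standard library. The sample document is invented and heavily simplified, and the parse_article helper and the class="full_text" attribute value are assumptions made for this example; actual corpus files carry far more metadata, and the Java tools distributed with the corpus are the supported parser.

```python
import xml.etree.ElementTree as ET

# Invented, simplified NITF-style document used only for illustration.
SAMPLE_NITF = """
<nitf>
  <head>
    <title>Example Headline</title>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
      <byline>By A. Reporter</byline>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>
"""


def parse_article(xml_text):
    """Return the headline, byline and body paragraphs of one document."""
    root = ET.fromstring(xml_text.strip())
    # Take the first matching element, or None if the element is absent.
    headline = next((el.text for el in root.iter("hl1")), None)
    byline = next((el.text for el in root.iter("byline")), None)
    # Collect paragraphs only from blocks marked as full text (assumed value).
    paragraphs = [
        p.text or ""
        for block in root.iter("block")
        if block.get("class") == "full_text"
        for p in block.iter("p")
    ]
    return {"headline": headline, "byline": byline, "paragraphs": paragraphs}


article = parse_article(SAMPLE_NITF)
print(article["headline"])
print(article["byline"])
print(len(article["paragraphs"]))
```

Missing elements come back as None or an empty list rather than raising an error, which is convenient when scanning a large collection in which not every article has every field.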
Samples
Presented below is an image of the abstract of one article.
Updates
A revised manual is now available.
Content Copyright
Portions © 1987-2008 New York Times, © 2008 Trustees of the University
of Pennsylvania