Introduction
The 2001 Topic Annotated Enron Email Data Set contains 4,936 emails
from Enron Corporation (Enron), manually indexed into 32 topics. It is
a subset of the original Enron Email Data Set of 1.5 million emails that was
posted on the Federal Energy Regulatory Commission website as a matter of
public record during
the investigation of Enron. The original set suffered from document integrity
problems; attempts were made to improve the quality of the data and to remove
some sensitive and private information. Dr. William Cohen of Carnegie
Mellon University took the lead in distributing the improved corpus, consisting
of 517,431 Enron employee emails that covered the period 1999-2002.
This corpus is a subset of the Carnegie Mellon data set and covers the period
from January 2001 to December 2001. The email topics reflect the business activities
and interests of Enron employees in that year: California energy problems and
the subsequent state and Federal investigations, Enron's downfall (newsfeeds
and interoffice communications), Enron's venture with the Dabhol India Power
Company, EnronOnline (Enron's trading infrastructure), competitors (Dynegy,
El Paso Pipeline), and even fantasy football and college football. Duplicates,
very short emails, and emails that represent a message type rather than a
topic (personnel memos and personal quips) have been eliminated from this data set. The
manual indexing was performed in the summer of 2006 by two people who worked
closely together: a research associate familiar with the Enron saga and a junior
in economics at the University of Tennessee.
The original Enron Email Data Set is the first large email set made available
to researchers, but until now there has been no way to assess the performance
of topic detection and tracking algorithms on it. Having an annotated
subset such as this one should provide text mining researchers with a way to
evaluate the accuracy of new algorithms for clustering and classification. This
data set can also be used to provide communication context for researchers using
the Enron Email Data Set in social network analysis. Previous annotations such
as the one developed at UC Berkeley have been primarily based on email type rather than the specific
topic(s) of discussion. This annotation can be used to qualify the discussion
topics between individuals and groups comprising a social network of Enron employees.
Due to the complexity of this corpus's directory structure, it is distributed as a compressed tar file on a CD. Most compression utilities can uncompress the package.
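For readers who prefer to unpack the archive programmatically, the following is a minimal sketch using Python's standard tarfile module. The function name extract_corpus and any file names passed to it are hypothetical; the actual archive name and internal directory layout are not described here and should be taken from the distribution itself.

```python
import shutil
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir):
    """Unpack a (possibly compressed) tar archive into dest_dir.

    Hypothetical helper: the archive name and its internal layout are
    assumptions, not taken from the corpus documentation.
    Returns the sorted list of regular-file paths found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()

    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz, or none).
    with tarfile.open(archive_path, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse entries that would be written outside dest_dir.
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
            elif member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                # Copy the file contents out of the archive.
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
            # Links and special files are skipped in this sketch.

    return sorted(m.name for m in members if m.isfile())
```

Calling extract_corpus with the path to the tar file and an output directory preserves the archive's directory structure under that directory and returns the list of extracted files, which can serve as a quick check that the unpacked email count matches expectations.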
Updates
An update is available via web download. This update corrects a small error in the subjection annotation file. Those members and customers who received this publication prior to Aug 13, 2007 should download this correction. All copies issued since this date have been corrected.
Content Copyright
Portions © 2006, 2007 Dr. Michael W. Berry, © 2007 Trustees of the University of Pennsylvania