LDC93T3A - Complete TIPSTER corpus
LDC93T3B - Volume 1 of the TIPSTER corpus
LDC93T3C - Volume 2 of the TIPSTER corpus
LDC93T3D - Volume 3 of the TIPSTER corpus
The TIPSTER project is sponsored by the Software and Intelligent
Systems Technology Office of the Advanced Research Projects Agency
(ARPA/SISTO) in an effort to significantly advance the state of the
art in effective document detection (information retrieval) and data
extraction from large, real-world data collections.
The detection data consists of a new test collection built at NIST
for use in both the TIPSTER project and the related TREC project.
Many other information retrieval research groups participate in TREC;
they work on the same task as the TIPSTER groups but meet once a year
in a workshop to compare results (similar to MUC).
The test collection consists of three CD-ROMs of SGML-encoded documents
distributed by LDC plus queries and answers (relevant documents)
distributed by NIST.
Source              Year    Approx. # Words (Millions)
San Jose Mercury    1991    45
The documents in the test collection are varied in style, size and
subject domain. The first disk contains material from the Wall Street
Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal
Register (1989), information from Computer Select disks (Ziff-Davis
Publishing) and short abstracts from the Department of Energy. The
second disk contains information from the same sources, but from
different years. The third disk contains more information from the
Computer Select disks, plus material from the San Jose Mercury News
(1991), more AP Newswire (1990) and about 250 megabytes of formatted
U.S. Patents. The format of all the documents is relatively clean and
easy to use, with SGML-like tags separating documents and document
fields. There is no part-of-speech tagging or breakdown into
individual sentences or paragraphs, as the purpose of this collection
is to test retrieval against real-world data.
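
As a rough illustration of working with SGML-like tags that separate
documents and document fields, the following Python sketch splits a
string into documents and collects each document's fields. The tag
names DOC, DOCNO and TEXT, the sample text, and the helper name
split_documents are assumptions made for this example only; the tags
actually used on the disks should be checked against the data itself.

```python
import re

# Assumption: documents are wrapped in <DOC>...</DOC> and fields in
# upper-case tags such as <DOCNO> and <TEXT>. These names are
# illustrative and are not taken from the corpus description above.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<([A-Z][A-Z0-9]*)>(.*?)</\1>", re.DOTALL)


def split_documents(raw):
    """Return one dict of {field name: stripped field text} per document."""
    docs = []
    for match in DOC_RE.finditer(raw):
        fields = {
            name: body.strip()
            for name, body in FIELD_RE.findall(match.group(1))
        }
        docs.append(fields)
    return docs


# Made-up sample input in the assumed layout.
sample = """<DOC>
<DOCNO> EX-0001 </DOCNO>
<TEXT>
First made-up document.
</TEXT>
</DOC>
<DOC>
<DOCNO> EX-0002 </DOCNO>
<TEXT>
Second made-up document.
</TEXT>
</DOC>
"""

docs = split_documents(sample)
```

Regular expressions are used here only because the markup is described
as relatively clean; nested or irregular fields would likely need a
more tolerant parser.
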
The three TIPSTER disks released so far have been reissued with
updates and corrections, and all recipients of the earlier versions
should have received these replacements free of charge. If you think
you have an unrevised original, contact LDC for confirmation.