SummBank 1.0

Item Name: SummBank 1.0
Authors: Dragomir Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Arda Celebi, Elliott Drabek, Danyu Liu, Hong Qi, and Tim Allison
LDC Catalog No.: LDC2003T16
ISBN: 1-58563-274-0
Release Date: Dec 18, 2003
Data Type: text
Data Source(s): government documents
Application(s): cross-lingual information retrieval, summarization
Language(s): English, Yue Chinese
Language ID(s): eng, yue
Distribution: 4 DVD
Member fee: $0 for 2003 members
Non-member Fee: N/A (Members Only)
Reduced-License Fee: N/A
Extra-Copy Fee: US $800.00
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Dragomir Radev, et al. 2003. SummBank 1.0. Linguistic Data Consortium, Philadelphia.

Introduction

SummBank 1.0 was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2003T16 and its ISBN is 1-58563-274-0.

SummBank 1.0 contains the data created for the Summer 2001 Johns Hopkins Workshop, which focused on text summarization in a cross-lingual information retrieval framework. For more information about the Johns Hopkins summer workshop on Text Summarization, please visit its website. The goal of the corpus is to gather original documents and summaries that the document summarization community can use as gold standards.

The source data consists of 18,147 aligned bilingual (Cantonese and English) article pairs from the Information Services Department of the Hong Kong Special Administrative Region of the People's Republic of China, which were published by the LDC in 2000 as Hong Kong News Parallel Text.

Data

This corpus contains 40 news clusters in English and Chinese, 360 multi-document, human-written, non-extractive summaries, and nearly two million single-document and multi-document extracts created by automatic and manual methods. The summarizer that was reimplemented and upgraded during the workshop is called MEAD; updated versions of the software are available from the MEAD website.

This distribution includes roughly two million text files, totalling approximately 13 GB uncompressed. English text files are encoded as UTF-8; Chinese text files are encoded as either GB or Big-5.
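Because the text files use different encodings, a reader needs to decode each file with the right codec. The following is a minimal Python sketch, not part of the distribution. The names `CODECS`, `decode_corpus_bytes` and `read_corpus_file` are hypothetical, and the mapping from the labels "GB" and "Big-5" to Python's `gb2312` and `big5` codecs is an assumption, since this entry does not say which GB variant is used or how a file's encoding is recorded.

```python
from pathlib import Path

# Map the encoding labels from the catalog entry to Python codec names.
# Assumption: "GB" means GB2312 and "Big-5" means plain Big5; the corpus
# documentation should be checked (gb18030 or big5hkscs may be needed).
CODECS = {"utf-8": "utf-8", "GB": "gb2312", "Big-5": "big5"}


def decode_corpus_bytes(raw: bytes, label: str) -> str:
    """Decode raw bytes strictly with the codec registered for `label`."""
    # errors="strict" raises UnicodeDecodeError instead of silently
    # producing garbled text when the wrong encoding is chosen.
    return raw.decode(CODECS[label], errors="strict")


def read_corpus_file(path, label: str) -> str:
    """Read one text file and decode it with the given encoding label."""
    return decode_corpus_bytes(Path(path).read_bytes(), label)
```

The caller passes the encoding explicitly instead of guessing it. Trial decoding cannot reliably tell GB from Big-5, because many byte sequences are valid in both.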

Updates

Additional information, updates, and bug fixes may be available on the SummBank website.

Content Copyright

Portions © 1997-2000 The Government of the Hong Kong Special Administrative Region (HKSAR), © 2000, 2003 Trustees of the University of Pennsylvania

Pricing

The Reduced Licensing Fee for this corpus is US$800.



Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.