


Penn Discourse Treebank Version 2.0

Item Name: Penn Discourse Treebank Version 2.0
Authors: Rashmi Prasad, Alan Lee, Nikhil Dinesh, Eleni Miltsakaki, Geraud Campion, Aravind Joshi, Bonnie Webber
LDC Catalog No.: LDC2008T05
ISBN: 1-58563-466-2
Release Date: Feb 18, 2008
Data Type: text
Data Source(s): newswire
Application(s): discourse analysis, discourse parsing, information extraction, information retrieval, language generation, subjectivity analysis, summarization
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2008 members
Non-member Fee: US$1000.00
Reduced-License Fee: N/A
Extra-Copy Fee: US$
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Rashmi Prasad, et al. 2008. Penn Discourse Treebank Version 2.0. Linguistic Data Consortium, Philadelphia.


Introduction

The Penn Discourse Treebank (PDTB) is an NSF-funded project at the University of Pennsylvania. The goal of the project is to annotate the 1-million-word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in the text, which serve as the arguments to the relation. Discourse relations are assumed to have exactly two arguments. PDTB version 2.0 is a continuation of PDTB version 1.0 (made freely available in 2006 but no longer available).

Following a lexically grounded approach to annotation, the PDTB annotates relations realized explicitly by Explicit connectives drawn from syntactically well-defined classes, as well as relations between adjacent sentences when no Explicit connective appears to relate the two. Arguments of relations are annotated in each case. For Explicit connectives, arguments are unconstrained in terms of their distance from the connective and can be found anywhere in the text.

Between adjacent sentences where no Explicit connective appears, one of four scenarios holds:

  • the sentences may be related by a discourse relation that has no realization in the second sentence, in which case a connective (called an Implicit connective) is provided to express the inferred relation
  • the sentences may be related by a discourse relation that is realized by some "alternative" non-connective expression, in which case these alternative lexicalizations are annotated as the carriers of the relation (labelled as "AltLex")
  • the sentences may be related not by a discourse relation, but merely by an "entity-based" coherence relation, in which case the presence of such a relation is labelled (as "EntRel")
  • the sentences may not be related at all, in which case they are labelled as such ("NoRel")
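The four scenarios for adjacent sentences can be read as a small decision procedure. The sketch below is illustrative only: the function name and its three boolean inputs are hypothetical stand-ins for judgments made by a human annotator, and nothing here is part of the PDTB distribution or its tools.

```python
from enum import Enum


class RelationType(Enum):
    """The five relation types named in the PDTB 2.0 description."""
    EXPLICIT = "Explicit"
    IMPLICIT = "Implicit"
    ALTLEX = "AltLex"
    ENTREL = "EntRel"
    NOREL = "NoRel"


def label_adjacent_pair(has_discourse_relation: bool,
                        has_alternative_lexicalization: bool,
                        has_entity_coherence: bool) -> RelationType:
    """Label two adjacent sentences that are NOT linked by an Explicit connective.

    The three flags are hypothetical inputs standing in for annotator judgments.
    """
    if has_discourse_relation:
        # A discourse relation holds: it is either carried by a non-connective
        # expression (AltLex) or an Implicit connective is supplied.
        if has_alternative_lexicalization:
            return RelationType.ALTLEX
        return RelationType.IMPLICIT
    if has_entity_coherence:
        # No discourse relation, only entity-based coherence.
        return RelationType.ENTREL
    # The sentences are not related at all.
    return RelationType.NOREL
```

For example, `label_adjacent_pair(False, False, True)` yields `RelationType.ENTREL`.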

In addition to the argument structure of relations, the PDTB provides (a) sense annotations for each discourse relation while also capturing the polysemy of connectives, and (b) attribution annotations of relations and each of their arguments, with each instance of attribution providing the corresponding text span along with four features to capture the semantic contribution of the attribution. Both sense and attribution annotations are provided for Explicit, Implicit, and AltLex relations, but not for EntRel and NoRel.
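One possible way to model these annotation layers in code is sketched below. The class and function names are hypothetical. The four attribution features are kept as an unnamed 4-tuple because this page does not name them; the comma-separated format is assumed from feature lines such as `Wr, Comm, Null, Null` in the sample further down.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Relation types that carry sense and attribution annotations,
# per the description above (EntRel and NoRel do not).
TYPES_WITH_SENSE_AND_ATTRIBUTION = frozenset({"Explicit", "Implicit", "AltLex"})


@dataclass(frozen=True)
class Attribution:
    """Hypothetical container: four feature values plus the attribution text span."""
    features: Tuple[str, str, str, str]
    span_text: Optional[str] = None


def carries_sense_and_attribution(relation_type: str) -> bool:
    """Return True if the relation type is annotated with sense and attribution."""
    return relation_type in TYPES_WITH_SENSE_AND_ATTRIBUTION


def parse_attribution_features(line: str) -> Tuple[str, str, str, str]:
    """Split a feature line (assumed comma-separated) into exactly four values."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 attribution features, got {len(parts)}")
    return (parts[0], parts[1], parts[2], parts[3])
```

A record for the first argument in the sample could then be built as `Attribution(parse_attribution_features("Ot, Comm, Null, Null"), "its latest survey indicated")`.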

The lexically grounded approach in the PDTB exposes a clearly defined level of discourse structure which will support the extraction of a range of inferences associated with discourse connectives.

To date, the PDTB group has carried out various experiments on the corpus, particularly examining the following issues:

  • alignment between syntax and discourse, particularly with regard to attribution
  • sense disambiguation of discourse connectives
  • complexity of dependencies in discourse

The annotations in Penn Discourse Treebank Version 2.0 are linked to the Penn Treebank.

The PDTB group will continue to explore these issues and to focus on more extended projects such as discourse parsing, automatic summarization, and natural language generation. Further work will also explore foundational issues in discourse.

PDTB version 2.0 annotates 40,600 discourse relations, distributed into the following five types:

  • 18459 Explicit Relations
  • 16053 Implicit Relations
  • 624 Alternative Lexicalizations
  • 5210 Entity Relations
  • 254 No Relations
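As a quick arithmetic check, the five counts above do sum to the stated total. The short sketch below also derives each type's share of the corpus; the variable names are illustrative, and the percentages are computed from the listed counts rather than quoted from the PDTB documentation.

```python
# Relation counts as listed above.
relation_counts = {
    "Explicit": 18459,
    "Implicit": 16053,
    "AltLex": 624,
    "EntRel": 5210,
    "NoRel": 254,
}

total = sum(relation_counts.values())
assert total == 40600  # matches the stated total

# Share of each relation type, as a percentage rounded to one decimal place.
shares = {name: round(100 * count / total, 1)
          for name, count in relation_counts.items()}

for name, share in shares.items():
    print(f"{name:<8} {relation_counts[name]:>6}  {share:>5.1f}%")
```

Explicit and Implicit relations together account for roughly 85% of the annotated relations.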

Samples

For an example of the data in this corpus, please review the sample below:

________________________________________________________
____Explicit____
544..551
4,2
#### Text ####
however
##############
#### Features ####
Wr, Comm, Null, Null
however, Comparison.Contrast
____Sup1____
374..515
2;3
#### Text ####
Its index inched up to 47.6% in October from 46% in September.
Any reading below 50% suggests the manufacturing sector is generally declining
##############
____Arg1____
288..372
1,3,1,1,1,1
#### Text ####
that the manufacturing economy contracted in October for the sixth consecutive month
##############
#### Features ####
Ot, Comm, Null, Null
260..287
1,3,1,0;1,3,1,1,0;1,3,1,1,1,0
#### Text ####
its latest survey indicated
##############
____Arg2____
563..624
4,5,1
#### Text ####
that orders turned up in October after four months of decline
##############
#### Features ####
Ot, Comm, Null, Null
519..542;553..562
4,0;4,1;4,3;4,4;4,5,0;4,6
#### Text ####
The purchasing managers also said
##############
________________________________________________________
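The two numeric lines that follow each section header in the sample can be parsed with a few lines of code. The sketch below is a minimal illustration, not an official reader. It assumes that the first line (for example `519..542;553..562`) is a semicolon-separated list of start..end character offsets into the raw text, and that the second line (for example `4,0;4,1;4,3`) is a semicolon-separated list of comma-separated node addresses into the linked Penn Treebank parse. This page does not define either field, so both readings and all function names are assumptions.

```python
from typing import List, Tuple


def parse_span_list(text: str) -> List[Tuple[int, int]]:
    """Parse 'start..end;start..end' into a list of (start, end) integer pairs."""
    spans = []
    for part in text.strip().split(";"):
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return spans


def parse_address_list(text: str) -> List[Tuple[int, ...]]:
    """Parse '4,0;4,5,0' into a list of integer tuples, one per address."""
    return [tuple(int(n) for n in part.split(","))
            for part in text.strip().split(";")]


def span_length(spans: List[Tuple[int, int]]) -> int:
    """Total characters covered, treating each end offset as exclusive (assumed)."""
    return sum(end - start for start, end in spans)


connective_spans = parse_span_list("544..551")
attribution_spans = parse_span_list("519..542;553..562")
attribution_nodes = parse_address_list("4,0;4,1;4,3;4,4;4,5,0;4,6")
```

In the sample, the end-exclusive reading is consistent with the text: `span_length(connective_spans)` is 7, the length of "however", and the two attribution spans cover 23 and 9 characters, matching "The purchasing managers" and "also said".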

Content Copyright

Portions © 1989 Dow Jones & Company, Inc., © 2008 The Penn Discourse Treebank Group, © 2008 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.