


Chinese Proposition Bank 1.0

Item Name: Chinese Proposition Bank 1.0
Authors: Martha Palmer, Nianwen Xue, Zixin Jiang, and Meiyu Chang
LDC Catalog No.: LDC2005T23
ISBN: 1-58563-354-2
Release Date: Sep 20, 2005
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): natural language processing
Language(s): Mandarin Chinese
Distribution: Web Download
Member fee: $0 for 2005 members
Non-member Fee: US$750.00
Reduced-License Fee: US$375.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription Members, Standard Members, Non-Members
Citation: Martha Palmer, et al. 2005. Chinese Proposition Bank 1.0. Linguistic Data Consortium, Philadelphia.

Introduction

Chinese Proposition Bank 1.0 was produced by the Linguistic Data Consortium (LDC) and is released under catalog number LDC2005T23 and ISBN 1-58563-354-2.

Chinese Proposition Bank 1.0 is the first public release of the Penn Chinese Proposition Bank project, which aims to create a corpus of text annotated with information about basic semantic propositions. Specifically, predicate-argument relations have been added to the syntactic trees of the first update to Chinese Treebank 5.0 as an additional layer of annotation.

Data

Chinese Proposition Bank 1.0 includes annotations for files chtb_001.fid to chtb_931.fid, or the first 250K words of the first update of Chinese Treebank 5.0. There are 37,183 propositions in total. Auxiliary verbs are not annotated. Some verbs have both light-verb and non-light-verb uses; in these cases only the non-light-verb uses are annotated. All the annotations in this release are the result of double-blind annotation followed by adjudication of differences.

The following table summarizes the framesets in CPB 1.0:

Total verbs framed               4,865
Total framesets                  5,298
Verbs with multiple framesets      351
Average framesets per verb        1.09

Annotation Format

Each predicate-argument (P-A) structure is represented as one line of space-separated columns. The columns are as follows:

  ctb-filename sentence terminal tagger frameset inflection arglabel arglabel ...

The content of each column is described in detail below.

ctb-filename
	the name of the file in the Penn Chinese TreeBank 5.0 update 1
    
sentence
	the number of the sentence in the file (starting with 0)
    

terminal
	the number of the terminal in the sentence that is the location of the
	verb. Note that the terminal number counts empty constituents as
	terminals and starts with 0.  This will hold for all references to
	terminal number in this description.

    An example:  

    (IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,) 
    (VP (PP-BNF (P 为)(IP (NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全)) 
    (NP (NN 区)))(NP (NN 物价))))))(VP (VV 发挥)(AS 了)(NP-OBJ (NN 作用)))) (PU 。))
        
    The terminal numbers:
    货币 0 回笼 1 的 2 增加 3 , 4 为 5 *PRO* 6 平抑 7 全 8 区 9 物价 10 发挥 11
    了 12 作用 13 。 14
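
    The numbering above can be reproduced with a short script. The following
    is a minimal sketch, assuming the tree is available as a bracketed string
    in which every terminal appears as a "(TAG word)" pair; the function name
    terminals is chosen here for illustration and is not part of the release.

```python
import re

def terminals(tree: str) -> list:
    """Return the terminals of a bracketed tree in order.

    Empty constituents such as (-NONE- *PRO*) are kept, because the
    terminal numbers in the annotation count them as terminals.
    """
    # A terminal is the second token of a "(TAG word)" pair, i.e. a tag
    # followed by a token that contains no parentheses.
    return re.findall(r"\(\s*[^\s()]+\s+([^\s()]+)\s*\)", tree)

tree = (
    "(IP (NP-SBJ (DNP (NP (NN 货币)(NN 回笼))(DEG 的))(NP (NN 增加)))(PU ,) "
    "(VP (PP-BNF (P 为)(IP (NP-SBJ (-NONE- *PRO*))(VP (VV 平抑)(NP-OBJ (NP (DP (DT 全)) "
    "(NP (NN 区)))(NP (NN 物价))))))(VP (VV 发挥)(AS 了)(NP-OBJ (NN 作用)))) (PU 。))"
)

leaves = terminals(tree)
# Numbering starts with 0, so enumerate() gives the terminal numbers directly.
for number, word in enumerate(leaves):
    print(number, word)
```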
        

tagger
    the name of the annotator, or "gold" if it's been double annotated and adjudicated.
    
frameset

    The frameset identifier from the frames file of the verb.  For
    example, '发挥.01' refers to the frameset ID "f1" in the frame file for the
    verb '发挥' (frames/0930-fa-hui.xml). The names of the frame files are
    composed of numerical id, plus the pinyin of the verb. The numerical ids
    can be found in the enclosed verb list (verbs.txt). 
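
    As an illustration, the correspondence between a frameset identifier and
    the frame file can be sketched as follows. This generalizes from the single
    example above and rests on two assumptions: that the numeric suffix ".NN"
    always maps to the frameset ID "fN", and that pinyin syllables in the file
    name are joined by hyphens. The helper names are hypothetical.

```python
def frame_id(frameset: str) -> tuple:
    """Split an identifier such as '发挥.01' into (verb, frameset ID)."""
    verb, _, number = frameset.rpartition(".")
    if not verb or not number.isdigit():
        raise ValueError(f"not a frameset identifier: {frameset!r}")
    # Assumption: '.01' corresponds to 'f1', '.02' to 'f2', and so on.
    return verb, f"f{int(number)}"

def frame_file(numeric_id: str, pinyin_syllables: list) -> str:
    """Build a frame file path from the numeric id and the verb's pinyin."""
    # Assumption: syllables are hyphen-joined, as in frames/0930-fa-hui.xml.
    return f"frames/{numeric_id}-{'-'.join(pinyin_syllables)}.xml"

print(frame_id("发挥.01"))
print(frame_file("0930", ["fa", "hui"]))
```

    The numeric id itself still has to be looked up in verbs.txt; the layout of
    that file is not described here, so no lookup is attempted in the sketch.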

        
inflection
    The inflection field is a carry-over from the Penn English Proposition
    Bank, and is set to '-----', meaning no annotation in the Chinese
    Proposition Bank.
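
    The six fixed columns and the trailing arglabel columns can be separated
    with a plain whitespace split. The sketch below is illustrative only: the
    class and function names are invented here, and the example line is
    constructed by hand from the sample sentence in the "terminal" description
    (the file name, sentence number and argument labels are hypothetical, not
    copied from the corpus).

```python
from typing import List, NamedTuple

class Proposition(NamedTuple):
    ctb_filename: str
    sentence: int         # sentence number in the file, starting with 0
    terminal: int         # terminal number of the verb, starting with 0
    tagger: str           # annotator name, or "gold"
    frameset: str         # e.g. '发挥.01'
    inflection: str       # '-----' in the Chinese Proposition Bank
    arglabels: List[str]  # one string per argument, adjunct or 'rel'

def parse_line(line: str) -> Proposition:
    cols = line.split()
    if len(cols) < 7:
        raise ValueError("expected six fixed columns and at least one arglabel")
    return Proposition(cols[0], int(cols[1]), int(cols[2]),
                       cols[3], cols[4], cols[5], cols[6:])

# Hypothetical line, assembled for illustration only.
example = "chtb_001.fid 0 11 gold 发挥.01 ----- 11:0-rel 0:3-ARG0 5:1-ARGM-BNF 13:1-ARG1"
prop = parse_line(example)
print(prop.frameset, prop.terminal, prop.arglabels)
```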

arglabel

    A string representing the annotation associated with a particular argument
    or adjunct of the proposition.  Each arglabel is dash '-' delimited and
    has the following columns

  1) column for the address of a constituent
  
    The address of the constituent takes one of the following three forms.
    
    form 1: terminal number:height
      A single node in the syntactic tree of the sentence in question, identified
      by the first terminal the node spans together with the height from that
      terminal to the syntax node (a height of 0 represents a terminal).

      For example,  in the sentence
      

    (IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) 
    (VP (ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业))) 
    (NP-ADV (NN 绝大部分))(NP-SBJ (NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较)) 
    (VP (VA 好)))(PU 。))


        the address "0:3" represents the top IP node and "1:2" represents the outer CP node
        (terminal numbers start with 0)
        
    form 2: terminal number:height*terminal number:height*...
      
      A trace chain identifying coreference within sentence boundaries.

      For example in the sentence

     (IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) 
     (VP (ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业))) 
     (NP-ADV (NN 绝大部分))(NP-SBJ (NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较)) 
     (VP (VA 好)))(PU 。))

      the address "2:0*1:0*6:1" represents the fact that the nodes '2:0' (-NONE-
      *T*-1), '1:0' (-NONE- *OP*) and '6:1' (NP (NN 外商)(NN 投资)(NN 企业))
      are coreferential.
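
      Resolving such addresses against the tree can be sketched as below. This
      is a minimal illustration, not the tooling shipped with the corpus: the
      Node class and the functions parse_tree, preterminals, resolve and
      bracket are names invented here. Height 0 is taken to be the
      "(TAG word)" node of the terminal, and each further step moves to the
      parent node.

```python
import re

class Node:
    def __init__(self, label):
        self.label = label
        self.word = None      # set only on "(TAG word)" nodes
        self.children = []
        self.parent = None

def parse_tree(text):
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0
    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = Node(tokens[pos])
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child = read()
                child.parent = node
                node.children.append(child)
            else:
                node.word = tokens[pos]
                pos += 1
        pos += 1                      # skip ")"
        return node
    return read()

def preterminals(node):
    """All "(TAG word)" nodes in order, empty constituents included."""
    if node.word is not None:
        return [node]
    return [leaf for child in node.children for leaf in preterminals(child)]

def resolve(root, address):
    """Map 'terminal:height' to a node by climbing `height` parents."""
    terminal, height = (int(part) for part in address.split(":"))
    node = preterminals(root)[terminal]
    for _ in range(height):
        node = node.parent
    return node

def bracket(node):
    if node.word is not None:
        return f"({node.label} {node.word})"
    return f"({node.label} " + "".join(bracket(c) for c in node.children) + ")"

root = parse_tree(
    "(IP (NP-TPC (DP (DT 这些))(CP (WHNP-1 (-NONE- *OP*)) (CP (IP (NP-SBJ (-NONE- *T*-1)) "
    "(VP (ADVP (AD 已))(VP (VV 开业))))(DEC 的)))(NP (NN 外商)(NN 投资)(NN 企业))) "
    "(NP-ADV (NN 绝大部分))(NP-SBJ (NN 生产)(NN 经营)(NN 状况))(VP (ADVP (AD 较)) "
    "(VP (VA 好)))(PU 。))"
)

# A trace chain is a '*'-separated list of single-node addresses.
chain = [bracket(resolve(root, a)) for a in "2:0*1:0*6:1".split("*")]
print(chain)
```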

    form 3: terminal number:height,terminal number:height,...
      
      This represents a collection of different pieces of one argument. This
      form is rarely used in the annotation of the verbs, since most
      discontinuous constituents have well-defined relations between their components.
      Therefore the components of a discontinuous constituent are assigned the
      same label with a secondary tag representing their semantic
      relations. For example, if a constituent is marked as ARG0-CRD, it means
      that there is another constituent having the same label and together they
      fill the ARG0 role of the verb.
       
 
  2) column for the 'label'
  
    The argument label is one of {rel, ARGM} + {ARG0, ARG1, ARG2,
    ...}.  The numbered argument labels correspond to the argument labels in the
    frames files (see ./frames).  ARGM is used for adjuncts of various sorts, and
    'rel' refers to the surface string of the verb.

  3) column for 'functional tag' (optional for numbered arguments; required for ARGM)

    Functional tags for "split" numbered arguments:

    PSR - possessor
    PSE - possessee
    CRD - coordinator
    PRD - predicate
    QTY - quantity

    Prepositional tags for numbered arguments:
    AT, AS, INTO, TOWARDS, TO, ONTO

    Functional tags for ARGM:

    ADV - adverbial, default tag
    BNF - beneficiary
    CND - conditional
    DIR - directional 
    DIS - discourse
    DGR - degree
    EXT - extent
    FRQ - frequency
    LOC - location
    MNR - manner
    NEG - negation
    PRP - purpose and reason
    TMP - temporal
    TPC - topic
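
    Putting the three arglabel columns together, a single arglabel can be
    decomposed as sketched below. This is an illustration under stated
    assumptions rather than a reference parser: the names ArgLabel and
    parse_arglabel are invented here, the address is assumed never to contain
    a dash, and ',' is assumed to bind more loosely than '*' if both occur in
    one address. The sample labels are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

class ArgLabel(NamedTuple):
    # Outer list: ','-separated pieces of one argument (form 3).
    # Inner list: '*'-separated trace chain of (terminal, height) pairs.
    pieces: List[List[Tuple[int, ...]]]
    label: str            # 'rel', 'ARGM' or a numbered argument
    tag: Optional[str]    # functional tag, if present

def parse_arglabel(text: str) -> ArgLabel:
    address, _, rest = text.partition("-")
    label, _, tag = rest.partition("-")
    if label not in ("rel", "ARGM") and not re.fullmatch(r"ARG\d+", label):
        raise ValueError(f"unknown argument label in {text!r}")
    if label == "ARGM" and not tag:
        # The functional tag is optional for numbered arguments
        # but required for ARGM.
        raise ValueError(f"ARGM without a functional tag in {text!r}")
    pieces = [
        [tuple(int(n) for n in pointer.split(":")) for pointer in piece.split("*")]
        for piece in address.split(",")
    ]
    return ArgLabel(pieces, label, tag or None)

for sample in ("11:0-rel", "5:1-ARGM-BNF", "2:0*1:0*6:1-ARG0"):
    print(parse_arglabel(sample))
```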
    

Samples

For an example of this corpus, please examine this sample xml file.

Content Copyright

Portions © 1994-1998 Xinhua News Agency, © 1996-2001 Sinorama Magazine, © 1997 The Government of the Hong Kong Special Administrative Region, © 2005 Trustees of the University of Pennsylvania



Contact: ldc@ldc.upenn.edu

(c) 1992-2008 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.