Introduction
BioProp Version 1.0 was developed by researchers at Academia
Sinica, Taipei, Taiwan. It consists of proposition bank-style annotations
for approximately 500 English biomedical journal abstracts. The source abstracts,
annotated in accordance with Penn Treebank II guidelines, are contained in the
GENIA Treebank (GTB). The GTB was developed at the Tsujii Laboratory at the
University of Tokyo.
The purpose of the GENIA Project is to develop tools and resources for the
automatic extraction of biomedical information. One result of that work is the
GENIA corpus, a collection of 2000 biomedical journal abstracts containing
semantic class annotation for biomedical terms, part-of-speech (POS) tags and
coreferences. The GTB is a subset of that corpus. BioProp Version 1.0 adds a
proposition bank to the GTB.
Proposition Bank (PropBank) contains annotations of predicate-argument structures and
semantic roles in a treebank schema in the newswire domain. To construct BioProp
Version 1.0, a semantic role labeling (SRL) system trained on PropBank was used
to annotate the GTB. SRL, also called shallow semantic parsing, is a popular
semantic analysis technique. In SRL, sentences are represented by one or more
predicate-argument structures (PAS), also known as propositions. Each PAS is
composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases)
that have different semantic roles, including main arguments such as agent and
patient, and adjunct arguments, such as time, manner and location. The term
"argument" refers to a syntactic constituent of the sentence related
to the predicate, and the term "semantic role" refers to the semantic
relationship between a sentence's predicate and argument.
To suit the needs of the biomedical domain, the PropBank annotation guidelines
were modified to characterize semantic roles as components of biological events.
Specifically, thirty verbs were selected according to their frequency of use
or importance in biomedical texts. Since the targets of information extraction are
relations between named entities, only sentences containing protein or gene names
were used to count each verb's frequency. Verbs of general usage were
filtered out in order to keep the focus on biomedical verbs. Some verbs that
do not have a high frequency but play important roles in describing biomedical
relations, such as "phosphorylate" and "transactivate,"
were also selected. The BioProp annotation was based on Levin's verb classes
as defined in the VerbNet lexicon. In VerbNet, the arguments of each verb are
represented at the semantic
level, and thus have associated semantic roles. However, since some verbs may
have different usages in biomedical and newswire texts, it is necessary to customize
the framesets of biomedical verbs. After selecting the predicate verbs, a semi-automatic
method was used to annotate BioProp. The annotation process consisted of the
following steps:
- Identification of predicate candidates
- Automatic annotation of the biomedical semantic roles using a newswire SRL system
- Transformation of automatic tagging results into WordFreak format
- Review by human annotators
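The verb selection described above (frequency counted only over sentences that mention a protein or gene, general-usage verbs filtered out, a few important low-frequency verbs added by hand) can be illustrated with a minimal Python sketch. This is not the tooling used to build BioProp. The function names, the input shape (a list of verb lemmas plus a flag per sentence), and the toy data are all hypothetical and serve only to make the procedure concrete.

```python
from collections import Counter

def count_verbs_in_entity_sentences(sentences):
    """Count verb lemmas, using only sentences flagged as containing
    a protein or gene name.

    `sentences` is a hypothetical input: an iterable of
    (verb_lemmas, has_protein_or_gene) pairs.
    """
    counts = Counter()
    for verb_lemmas, has_protein_or_gene in sentences:
        if has_protein_or_gene:
            counts.update(verb_lemmas)
    return counts

def select_verbs(counts, general_verbs, extra_verbs, top_n):
    """Keep the top_n most frequent verbs that are not general-usage
    verbs, then append hand-picked verbs that are important but rare."""
    ranked = [verb for verb, _ in counts.most_common()
              if verb not in general_verbs]
    selected = ranked[:top_n]
    for verb in extra_verbs:
        if verb not in selected:
            selected.append(verb)
    return selected

# Toy data, invented for illustration only.
toy_sentences = [
    (["bind", "be"], True),
    (["bind", "induce"], True),
    (["increase"], False),   # no protein/gene name: ignored
    (["be"], True),
]
toy_counts = count_verbs_in_entity_sentences(toy_sentences)
toy_selection = select_verbs(
    toy_counts,
    general_verbs={"be"},
    extra_verbs=["phosphorylate"],
    top_n=2,
)
```

In this sketch the general-usage filter is a plain set lookup and the "importance" criterion is a manually supplied list; how those decisions were actually made for BioProp is not specified here.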
Data
BioProp Version 1.0 consists of approximately 150,000 words. Each line in the
corpus provides a PAS annotation that can be mapped to a sentence in the GTB.
Samples
91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC
91094881 3 142:152 stimulate 0:46-ARG0 49:139-ARGM-TMP 142:152-rel 153:166-ARG1 167:217-ARGM-LOC
91094881 6 88:98 stimulate 0:55-ARGM-ADV 58:87-ARG0 88:98-rel 99:112-ARG1 113:168-ARGM-LOC
91094881 8 217:222 bind 160:183-ARG1 184:210-C-ARG1 211:216-R-ARG1 223:247-ARG2 217:222-rel 248:275-ARGM-ADV
91094881 9 45:53 suppress 0:13-ARGM-ADV 16:38-ARG0 54:78-ARG1 39:44-ARGM-MOD 45:53-rel 79:105-C-ARG1 106:135-ARGM-LOC
91094881 10 49:56 block 0:8-ARGM-DIS 11:44-ARG1 49:56-rel 57:82-ARG0 83:115-ARGM-LOC
91101115 2 99:108 increase 0:98-ARG1 99:108-rel 109:152-ARGM-CAU
91101115 3 159:163 bind 119:153-ARG1 164:191-ARG2 154:158-R-ARG1 159:163-rel
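As a reading aid, the sample lines above can be parsed with a short Python sketch. The field meanings are inferred from the samples and are assumptions, not a documented specification: the first field is taken to be an abstract identifier, the second a sentence index, the third the predicate span as `start:end`, the fourth the verb lemma, and each remaining field a `start:end-LABEL` pair. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Argument:
    start: int
    end: int
    label: str  # e.g. "ARG0", "rel", "ARGM-LOC", "C-ARG1"

@dataclass
class Proposition:
    doc_id: str                     # assumed: abstract identifier
    sentence_index: int             # assumed: sentence number in the abstract
    predicate_span: Tuple[int, int]
    lemma: str
    arguments: List[Argument]

def parse_span(text: str) -> Tuple[int, int]:
    """Split a 'start:end' string into two integers."""
    start, end = text.split(":")
    return int(start), int(end)

def parse_line(line: str) -> Proposition:
    """Parse one whitespace-separated annotation line."""
    fields = line.split()
    doc_id, sentence_index, predicate_span, lemma = fields[:4]
    arguments = []
    for field in fields[4:]:
        # Split on the first hyphen only, so that labels such as
        # 'ARGM-LOC' and 'C-ARG1' stay intact.
        span, label = field.split("-", 1)
        start, end = parse_span(span)
        arguments.append(Argument(start, end, label))
    return Proposition(doc_id, int(sentence_index),
                       parse_span(predicate_span), lemma, arguments)

sample = ("91079577 4 74:82 induce 0:65-ARG0 74:82-rel "
          "83:99-ARG1 100:113-ARGM-LOC")
prop = parse_line(sample)
```

Whether the spans are character or token offsets, and how they map onto GTB sentences, should be confirmed against the corpus documentation before the parsed values are used.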
Content Copyright
Portions © 2006-2008 Academia Sinica, © 2009 Trustees of the University
of Pennsylvania