The PennBioIE Oncology Corpus consists of 1414 PubMed abstracts on cancer, concentrating on molecular genetics. It comprises approximately 327,000 words of biomedical text, tokenized and annotated for paragraph, sentence, part of speech, and 24 types of biomedical named entities in five categories of interest. 318 of the abstracts have also been syntactically annotated.
All of the annotation was based on Penn Treebank II standards, with some modifications for special characteristics of the biomedical text. The entity definitions were developed and revised in an extensive process of interaction between domain experts and biomedically trained annotators.
The oncology data comprises two subcorpora:
- The Sanger subcorpus (san) consists of abstracts of 577 articles previously annotated by the Sanger Institute for global mention of oncological named entities. These annotations were metadata reflecting the presence or absence of such mentions anywhere in the text, without reference to specific strings. The articles concentrate on variations in a small set of human genes associated with many different types of cancer; they were not part of ongoing work at Sanger, and the annotations were never published. We did not refer to the Sanger annotations after selection of the abstracts.
- The neuroblastoma subcorpus (nb) consists of 837 abstracts of articles dealing with this particular type of cancer, selected by colleagues at Children's Hospital of Philadelphia. They do not all concentrate on genetics, but they mention a much larger number of genes than the Sanger files do.
The data was prepared by the Linguistic Data Consortium for the Institute for Research in Cognitive Science, with funding from the National Science Foundation under Grant No. ITR EIA-0205448, Information Technology Research (ITR) program, in collaboration with Dr. Peter White's group in Pediatric Oncology at the Children's Hospital of Philadelphia.
The corpus contains 1414 PubMed abstracts comprising approximately 381,000 total words of text. Each file has been tokenized, and its biomedical portions (327,000 words) have been exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for 16 ("Level 1") or 23 ("Level 2") types of named entity. Each token has a part-of-speech tag.
Tokens and POS tags: Tokens in biomedical and chemical notation and terms, and spelled-out numbers, may contain whitespace and/or punctuation ("beta, 20 diol", "(Na+ + K+)ATPase", "two hundred seven"), and named entity mentions may comprise several tokens ("polychlorinated biphenyl preparations"). Tokens and entities do not span sentence boundaries.
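Because a single corpus token can itself contain whitespace, naive whitespace splitting cannot recover the annotated token boundaries; they must be read from the annotation instead. A minimal sketch of the idea, where the character-offset representation is a simplifying assumption for illustration and not the corpus's actual file layout:

```python
# Sketch: recovering multi-word tokens from standoff character offsets.
# The (start, end) offset format below is an assumed, simplified
# representation; consult the corpus documentation for the real layout.

def extract_tokens(text, spans):
    """Slice out each annotated token; a token may contain whitespace."""
    return [text[start:end] for start, end in spans]

text = "Levels rose to two hundred seven units."
# One annotated token covers the spelled-out number "two hundred seven".
spans = [(0, 6), (7, 11), (12, 14), (15, 32), (33, 38), (38, 39)]

tokens = extract_tokens(text, spans)
print(tokens)
# By contrast, text.split() would break the spelled-out number
# into three separate pieces:
print(text.split())
```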
Biomedical and non-biomedical text: The title and body of each abstract are considered to be biomedical text, and the automatic and manual annotations in them have been extensively curated. Everything else, such as citation information and author names, is considered non-biomedical; this has not been entity-annotated, and its automated tokenization and part-of-speech tags have not been curated and are known to be unreliable. In non-biomedical text, the tag "section" is used instead of "sentence", allowing users to include or exclude these parts. There are approximately 327,000 words of biomedical text and 54,000 words of non-biomedical text. (Because of a problem with software maintenance, about 24,000 tokens in biomedical text, mostly in the nb2 subcorpus, are missing POS tags.)
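The sentence/section distinction makes it easy to keep only the curated biomedical text. A sketch of that filtering step, under the assumption that annotated spans are available as (tag, text) pairs (the real standoff files may represent this differently):

```python
# Sketch: keeping only curated biomedical text.
# The (tag, text) pair representation is an assumption for illustration.
# In the corpus, curated biomedical spans carry the tag "sentence",
# while uncurated non-biomedical spans (citations, author names)
# carry the tag "section".

def biomedical_only(units):
    """Keep spans tagged 'sentence'; drop uncurated 'section' spans."""
    return [text for tag, text in units if tag == "sentence"]

units = [
    ("section", "Citation and author metadata (uncurated)"),
    ("sentence", "MYCN amplification predicts poor outcome."),
    ("section", "Journal volume and page details (uncurated)"),
]
print(biomedical_only(units))
```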
Domains: The abstracts are divided across two domains:
- the molecular genetics of cancer, from a list selected by the Cancer Genome Project of the Sanger Institute (v0.9: 588 files; v1.0: 577 files)
- neuroblastoma, a type of cancer that develops from nerve tissue in infants and children (v0.9: 569 files; v1.0: 837 files = 392 from v0.9 + 445 new)
The difference between the domains is apparent in the ratio of distinct mentions (types) of tumor types and of genes, after normalization: the Sanger files mention 3.5 times as many tumor types, while the neuroblastoma files mention 5.8 times as many genes.
Other divisions of the corpus: The files are further subdivided by annotation level into three subcorpora, each with its own subdirectory on this CD and its own set of metadata files:
- nb1: neuroblastoma annotated to level 1 (407 files)
- nb2: neuroblastoma annotated to level 2 (430 files)
- san: Sanger annotated to level 2 (all 577 files)
Metadata is also provided for several groupings of these subcorpora:
- onco: the entire v1.0 oncology corpus (1414 files)
- nb: nb1 + nb2, all the neuroblastoma data regardless of annotation level (837 files)
- o2: nb2 + san, all the level 2 data regardless of subcorpus (1007 files)
Version 0.9 is included in this release in a separate directory. It is similarly organized, though with only one level of annotation, less detailed than v1.0's level 1:
- onco09: the entire v0.9 oncology corpus (1157 files)
- nb09: neuroblastoma (569 files)
- san09: Sanger (588 files)
A subset of the v0.9 data was also syntactically annotated (treebanked):
- onco09t: all treebanked v0.9 files (318 files)
- nb09t: treebanked neuroblastoma (115 files)
- san09t: treebanked Sanger (203 files)
Principles and Methods
Many annotation projects start with an already annotated corpus, such as the Penn Treebank or the Brown Corpus, which is treated as unchangeable. As a result, annotation practices have sometimes involved compromises which might not have been necessary if the earlier annotation had been able to integrate the requirements of the later work. Such integration is necessary here because of the scope of this project, involving highly technical biomedical texts, entity definitions driven by the needs of biomedical research, and the goal of making the annotation layers work together as much as possible, e.g., using entity information in the treebank annotation of prenominal modifiers. Such integration is also possible given the relatively long term of the grant (five years) and because researchers were starting with fresh text, applying all layers of annotation themselves.
The texts are annotated at the following layers:
- Biomedical entity
- Token and part of speech
- Syntax (treebanking) (some texts only)
- Semantic relations (some oncology texts only)
Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) are applied by automatic taggers and manually corrected; entity annotation is manual. The authors originally used a POS tagger trained on Penn Treebank data, which made many errors on the very different text of these biomedical abstracts. When there was enough manually-corrected data to train a tagger, overall accuracy rose from 88.53% to 97.33% (Kulick et al. 2004 (slides)).
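The accuracy figures quoted above are standard token-level tagging accuracy: the fraction of tokens whose automatically assigned tag matches the manually corrected (gold) tag. A minimal illustration (the tag sequences below are invented toy data, not corpus output):

```python
# Token-level POS-tagging accuracy: the fraction of tokens whose
# predicted tag matches the gold (manually corrected) tag. This is
# the metric behind figures such as 88.53% vs. 97.33%.

def tag_accuracy(gold, predicted):
    """Return the proportion of positions where the tags agree."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must be the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Toy example: the tagger gets 4 of 5 tags right.
gold = ["NN", "VBD", "DT", "JJ", "NN"]
predicted = ["NN", "VBD", "DT", "NN", "NN"]
print(f"{tag_accuracy(gold, predicted):.2%}")  # → 80.00%
```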
Annotation at all layers except entity is based on the Penn Treebank II guidelines, with a number of modifications that have been found necessary, many of which were subsequently adopted by the Penn Treebank. Entity definitions came originally from domain experts and were developed and refined in dialogue with the annotators.
For an example of the annotations in this corpus, please consult this page
containing examples of the source text, the standoff annotations, tokenization,
treebank*, and interactive HTML view*.
* v0.9 annotation only
Portions © 2002-2008 Trustees of the University of Pennsylvania