| Annotators' home |
| Oncology annotators' page |
| Existing definitions |
Download WordFreak for Survival-status
Most of the examples are taken from abstracts in our corpus. As usual, "(+)" in a text example indicates chaining. A dotted red underline with side borders surrounds a negative example, something that you might think should be given the tag under discussion but really shouldn't be.
[Still needs more examples, and full annotation of the examples already in it. [2005-02-13]]
Survival-status refers to the most basic issue for all life forms, life or death. For annotation purposes we are splitting it into two components, at least at present: Survival-status and Survival-status-modifier.
The core concept of survival status is simply survival vs. death:
Often you will see it modified by some reference to the state of the disease. As long as such a modifier is contiguous to the core survival status, include it in the entity reference. Several of these combinations are so common as to have standard abbreviations:
Modifiers that do not refer to the state of the disease should not be included in the Survival-status or Survival-status-modifier:
Sometimes a Survival-status has a non-adjacent modifier:
This restriction also lets us preserve our general Abbreviation Rule, so we don't wind up tagging "progression-free survival (PFS)" as
"Quantity" is not the name of an entity type, like "Gene/RNA" -- it's a heading for grouping together several different types of entity related to numbers and quantities. These files are full of numbers: how many patients in a study, how old this patient was at diagnosis, how long that patient survived, average serum levels before and after treatment, and so on, and on, and on. Some of them are immediately relevant to Survival-status, some are relevant to some other entities we plan on annotating, and some are not relevant to any of them.
But most numerical references have many things in common regardless of what they refer to, and so we can annotate them similarly, and automatic taggers can be trained to recognize them. So when we annotate quantities, we'll categorize them with more attention to mathematics than to medicine, and hook the two sides together later with relationship tagging. We're going to tag almost all numerical references (and some quantitative references that aren't overtly numerical, as well). Most of them will go into one of a few well-defined categories -- Count, Proportion, and Time -- with a "leftovers" category, Measurement, for the rest of them.
We make a basic distinction within these categories: Is the quantity discrete or is it continuous? In barest outline -- see the individual category sections below for details:
The number of individuals in a group is a Count.
The ratio between numbers of individuals in two groups, whether expressed as two integers (17/20, 17 of 20, 17 out of 20) or as a percentage (85%) or even as a decimal (0.85) is a Proportion.
A reference to time is a Time, whether it refers to
A number that refers to anything else is a Measurement.
This section (General notes on quantities) applies in general to all these kinds of quantity.
No matter which category a quantity reference fits into, the tagged string will include one or more of the following:
a quantity. This is usually numeric, whether written in digits or in words:
Nonnumeric quantities: In some cases the quantity can also be a nonnumeric expression, but only if the text can be interpreted mathematically in terms of one of the categories below. The following are all Proportions.°
Non-numeric and not quantities: In contrast, the following do not qualify as quantity strings; we can't do mathematics with them:
found in many tumors
(There's no way of telling how many are enough to be "many". Many Americans like sushi, but probably under 25%. Many Penn students eat fast food at least three days a week, probably over 50%... but certainly fewer than the number of Americans who like sushi.)
in a few hours after admission [PMID: 10930802]
("Few" and "a few" work, or fail to work, the same way as "many".)
a range rather than a single quantity. It can be expressed symbolically or verbally:
Beware of expressions like the following, where "from... to..." doesn't indicate a range, but the beginning and end points of a change:
a unit of measurement, where applicable (Time and Measurement, not Count or Proportion):
Things being counted are not units (see Count):
a margin of error:
a qualifier of imprecision:
a limit specifier, either verbal or symbolic (in large red boldface type, because an underlined "<" or ">" symbol looks like a "less than or equal" or "greater than or equal" symbol):
But not a bare equal sign, which doesn't modify the quantity: there is no limit specifier in
Remember, none of the above elements except a single quantity or a range can ever be a whole quantity string by itself, and even a quantity or a range is often accompanied by one or more of the other elements. But any of them may be included in the string, subject to the limitations described below.
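To make the rule above concrete, here is a minimal sketch in Python of how the elements of a quantity string might be recorded and checked. The class name `QuantityString`, its field names, and the `is_valid` check are all assumptions made up for illustration; they are not part of WordFreak or of our tag set.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QuantityString:
    """Hypothetical record of the elements found in one tagged quantity string."""
    category: str                                   # "Count", "Proportion", "Time", or "Measurement"
    quantity: Optional[str] = None                  # e.g. "3.14" or "five"
    value_range: Optional[Tuple[str, str]] = None   # e.g. ("10", "239")
    unit: Optional[str] = None                      # e.g. "months"
    margin_of_error: Optional[str] = None           # e.g. "+/- 0.26"
    imprecision: Optional[str] = None               # e.g. "about"
    limit: Optional[str] = None                     # e.g. ">" or "up to"

    def is_valid(self) -> bool:
        # A quantity or a range must be present; no other element
        # can be a whole quantity string by itself.
        has_core = self.quantity is not None or self.value_range is not None
        # Units apply to Time and Measurement, not to Count or Proportion.
        unit_ok = self.unit is None or self.category in ("Time", "Measurement")
        return has_core and unit_ok


measurement = QuantityString(
    "Measurement", quantity="3.14", unit="pmol/mg protein", margin_of_error="+/- 0.26"
)
unit_only = QuantityString("Time", unit="months")
print(measurement.is_valid(), unit_only.is_valid())
```

The check only encodes the two constraints stated in this section; the limitations described below would need further rules.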
Impostors: Some more things that look like quantity expression elements but ain't:
only: "Only" just says that the quantity is less than might be expected, which is not something we can annotate at present.
When a quantitative expression is broken up by something that we don't want to include in it, and we can't chain the parts together, our best solution at present (2005-02-18) is to tag both parts as the same kind of entity reference and use "mod" in the Comment field for one of them to indicate that it should be attached to the other as a modifier:
If the "Fe(II)" were not there, we would tag "1.0 microm and above". This is equivalent to saying ">= 1.0 microm", and we tag that. Tag both "1.0 microm" and "and above" as Measurement, and enter "mod" in the Comment field for "and above". At some point we will have to do some further processing on expressions like these, maybe adding a label Measurement-modifier analogous to Survival-status-modifier, and the comment "mod" will enable us to find these automatically.
It may not always be clear which part to tag as the "modifier". Use the e-mail list to ask.
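As a rough illustration of how the "mod" comment could later be used to find these split expressions automatically, here is a minimal sketch in Python. Everything in it is an assumption: the `Annotation` record, the `attach_modifiers` function, the nearest-neighbor pairing heuristic, and the sentence fragment itself, which is reconstructed from the discussion above rather than quoted from an abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Annotation:
    label: str         # entity type, e.g. "Measurement"
    start: int         # character offset where the tagged span begins
    end: int           # character offset just past the tagged span
    text: str          # the tagged string
    comment: str = ""  # "mod" marks a detached modifier


def attach_modifiers(annotations: List[Annotation]) -> List[Tuple[Annotation, Annotation]]:
    """Pair each "mod" annotation with the nearest non-"mod" annotation of the same label.

    "Nearest" is an assumed heuristic, not a rule from these guidelines.
    """
    pairs = []
    for mod in annotations:
        if mod.comment != "mod":
            continue
        heads = [a for a in annotations if a.label == mod.label and a.comment != "mod"]
        if not heads:
            continue  # nothing to attach to; leave for manual review
        nearest = min(
            heads,
            key=lambda a: min(abs(mod.start - a.end), abs(a.start - mod.end)),
        )
        pairs.append((nearest, mod))
    return pairs


# Hypothetical fragment reconstructed from the discussion above.
text = "1.0 microm Fe(II) and above"
annotations = [
    Annotation("Measurement", 0, 10, "1.0 microm"),
    Annotation("Measurement", 18, 27, "and above", comment="mod"),
]

for head, mod in attach_modifiers(annotations):
    print(head.text, "<-", mod.text)
```

A real post-processing step would also have to cope with cases where the choice of "modifier" was unclear, which is why those cases should go to the e-mail list first.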
Count is that most basic of quantitative concepts, "how many?". Use this for actual counts of individuals: people, tumors, experimental animals, tissue samples, hospitals, studies.... Tag only the number, not the thing counted, which we will not consider a "unit". The number to the right of the "=" in a formulaic statement of sample size ("n = ...") is a Count. The Count is underlined in these examples:
°Count is generally an integer, of course, but an average or other statistical manipulation of several Counts (possibly not mentioned individually in the text) is usually a real number; but tag it as a Count anyway:
Clinical-stage numbers, and similar numbers in expressions like "Type 1" or "group 1", are not Counts. Neither are chromosome numbers, or codon or base pair "addresses" (relative locations).
And don't tag the numbers in expressions like "five days" or "30 mg", which are integers but could just as easily have been real numbers ("5.2 days, 30.0 mg"), unlike true Counts ("5.2 patients"?!). Time and weight are measured, not counted; see following sections. Don't tag the separate integers in Proportions.
In everyday language the word "proportion" means something like "a relationship between a part and the whole with respect to comparative magnitude, quantity, or degree" (adapted from the American Heritage Dictionary). But in our annotation we're limiting it to "X out of Y" cases (expressed in any of several ways) where both the part and the whole are Counts, or rather, would be tagged as Counts if the two of them together didn't qualify as a Proportion. The reason is that our biomedical researchers are interested at the moment in how many individuals out of a group survived or failed to survive; but we will tag any "X out of Y"-type expression as Proportion. Proportion can appear as
Some examples; the Proportion is underlined:
°Don't use Proportion for numbers that don't relate to groups of individuals. Tag these as Measurement instead:
Proportions don't have units, but you may see them with margin of error, range, and precision or limit specifier.
Don't try to chain backwards or otherwise get bent out of shape in order to combine two Counts into a Proportion. If they aren't in very close context to each other, linked by one of these formulas ("X/Y, X out of Y, X of Y") or something similar, leave them as two Counts.
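The different surface forms of a Proportion ("17/20", "17 of 20", "17 out of 20", "85%", "0.85") all express the same ratio. The sketch below, in Python, shows one way later processing might reduce them to a single number. The function name `proportion_value` and the regular expressions are assumptions for illustration only, and the function does not decide whether a string is a Proportion: that judgment (for example, that a percentage relates to groups of individuals and is not a Measurement) stays with the annotator.

```python
import re
from typing import Optional

# Two integers joined by one of the formulas "X/Y", "X of Y", "X out of Y".
_PAIR = re.compile(r"^\s*(\d+)\s*(?:out of|of|/)\s*(\d+)\s*$")
# A percentage such as "85%" or "12.5 %".
_PERCENT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*%\s*$")
# A decimal between 0 and 1, such as "0.85" or ".85".
_DECIMAL = re.compile(r"^\s*(0?\.\d+|1(?:\.0+)?|0)\s*$")


def proportion_value(text: str) -> Optional[float]:
    """Return the ratio expressed by an already-tagged Proportion string, or None."""
    m = _PAIR.match(text)
    if m:
        part, whole = int(m.group(1)), int(m.group(2))
        return part / whole if whole else None  # avoid dividing by zero
    m = _PERCENT.match(text)
    if m:
        return float(m.group(1)) / 100
    m = _DECIMAL.match(text)
    if m:
        return float(m.group(1))
    return None  # nonnumeric or unrecognized form


for s in ["17/20", "17 of 20", "17 out of 20", "85%", "0.85"]:
    print(s, "->", proportion_value(s))
```

Nonnumeric Proportions and Proportions with ranges, margins of error, or limit specifiers are outside what this sketch handles; it returns `None` for them.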
Time references usually include both the unit and a number, and may also include margin of error, range, and precision and limit specifiers. Calendar references ("1999, October-December") are Times. (If we ever come across a clock reference, like "1:30 pm", it will be too.) Do not include words like "age, aged, old" when they simply tell what is being measured (see Quantitative-classifier). NOTE: This involves a significant redefinition of developmental-state (below).
Examples, with Time underlined:
within the first days following admission [PMID: 10930802]
("The first days" is a period of time, although we won't include "following admission" which describes the period. "Within" and "between" are to periods and ranges as "=" is to exact values: it associates the quantity syntactically with the rest of the sentence, but it doesn't affect the interpretation of the quantity the way "before", "up to", and "</=" do. See limit specifier.)
This is a sort of wastebasket category for quantitative references that aren't Counts, Proportions, or Times. Like Times, they can include unit, margin of error, range, and precision and limit specifiers. As we have found in CYP annotation, units can get quite messy, involving divisions, multiplications, exponents, and even substance names. We'll try to be consistent on these but we won't spend a lot of time on the ones we aren't focused on.
Examples, with Measurement underlined:
B(max) of 3.14 +/- 0.26 pmol/mg protein
(I.e., 3.14 ± 0.26 picomoles per [milligram of protein].)
8.6+/-1.6 mL min(-1) (mg protein)(-1)
Human placental microsomes (50 micrograms protein) were incubated
(Here the protein is the substance being measured, not part of the unit as in the previous two examples. See next paragraph.)
"Substance" is one of the major entity categories in CYP annotation, and so the CYP annotators are careful to distinguish the substance being measured (#4) from a substance that is part of the unit (#2, 3). We will try to distinguish them, but since the distinction is peripheral to the main focus of our biomedical research, don't spend a lot of time trying to figure out whether a substance is in the denominator or not. If it obviously is or isn't, tag it as you see it; if you can't figure it out quickly (often if it's at the end of a long expression), assume that the substance belongs with the denominator and include it in the quantitative expression as part of the unit.
When we annotate relations later in this project, we will associate quantitative expressions with the entities they describe; for example, in "alive and free of disease from 10 to 239 months", a relation will show that the Time expression "from 10 to 239 months" applies to the Survival-status expression "alive and free of disease".
But the word "old" in "1 year old" doesn't specify a developmental state: a one-year-old isn't old.* In this construction "old" just serves to classify the Time expression. But if we leave this "old" without a tag, how will we be able to specify later that this Time expression "1 year" refers to developmental state rather than to survival time or follow-up period or something else?
(* The distinction depends on the grammatical construction, not the particular number involved, so it is equally true of "old" in "60 years old". But we're not talking about every use of the words "old" or "aged". Sometimes they clearly specify a developmental state, as in "an old man" or "vascular degeneration in the aged". See Developmental-state.)
By labeling it now as a Quantitative-classifier, we will later be able to establish a relation between two tagged entity references, the Time "1 year" and the Quantitative-classifier "old", and then establish that relation as the developmental state associated with a Malignancy, a step-by-step procedure that automatic taggers can be trained to follow.
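The step-by-step procedure described above can be pictured with a small data model. This is only a sketch in Python under stated assumptions: the `Entity` and `Relation` classes, the relation names "classified-as" and "developmental-state-of", and the Malignancy string are all invented placeholders, since the relation tagging scheme has not been defined yet.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Entity:
    label: str  # tag name, e.g. "Time"
    text: str   # tagged string


@dataclass(frozen=True)
class Relation:
    name: str                           # hypothetical relation name
    source: Union[Entity, "Relation"]   # a relation may itself take part in a relation
    target: Union[Entity, "Relation"]


time = Entity("Time", "1 year")
classifier = Entity("Quantitative-classifier", "old")
malignancy = Entity("Malignancy", "neuroblastoma")  # placeholder, not from our corpus

# Step 1: relate the Time to its Quantitative-classifier.
age = Relation("classified-as", time, classifier)
# Step 2: treat that relation as the developmental state associated with the Malignancy.
dev_state = Relation("developmental-state-of", age, malignancy)

print(dev_state.source.source.text, dev_state.source.target.text, dev_state.target.text)
```

Letting a relation serve as the source of another relation is one possible way to keep the two steps separate; the eventual relation scheme may well represent this differently.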
°We'll use the label Quantitative-classifier for a word or phrase that is closely associated with a quantitative expression and serves only to specify, possibly redundantly, what the expression refers to. When there is a biomedical entity tag for that specifier, such as Developmental-state, Clinical-stage, or Survival-status, do not double-tag it as Quantitative-classifier.
°In Survival-status annotation most Quantitative-classifiers are going to be terms relating to age -- "age, aged, old" -- but some will be other Time terms. As we later annotate other aspects of malignancy, such as tumor dimensions, we may find other such words: "3 cm long; 6 feet tall; a mile wide and an inch deep".
To revisit some of our Time examples, underlining Quantitative-classifier:
Other words can fulfill the same function:
average survival time of 4.6 years
("survival" is Survival-status and "average" is a Statistical-modifier, so don't tag either of them as Quantitative-classifier.)
median observation time of 19 months
("median" is a Statistical-modifier, so....)
relative molecular mass 41,700
This is a term specifying a statistical manipulation that produces a single number from a set of numbers, generally "average", "mean", or "median":
°We are redefining what used to be Malignancy-developmental-state. As you annotate developmental states under these new definitions, delete the old Malignancy-developmental-state labels.
[Definition carried over from the earlier definition of Malignancy-developmental-state]
Developmental-state represents different developmental parts of the lifetime of the malignancy's host (in the sense that a parasite lives in a host): an individual (usually a patient or patients), a cell line, or a tissue.
By separating quantitative expressions from the entities they refer to, we're creating a problem for ourselves. Under the older definitions, "young" and "old" (in "old women") and "five years old" would all be tagged the same way, as Malignancy-developmental-state. In the new definitions "five years" is Time and "old" is Quantitative-classifier. If we continue to label "young" and "old (women)" as Malignancy-developmental-state, we run the risk of confusing references tagged under the old definition with references tagged under the new one.
To avoid this problem, we'll use the label "Developmental-state", without "Malignancy", for the new definition, reserving it for words like "young", "infant/infancy", and "old" when it refers to being old rather than to the dimension of age. Here is a list of such words, borrowed from the earlier definition of Malignancy-developmental-state and expanded. We will probably find a few more such words, but we don't expect the list to expand without end.
Development of person or tissue:
Development of cell line [We haven't encountered this yet in Survival-status, so this is still subject to revision. 2005-02-03]:
If one of these terms is accompanied by a modifier of degree, the modifier should be included in the tag as well:
Similarly, some comparative words can also be values of this attribute:
2005-03-10