Definitions Related to Survival-status and Quantities

Annotators' home
Oncology annotators' page
Existing definitions


Download WordFreak for Survival-status

Most of the examples are taken from abstracts in our corpus. As usual, "(+)" in a text example indicates chaining. A dotted red underline with side borders surrounds a negative example, something that you might think should be given the tag under discussion but really shouldn't be.

[Still needs more examples, and full annotation of the examples already in it. [2005-02-13]]


Survival

Survival-status refers to the most basic issue for all life forms, life or death. For annotation purposes we are splitting it into two components, at least at present: Survival-status and Survival-status-modifier.

Survival-status

The core concept of survival status is simply survival vs. death:

Often you will see it modified by some reference to the state of the disease. As long as such a modifier is contiguous to the core survival status, include it in the entity reference. Several of these combinations are so common as to have standard abbreviations:

(The last two are synonymous, referring to all survivors regardless of disease condition.)

Modifiers that do not refer to the state of the disease should not be included in the Survival-status or Survival-status-modifier:

Many of them will be tagged in other ways; see below.

Survival-status-modifier

Sometimes a Survival-status has a non-adjacent modifier:

"Alive without progression or relapse" would be the Survival-status if it were a continuous string, but it isn't continuous, and the permissibility of chaining here is borderline at best. So in this sentence we would tag "alive" as Survival-status, and "without progression or relapse" as Survival-status-modifier. Use Survival-status-modifier only where the modifier describing state of the disease is not continuous with the core Survival-status term.

This restriction also lets us preserve our general Abbreviation Rule, so we don't wind up tagging "progression-free survival (PFS)" as

We would have to call "PFS" Survival-status, even though it is equivalent to a phrase that is tagged in pieces. Better to restrict the use of Survival-status-modifier to discontinuous situations, which is simpler than trying to solve problems like this one. After we have annotated the files, we may want to pull out the modifier in some way, but we will at least have data to base a sound decision on.
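The contiguity rule for Survival-status-modifier can be sketched in a few lines of Python. This is only an illustration, not part of our tooling: the `Span` class, the `find` helper, and the `tag_survival` function are invented names, the second example sentence is made up for the purpose, and the sketch assumes the annotator has already picked out the core term and the disease-state modifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def find(text: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in text (hypothetical helper)."""
    start = text.index(phrase)
    return Span(start, start + len(phrase))


def tag_survival(text: str, core: Span, modifier: Span):
    """Apply the contiguity rule to a core term and a disease-state modifier.

    If only whitespace separates the two spans, they form one Survival-status
    string. Otherwise the core is Survival-status and the modifier is
    Survival-status-modifier.
    """
    first, second = sorted([core, modifier], key=lambda s: s.start)
    gap = text[first.end:second.start]
    if gap.strip() == "":
        return [("Survival-status", text[first.start:second.end])]
    return [
        ("Survival-status", text[core.start:core.end]),
        ("Survival-status-modifier", text[modifier.start:modifier.end]),
    ]


contiguous = "progression-free survival"
print(tag_survival(contiguous, find(contiguous, "survival"),
                   find(contiguous, "progression-free")))

# Invented sentence with material between the core and the modifier.
split = "alive at 5 years without progression or relapse"
print(tag_survival(split, find(split, "alive"),
                   find(split, "without progression or relapse")))
```

The first call yields a single Survival-status string; the second yields the core and the modifier as separate tags, because "at 5 years" sits between them.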


Quantities

General notes on quantities

"Quantity" is not the name of an entity type, like "Gene/RNA" -- it's a heading for grouping together several different types of entity related to numbers and quantities. These files are full of numbers: how many patients in a study, how old this patient was at diagnosis, how long that patient survived, average serum levels before and after treatment, and so on, and on, and on. Some of them are immediately relevant to Survival-status, some are relevant to some other entities we plan on annotating, and some are not relevant to any of them.

But most numerical references have many things in common regardless of what they refer to, and so we can annotate them similarly, and automatic taggers can be trained to recognize them. So when we annotate quantities, we'll categorize them with more attention to mathematics than to medicine, and hook the two sides together later with relationship tagging. We're going to tag almost all numerical references (and some quantitative references that aren't overtly numerical, as well). Most of them will go into one of a few well-defined categories -- Count, Proportion, and Time -- with a "leftovers" category, Measurement, for the rest of them.

We make a basic distinction within these categories: Is the quantity discrete or is it continuous? In barest outline -- see the individual category sections below for details:

This section (General notes on quantities) applies in general to all these kinds of quantity.

What are quantity strings made of? The elements of quantitative references

No matter which category a quantity reference fits into, the tagged string will include one or more of the following:

This is not a list of quantitative entity types to be tagged separately, but of possible elements of a single quantity string. In each example except the last set, the element of the type under discussion is underlined.

Remember, none of the above elements except a single quantity or a range can ever be a whole quantity string by itself, and even a quantity or a range is often accompanied by one or more of the other elements. But any of them may be included in the string, subject to the limitations described below.

Impostors: Some more things that look like quantity expression elements but ain't:

Interrupted quantity expressions

When a quantitative expression is broken up by something that we don't want to include in it, and we can't chain the parts together, our best solution at present (2005-02-18) is to tag both parts as the same kind of entity reference and use "mod" in the Comment field for one of them to indicate that it should be attached to the other as a modifier:

If the "Fe(II)" were not there, we would tag "1.0 microm and above", which is equivalent to ">= 1.0 microm". Since it is there, tag both "1.0 microm" and "and above" as Measurement, and enter "mod" in the Comment field for "and above". At some point we will have to do some further processing on expressions like these, maybe adding a label Measurement-modifier analogous to Survival-status-modifier, and the comment "mod" will enable us to find these automatically.

It may not always be clear which part to tag as the "modifier". Use the e-mail list to ask.
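As a rough idea of how the "mod" comment could be picked up automatically later, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Annotation` record, the `pair_modifiers` function, and especially the pairing heuristic (attach each "mod" annotation to the nearest annotation with the same label). The text fragment is reconstructed from the example discussed above and may not match the original sentence exactly.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    label: str         # entity type, e.g. "Measurement"
    start: int         # character offset, inclusive
    end: int           # character offset, exclusive
    text: str
    comment: str = ""  # "mod" marks the modifier part


def pair_modifiers(annotations):
    """Pair each 'mod' annotation with the nearest same-label annotation.

    The nearest-neighbour rule is an assumption of this sketch, not a
    decision recorded in the guidelines.
    """
    heads = [a for a in annotations if a.comment != "mod"]
    pairs = []
    for mod in annotations:
        if mod.comment != "mod":
            continue
        candidates = [h for h in heads if h.label == mod.label]
        if not candidates:
            continue  # nothing to attach to; leave for manual review

        def distance(head):
            # Gap in characters between the two spans (0 if they touch).
            return max(head.start - mod.end, mod.start - head.end, 0)

        pairs.append((min(candidates, key=distance), mod))
    return pairs


fragment = "1.0 microm Fe(II) and above"


def annotate(phrase, label, comment=""):
    start = fragment.index(phrase)
    return Annotation(label, start, start + len(phrase), phrase, comment)


annotations = [
    annotate("1.0 microm", "Measurement"),
    annotate("and above", "Measurement", comment="mod"),
]
for head, mod in pair_modifiers(annotations):
    print(head.text, "<-", mod.text)
```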

Count

Count is that most basic of quantitative concepts, "how many?". Use this for actual counts of individuals: people, tumors, experimental animals, tissue samples, hospitals, studies.... Tag only the number, not the thing counted, which we will not consider a "unit". The number to the right of the "=" in a formulaic statement of sample size ("n = ...") is a Count. The Count is underlined in these examples:

°Count is generally an integer, of course, but an average or other statistical manipulation of several Counts (possibly not mentioned individually in the text) is usually a real number. Tag it as a Count anyway:

What's not a Count?

Clinical-stage numbers, and similar numbers in expressions like "Type 1" or "group 1", are not Counts. Neither are chromosome numbers, or codon or base pair "addresses" (relative locations).

And don't tag the numbers in expressions like "five days" or "30 mg", which are integers but could just as easily have been real numbers ("5.2 days, 30.0 mg"), unlike true Counts ("5.2 patients"?!). Time and weight are measured, not counted; see following sections. Don't tag the separate integers in Proportions.
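The sample-size formula ("n = ...") is regular enough that a pre-tagger could suggest those Counts. The Python sketch below is an illustration under assumptions: the pattern, the `sample_size_counts` name, and the example sentence are invented here. It only proposes candidates for the number to the right of the "=", and it allows a decimal part because averaged Counts may be real numbers.

```python
import re

# "n = 42", "n=17", "N = 5.5": capture only the number, not the "n =".
SAMPLE_SIZE = re.compile(r"\b[nN]\s*=\s*(\d+(?:\.\d+)?)")


def sample_size_counts(text: str):
    """Return (start, end, string) for each candidate Count after "n ="."""
    return [(m.start(1), m.end(1), m.group(1))
            for m in SAMPLE_SIZE.finditer(text)]


# Invented example sentence.
sentence = "patients (n = 42) and controls (n=17)"
for start, end, number in sample_size_counts(sentence):
    print(start, end, number)
```

Numbers in expressions like "five days" or "30 mg" are deliberately out of scope for this pattern, in line with the rule above.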

Proportion

In everyday language the word "proportion" means something like "a relationship between a part and the whole with respect to comparative magnitude, quantity, or degree" (adapted from the American Heritage Dictionary). But in our annotation we're limiting it to cases where both the part and the whole are Counts, or rather, would be tagged as Counts if the two of them together didn't qualify as a Proportion: "X out of Y" (expressed in any of several ways). The reason is that our biomedical researchers are interested at the moment in how many individuals out of a group survived or failed to survive; but we will tag any "X out of Y"-type expression as Proportion. Proportion can appear as

In percentages, include the percentage sign in the string, with chaining if necessary. Don't tag the integers of a Proportion as Counts.

Some examples; the Proportion is underlined:

°Don't use Proportion for numbers that don't relate to groups of individuals. Tag these as Measurement instead:

Proportions don't have units, but you may see them with margin of error, range, and precision or limit specifier.

Don't try to chain backwards or otherwise get bent out of shape in order to combine two Counts into a Proportion. If they aren't in very close context to each other, linked by one of these formulas ("X/Y, X out of Y, X of Y") or something similar, leave them as two Counts.
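The linking formulas ("X/Y, X out of Y, X of Y") and percentages can be approximated with a regular expression. The Python sketch below is an assumption-laden illustration, not an agreed tagger: the pattern, the `find_proportions` name, and the example sentence are invented, spelled-out numbers ("five of ten") are not handled, and every hit is only a candidate. A percentage that does not relate to a group of individuals should still be tagged as Measurement by the annotator.

```python
import re

NUM = r"\d+"
PROPORTION = re.compile(
    rf"\b{NUM}\s*/\s*{NUM}\b"                # X/Y
    rf"|\b{NUM}\s+(?:out\s+)?of\s+{NUM}\b"   # X out of Y, X of Y
    rf"|\b\d+(?:\.\d+)?\s*%"                 # percentage, sign included
)


def find_proportions(text: str):
    """Return candidate Proportion strings in left-to-right order."""
    return [m.group(0) for m in PROPORTION.finditer(text)]


# Invented example sentence.
sentence = "Tumors recurred in 12 of 40 patients (30%), and 7/40 died."
print(find_proportions(sentence))
```

Because each alternative requires the two integers to be joined directly by one of the formulas, Counts that are far apart in the sentence are never combined, which matches the advice not to chain backwards.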

Time

Time references usually include both the unit and a number, and may also include margin of error, range, and precision and limit specifiers. Calendar references ("1999, October-December") are Times. (If we ever come across a clock reference, like "1:30 pm", it will be a Time too.) Do not include words like "age, aged, old" when they simply tell what is being measured (see Quantitative-classifier). NOTE: This involves a significant redefinition of Developmental-state (below).

Examples, with Time underlined:

Measurement

This is a sort of wastebasket category for quantitative references that aren't Counts, Proportions, or Times. Like Times, they can include unit, margin of error, range, and precision and limit specifiers. As we have found in CYP annotation, units can get quite messy, involving divisions, multiplications, exponents, and even substance names. We'll try to be consistent on these but we won't spend a lot of time on the ones we aren't focused on.

Examples, with Measurement underlined:

  1. 48 +/- 10.2 ng/ml
  2. B(max) of 3.14 +/- 0.26 pmol/mg protein
    (I.e., 3.14 ± 0.26 picomoles per [milligram of protein].)

  3. 8.6+/-1.6 mL min(-1) (mg protein)(-1)
    (The parenthesized numbers are exponents.)

  4. Human placental microsomes (50 micrograms protein) were incubated
    (Here the protein is the substance being measured, not part of the unit as in the previous two examples. See next paragraph.)

  5. P < 0.001
  6. P = 0.044
  7. P = 0.009-0.003
  8. chemotherapy-alone group (P = 0.047)

"Substance" is one of the major entity categories in CYP annotation, and so the CYP annotators are careful to distinguish the substance being measured (#4) from a substance that is part of the unit (#2, 3). We will try to distinguish them, but since the distinction is peripheral to the main focus of our biomedical research, don't spend a lot of time trying to figure out whether a substance is in the denominator or not. If it obviously is or isn't, tag it as you see it; if you can't figure it out quickly (often if it's at the end of a long expression), assume that the substance belongs with the denominator and include it in the quantitative expression as part of the unit.

Quantitative-classifier

When we annotate relations later in this project, we will associate quantitative expressions with the entities they describe; for example, in "alive and free of disease from 10 to 239 months", a relation will show that the Time expression "from 10 to 239 months" applies to the Survival-status expression "alive and free of disease".

But the word "old" in "1 year old" doesn't specify a developmental state: a one-year-old isn't old.* In this construction "old" just serves to classify the Time expression. But if we leave this "old" without a tag, how will we be able to specify later that this Time expression "1 year" refers to developmental state rather than to survival time or follow-up period or something else?

(* The distinction depends on the grammatical construction, not the particular number involved, so it is equally true of "old" in "60 years old". But we're not talking about every use of the words "old" or "aged". Sometimes they clearly specify a developmental state, as in "an old man" or "vascular degeneration in the aged". See Developmental-state.)

By labeling it now as a Quantitative-classifier, we will later be able to establish a relation between two tagged entity references, the Time "1 year" and the Quantitative-classifier "old", and then establish that relation as the developmental state associated with a Malignancy, a step-by-step procedure that automatic taggers can be trained to follow.

°We'll use the label Quantitative-classifier for a word or phrase that is closely associated with a quantitative expression and serves only to specify, possibly redundantly, what the expression refers to. When there is a biomedical entity tag for that specifier, such as Developmental-state, Clinical-stage, or Survival-status, do not double-tag it as Quantitative-classifier.

°In Survival-status annotation most Quantitative-classifiers are going to be terms relating to age -- "age, aged, old" -- along with some other Time terms. As we later annotate other aspects of malignancy, such as tumor dimensions, we may find other such words: "3 cm long; 6 feet tall; a mile wide and an inch deep".

To revisit some of our Time examples, underlining Quantitative-classifier:

Other words can fulfill the same function:

We will probably encounter and decide on others, but we haven't determined the limits yet. This is a topic for discussion on the e-mail list.

Statistical-modifier

This is a term specifying a statistical manipulation that produces a single number from a set of numbers, generally "average", "mean", or "median":

We will also use it for "mean +/- SD", "mean value +/- 2 SD", and probably other such things that we will encounter (at least, until and unless we figure out a better way to tag these!). °


Developmental-state

°We are redefining what used to be Malignancy-developmental-state. As you annotate developmental states under these new definitions, delete the old Malignancy-developmental-state labels.

[Definition carried over from the earlier definition of Malignancy-developmental-state]
Developmental-state represents different developmental parts of the lifetime of the malignancy's host (in the sense that a parasite lives in a host): an individual (usually a patient or patients), a cell line, or a tissue.

By separating quantitative expressions from the entities they refer to, we're creating a problem for ourselves. Under the older definitions, "young" and "old" (in "old women") and "five years old" would all be tagged the same way, as Malignancy-developmental-state. In the new definitions "five years" is Time and "old" is Quantitative-classifier. If we continue to label "young" and "old (women)" as Malignancy-developmental-state, we run the risk of confusing references tagged under the old definition with references tagged under the new one.

To avoid this problem, we'll use the label "Developmental-state", without "Malignancy", for the new definition, reserving it for words like "young", "infant/infancy", and "old" when it refers to being old rather than to the dimension of age. Here is a list of such words, borrowed from the earlier definition of Malignancy-developmental-state and expanded. We will probably find a few more such words, but we don't expect the list to expand without end.

If one of these terms is accompanied by a modifier of degree, the modifier should be included in the tag as well:

Similarly, some comparative words can also be values of this attribute:


CHANGE NOTES



2005-03-10