General Guidelines for Entity Annotation




Introduction

These are general guidelines for biomedical entity tagging. (Actually, the name is somewhat misleading, since what we are tagging is text strings that are references to entities.) The actual definitions of the entity categories are specific to the domains -- oncology and CYP450 inhibition, maybe more later -- but these guidelines apply across domains.

Part of the development of the project and each domain is a process of exploration. Domain experts, information extraction experts, and others come up with a set of categories and rough definitions. Annotators try to apply them, and report back problems, such as

We discuss these questions by email and in weekly meetings, and we revise the definitions, add or subtract or subdivide categories, and gradually settle on categories and definitions that all the annotators can understand clearly and apply consistently.

For consistency is essential. The learning algorithms that are going to analyze the manually tagged text learn well from consistent tagging, but inconsistent tagging makes them produce irregular, random-like output. In a long-term, multi-person project like this one, all the annotators must develop the same habits and styles of tagging. That means senior personnel need to know how the annotators interpret and apply the definitions, and annotators need to tell them, and also to stay in touch with each other.

Some principles [2005-05-23]°

Regardless of the domain, we maintain a few basic principles in entity tagging, and WordFreak generally enforces them.

Category names and definitions

The definition of each category corresponds only approximately to the normal biomedical definition of the name of the category. E.g., the "Gene" category within the oncology domain also includes proteins; the "Malignancies" category includes benign tumors and precancerous states.

The problem is that for our purposes we need to have categories that are broader than the usual definitions, and there are no convenient existing names for them, so we have to use a name that covers the majority of the kinds of entity we're including. So bear in mind as you work that the names of the categories are approximations, not to be taken in their strict normal biomedical meaning.

These are not the only categories of entity we intend to tag during the course of this project, but they're the ones we're starting with. Some of the categories for future consideration are Organism (such as human, rat, fruit fly), Body Part (heart, kidney, blood), Virus (HIV, EBV), and Cell Line (HeLa, NIH 3T3, human bone marrow culture). And even over the whole course of the project we don't intend to tag everything. So don't tag these and don't worry about not tagging them.

How much of the text to tag

In routine tagging we expect to tag as much of the text as necessary to capture the information about the entity but no more. Many entities are named entities, in that they have standard or fairly standard names that are part of the working vocabulary of a domain specialist, and may be findable in lists. Some examples are

Other examples, such as phrases with modifiers, are less clear. For example, what do we do about "liver cancer"? "metastatic colon cancer"? "right-sided colon cancer"? These questions are decided by the domain specialists, and the types of modifiers to include or exclude are discussed in individual definitions.

"Information-gathering mode" vs. "defined mode"

[2003-08-05] For some categories at some times we are in "information-gathering mode", including in the tag a (possibly quite long) string of text so the domain experts can analyze different descriptions of that type of entity and decide how to treat it. The oncology annotation group did that with "variation". After analysis we subdivided that category into five or six subcategories, which are for the most part clearly defined and no longer being tagged in information-gathering mode. "Defined mode" is the basic style of tagging, using a (more or less) clear category definition and limiting the tag to a (more or less) restricted section of text.

Abbreviations [2003-11-24]

Often we see a long name followed by an abbreviation in parentheses, or after a dash. If the name refers to a kind of entity that is being tagged in your domain, tag the name and the abbreviation, separately. They are two names for the same entity. Don't tag the parentheses:

[2003-12-11] When considering how much of a long name to tag, take the abbreviation into account. The fact that the authors used an abbreviation for a phrase, even if they invented it, is very strong motivation to tag the whole phrase that it stands for, even if it includes modifiers or other terms that we would otherwise exclude. For example:

Depending on domain-specific guidelines, a modifier such as "selective" might normally be excluded from the tagged string. But here it is obviously part of the phrase, represented by the first "S" in "SSRIs". That's sufficient reason to consider "selective" part of the full name of the entity.
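
As a rough illustration of the name-plus-abbreviation rule, here is a minimal Python sketch. It is not part of WordFreak or of any project tool; the function name, the initial-letter heuristic, and the sample sentence are all assumptions made for illustration. It finds a parenthesized abbreviation, looks back over as many words as the abbreviation has capital letters, and reports two separate spans: one for the long name and one for the abbreviation, with the parentheses left out of both.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def abbreviation_spans(text: str) -> List[Tuple[Span, Span]]:
    """Return (long_name_span, abbreviation_span) pairs found in text.

    Hypothetical heuristic: the capital letters of the abbreviation must
    match the initials of the words immediately before the parenthesis.
    """
    results = []
    for m in re.finditer(r"\(([^()\s]+)\)", text):
        abbrev = m.group(1)
        initials = [c.lower() for c in abbrev if c.isupper()]
        # Words (with offsets) that precede the opening parenthesis.
        words = list(re.finditer(r"\S+", text[:m.start()]))
        if not initials or len(words) < len(initials):
            continue
        candidate = words[-len(initials):]
        if [w.group()[0].lower() for w in candidate] == initials:
            long_span = (candidate[0].start(), candidate[-1].end())
            abbrev_span = (m.start(1), m.end(1))  # parentheses excluded
            results.append((long_span, abbrev_span))
    return results


# Sample sentence invented for this sketch.
sentence = "treated with selective serotonin reuptake inhibitors (SSRIs) daily"
for (ls, le), (a_s, a_e) in abbreviation_spans(sentence):
    print(repr(sentence[ls:le]), repr(sentence[a_s:a_e]))
```

Because the heuristic counts the first "S" of "SSRIs", the modifier "selective" ends up inside the long-name span, which is the outcome described above. Real abbreviations are often less regular than this, so the annotator's judgment still decides.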

But sometimes such a name+abbreviation is part of a longer name, which continues on after the parenthesized abbreviation. In such cases we do not try to separate out the abbreviation; that would create a tag-within-tag situation, which we try to avoid. So we just tag the longer name:

Problems with automatic tagging [2003-07-04]

In natural language analysis, a token is something like an atom of text: the smallest part you expect to deal with. Most or all machine processing requires the text to have been analyzed into tokens, or tokenized. The simplest definition of a token is a word, although that's an oversimplification. Like an atom, it has smaller parts -- letters, and maybe punctuation marks as in "o'clock" -- but we may want to consider all of those as parts of the word. (Or we may not: POS (part of speech) tagging requires that "don't" be divided into "do" (a verb) and "n't" (an adverb, as a form of "not").)

An entity reference can include more than one token -- "New York" is a city reference in newswire, "gastrointestinal stromal tumors" is a malignancy reference in oncology, "acetylsalicylic acid" is a substance reference in CYP450 inhibition -- but a tag must never include just part of a token and leave out the rest. It would mess up further analysis, and WordFreak won't allow it.
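
The whole-token constraint can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the whitespace tokenizer is a stand-in, the function names are hypothetical, and nothing here reflects how WordFreak actually stores or checks tags. A tag span is acceptable only if it begins where some token begins and ends where some token ends.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def token_spans(text: str) -> List[Span]:
    """Whitespace tokenization, used here only as a stand-in tokenizer."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def aligns_with_tokens(tag: Span, tokens: List[Span]) -> bool:
    """True if the tag starts at a token start and ends at a token end."""
    start, end = tag
    starts = {s for s, _ in tokens}
    ends = {e for _, e in tokens}
    return start < end and start in starts and end in ends


text = "gastrointestinal stromal tumors"
tokens = token_spans(text)
whole_phrase = (0, len(text))
partial = (0, len("gastrointestinal str"))  # cuts "stromal" in the middle
print(aligns_with_tokens(whole_phrase, tokens))  # True
print(aligns_with_tokens(partial, tokens))       # False
```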

One simple type of problem that we've seen occurs with internal hyphens. In biomedical text we will generally want to make word-internal punctuation marks separate tokens, except when they're part of normal English words like "we've" and "they're". For example, the tokenizer should split "GIST-specific" into three tokens -- "GIST", "-", "specific" -- so we can tag "GIST" as a malignancy. But the current tokenizer sometimes slips at hyphens and splits it as "GIS", "T-", "specific". Then you, the annotator, have to fix the tokenization.
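
The intended hyphen behavior can be sketched in a few lines of Python. This is not the project's tokenizer; the regular expression and the function name are assumptions chosen to illustrate the rule. Words with an internal apostrophe stay whole, while any other punctuation mark, including a word-internal hyphen, becomes a token of its own.

```python
import re
from typing import List

# Alternatives are tried left to right at each position.
TOKEN = re.compile(
    r"[A-Za-z]+'[A-Za-z]+"   # words with an internal apostrophe: we've, they're
    r"|[A-Za-z0-9]+"         # plain runs of letters and digits
    r"|[^\sA-Za-z0-9]"       # any other non-space character as its own token
)


def tokenize(text: str) -> List[str]:
    """Split text into tokens, isolating hyphens and other punctuation."""
    return TOKEN.findall(text)


print(tokenize("GIST-specific"))  # ['GIST', '-', 'specific']
print(tokenize("we've seen GIST-specific markers."))
```

With this split, "GIST" is available as a token in its own right and can be tagged as a malignancy.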

Sentence tagging can also go wrong. You may see several sentences tagged as one, or the final period of a sentence split off as a sentence or section of its own. (The sentence tagger also marks "sections".) You should fix these errors, at least in the title and text.

Use caution when correcting the automatic tagging. It's safe to reapply the sentence tagger, but since that's probably where the error originated, you should probably fix the sentence tagging by hand. Do not re-run the token tagger after you've started entity tagging; it may mess up the work you've already done. If there's a problem with the token tagging, fix it by hand for the entity you need to tag. You don't need to examine all the tokenization and fix it manually. [Memo to self: check with Graff!]

Tags within tags

Often a reference to an entity will include a reference to another entity. In "ras signal transduction mediators", both "ras" and the entire phrase are gene entities. We have decided, in general, not to tag embedded entity references such as "ras" here; just tag the whole phrase.

° However, we make a few exceptions for some domain-specific situations, and a general exception for information-gathering mode: If a reference that we are tagging in information-gathering mode includes references to entities that we are specific about, then we will still tag the inner entities.

[2004-03-03] For example, imagine that we are annotating articles about ice cream. We have clear definitions for Flavors and Toppings, but Ingredients are still in information-gathering mode. We would tag the whole phrase "fresh strawberry extract" as an Ingredient (info-gathering), and within that phrase "strawberry" would still be tagged as a Flavor.

There are several reasons for this decision.

  1. It can be hard for annotators to analyze the structure of a complex phrase without expert domain knowledge.
  2. The tagged entities will also eventually have pointers into databases containing such information as "ATPase is an enzyme" and "voriconazole is a triazole antifungal agent". Such databases will provide much of the information about embedded entity references.
  3. Many complex entity references are not names that could be found in a list, but constructed phrases a domain specialist would understand, like "cytochrome P450-dependent arachidonate metabolism". Such expressions can be harvested from the tagged references not found in databases and analyzed so that we can develop automatic parsers for them.
  4. References tagged in information-gathering mode will be subjected to human analysis and later revisited under stricter rules. Until we know what we're tagging and not tagging for such categories, we're better off also tagging the references to entities of defined categories that occur within them.
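
The embedding rule and its information-gathering exception can be summarized in a short Python sketch. The `Tag` class, its fields, and the filtering function are hypothetical and exist only for this illustration; the domain-specific exceptions mentioned above are not modeled. A tag is kept if nothing encloses it, or if every tag that encloses it is in information-gathering mode.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tag:
    start: int                    # character offset, inclusive
    end: int                      # character offset, exclusive
    category: str
    info_gathering: bool = False  # True if the category is still exploratory


def contains(outer: Tag, inner: Tag) -> bool:
    """True if outer covers inner and is a different tag."""
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end


def apply_embedding_rule(tags: List[Tag]) -> List[Tag]:
    """Drop embedded tags unless all enclosing tags are info-gathering."""
    kept = []
    for tag in tags:
        enclosing = [o for o in tags if contains(o, tag)]
        if all(o.info_gathering for o in enclosing):
            kept.append(tag)
    return kept


# "ras" inside a defined-mode Gene phrase: only the whole phrase survives.
phrase = "ras signal transduction mediators"
whole = Tag(0, len(phrase), "Gene")
ras = Tag(0, len("ras"), "Gene")
print(apply_embedding_rule([ras, whole]))

# Ice cream example: the Flavor survives inside an info-gathering Ingredient.
ingredient_text = "fresh strawberry extract"
ingredient = Tag(0, len(ingredient_text), "Ingredient", info_gathering=True)
s = ingredient_text.find("strawberry")
flavor = Tag(s, s + len("strawberry"), "Flavor")
print(apply_embedding_rule([ingredient, flavor]))
```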

General terms

If you see a general term such as "tumor", "cancer", "gene", etc.:
  1. See if it is part of a more specific phrase that still qualifies within the category. If it is, tag that phrase with the appropriate category.
  2. [2003-09-04] Sometimes a general term is added as a way of telling the reader "By the way, this entity belongs to this general class"; the general name appears near the specific one, but it's not part of the specific name. In such cases, don't include the general name in the tag unless it's necessary for distinguishing different entities that could otherwise be confused with each other.

    For example, the same name can refer to a gene or its product. The specifics may vary by domain. The oncology entity annotators have separate tags to distinguish genes from proteins:
     

    The labels carry the distinction; the general term "gene" or "protein" is redundant.

    But CYP450 entity annotators classify proteins as Substances and don't tag genes at all:
     

    With the protein, you need to include the word "protein" to make it clear that this reference is to a Substance.
  3. If the general word is not part of a more specific phrase, does it refer in context to a specific entity or group of entities (not necessarily a small group), or does it refer to all the things of its kind? If it refers to specific instances of the type of entity, tag it; if it's completely general, don't. (The next section covers many of these cases, expressions like "these genes".) Introductory sentences and sections often contain general uses as the writer sets the stage. This may be a hard line to draw; if in doubt, tag it. If you find yourself in doubt often, ask.

[2003-07-04] In a similar vein, annotators have asked whether a class can be a member of the same category as its members. In general, the answer is yes: with the classes we're using, a class of things would generally be in the same category as any of its members or subclasses. So the "gene" category in oncology tagging includes both K-ras (a gene) and Ras (a family of genes), and the "malignancy" category includes both muscle tumors (a class of tumors) and smooth muscle tumors (a subset of the first class). In CYP450 inhibition, picric acid, hydrochloric acid, and acid or acids would all be tagged in the "substance" category.

That generalization doesn't apply when we specifically assign a subclass to a category of its own. In CYP450 inhibition annotation there is a general class of "substances" and a special class, "CYP450". Enzymes are substances, of course, and most enzymes are tagged as "substance", but the enzyme CYP450 and its variations are tagged as "CYP450", not as "substance".

Wait for the diagnosis [2003-07-04]

Don't make an interpretation based on the text; tag only what's actually said. Here's a sentence from an oncology abstract:

Southern blot analysis suggested the presence of these recombinations in the vast majority of AML cells and thus could be used as clonal markers.

The question arose: Since the sentence says that Southern blot analysis indicates the presence of the variation, should Southern blot analysis be included in the string tagged for Variation? [That category was in information-gathering mode at the time.]

The biomedical specialists decided: No; "wait for the diagnosis". Southern blot analysis is an analytic technique, not a variation. The sentence is reporting the results of the analysis. When the text reports a variation, tag that as such, but not the steps the authors took to produce their finding. The same would apply to all other tests and methods.

Discontinuous expressions, a.k.a. split coordinations or split references ° [2004-10-29]

Very often a number of similar entities or events will be described in a single collapsed manner, such as
loss of heterozygosity (LOH) at seven chromosomal loci (3p14, 7q31-32, 11q13, 13q14, 18q21, 17p13, and 17q21)
Researchers study events and entities individually even though they may describe them collectively, and a description like this refers to seven different entities (loss of heterozygosity at chromosomal locus 3p14, loss of heterozygosity at chromosomal locus 7q31-32, ...), not to a single event involving seven changes. This is no different in principle from such everyday usage as "Bill, Kate, and the Smiths all brought their children to the party" (three families), but it does cause problems for annotation.
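
To make the "seven different entities" reading concrete, here is a minimal Python sketch that spells out the individual descriptions hidden in the collapsed one. The function name and the simple comma-splitting are assumptions for illustration only; this is not a project tool, and it does not attempt to parse arbitrary coordinations.

```python
import re
from typing import List


def expand_collapsed(event: str, item_list: str) -> List[str]:
    """Expand 'event at loci (a, b, and c)' into one description per locus."""
    parts = [p.strip() for p in item_list.split(",")]
    # Strip the "and" that introduces the final item of the list.
    loci = [re.sub(r"^and\s+", "", p) for p in parts if p]
    return [f"{event} at chromosomal locus {locus}" for locus in loci]


loci_text = "3p14, 7q31-32, 11q13, 13q14, 18q21, 17p13, and 17q21"
for reading in expand_collapsed("loss of heterozygosity", loci_text):
    print(reading)
```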

What do we do with a phrase like "organic and inorganic acids"? This combined description plainly means "organic acids and inorganic acids", but it doesn't include a string "organic acids" for us to tag.* For a long time we would tag each component of such a discontinuous reference with the same label that we would want to apply to the whole string, and use the comment field of the annotations to record the connection, like this:

text              label       comment
organic           Substance   ... acids
inorganic acids   Substance   (none)
acids             Substance   organic...

* Don't even think of using "inorganic acids" without the first two letters. (1) It's a cheat. (2) It's not the actual string. (3) You can't generalize such a technique.

Around December 2003 we developed and introduced a chaining tool that lets us annotate discontinuous entities like "organic [...] acids". It's written up in its own file; see there.
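
One way to picture a chained annotation is as a single label attached to an ordered list of text spans. The Python sketch below is only an illustration of that idea: the `ChainedTag` class and `span_of` helper are hypothetical, and the chaining tool's actual data format is described in its own file, not here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass
class ChainedTag:
    label: str
    spans: List[Span]  # pieces of one reference, in text order

    def surface(self, text: str) -> str:
        """Join the pieces into the reference they stand for."""
        return " ".join(text[s:e] for s, e in self.spans)


def span_of(text: str, word: str) -> Span:
    """Offsets of the first occurrence of word in text."""
    i = text.index(word)
    return (i, i + len(word))


text = "organic and inorganic acids"
organic = span_of(text, "organic")
inorganic = span_of(text, "inorganic")
acids = span_of(text, "acids")

# Discontinuous reference: "organic [...] acids".
chained = ChainedTag("Substance", [organic, acids])
# Contiguous reference, shown in the same structure for comparison.
plain = ChainedTag("Substance", [inorganic, acids])

print(chained.surface(text))  # organic acids
print(plain.surface(text))    # inorganic acids
```

Every piece of the chain is an actual string in the text, so this representation avoids the shortcut warned against in the footnote above.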

Adjectival forms

[Adjectival forms of general words such as "cancerous", "malignant", and "genetic" are not entities; do not tag them.] [The preceding statement is obsolete. [2004-10-29]]

Sometimes specific entity information appears as an adjective rather than its usual noun form, such as "point mutational activities" for the entity "point mutation" (a Variation-type in oncology tagging). In that case we decided to tag "point mutational" as Variation-type in order to capture the important and specific information. Other cases should be brought up for discussion on the appropriate mailing list. [2003-07-31]

Singular and plural [2004-08-20]

We don't distinguish between singular and plural in marking entities. This isn't normally an issue, but sometimes we find plurals irregularly formed (the editor in me wants to say incorrectly formed) with an apostrophe, such as "Ki's" as the plural of "Ki". Tag the whole string, including the "'s", just as you would tag the singular.


CHANGE NOTES



2005-05-24