Defining "chemicals" for tagging


On Thursday, 2003-04-03, some of the members of the ITR/E project met for a "ChemFest", an all-day session to develop at least a draft of a spec for chemical named entity tagging. Even something as simple as defining a category of "chemical entity" would do, recognizing that there will be fuzzy edges through which we may want to draw a clean line.

We referred to 40 MedLine abstracts supplied by Andy Schein, and later to 10 on protein-DNA complexes that Alex brought.

Here's what we decided, insofar as my notes and memory are correct. Questions and unresolved issues are in red. Additions and corrections are invited. -- Mark Mandel 2003-04-03.

[2003-06-18] Now that the CYP450 Inhibition Annotation is using this definition page, I'll be adding to it some decisions we have made, tagged [cyp].


General principles

  1. We may have somewhat different expectations and requirements of human annotators and tagging programs.
  2. Avoid hierarchical marking where possible; don't ask annotators to subdivide classes.
  3. A class can be a member of the same category as a member of that class, but ...
  4. A class is not coreferent with any proper subset of itself.
  5. Entity-within-entity reference is OK: a string referring to an entity can be a substring of a string referring to another entity.
  6. The oncology tagging does not permit discontinuous entities or allow two entities to overlap (share a common middle section and each have a non-shared section, one to the left and one to the right). This is a restriction of the tool, which seems also to reflect the way language behaves. Should we formalize these restrictions here?
  7. A part or chunk of a molecule, such as a region or a site, is not a "chemical" unless it's out there on its own, no longer part of what it used to belong to.
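
Principles 5 and 6 can be expressed as a check on character offsets. What follows is a minimal sketch, not the behavior of the actual tagging tool: the function name span_relation and the (start, end) offset convention with an exclusive end are assumptions made for illustration.

```python
def span_relation(a, b):
    """Classify two (start, end) character spans, end exclusive.

    Returns "disjoint", "nested" (one span inside the other, or identical),
    or "crossing" (a shared middle section plus a non-shared part on each
    side), which is the case Principle 6 rules out.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    a_contains_b = a_start <= b_start and b_end <= a_end
    b_contains_a = b_start <= a_start and a_end <= b_end
    if a_contains_b or b_contains_a:
        return "nested"
    return "crossing"


text = "the oxides of copper"
whole = (0, len(text))                        # "the oxides of copper"
inner = (text.index("copper"), len(text))     # "copper"
print(span_relation(whole, inner))            # entity within entity (Principle 5)
print(span_relation((0, 10), (4, 20)))        # the overlap excluded by Principle 6
```

A discontinuous entity cannot be written as a single (start, end) pair at all, so this representation excludes it by construction.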

Terms

A (massively!) non-exhaustive listing of some of the terms we decided are and aren't in the class that we're calling "chemicals".

"chemicals"

may be "chemicals", or not

Ask the human annotators to judge based on context.

not "chemicals"

we're not sure


Syntactic constructions

These are syntactic constructions that require decisions, provide useful information, or are otherwise important. Underlining shows tagged units.

Some of these constructions carry information that we want to be able to tag during the entity tagging, which requires enhancing the tagging tool. I've added that in the comments, but I may have missed some.

Parenthesized abbreviation

1. bovine somatotropin (bST)
   -------------------  ---


This is a very common construction, in which a term is immediately followed by a symbol or abbreviation that will be used for it in the text. The term and the abbreviation are separate entities and coreferent.
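
For a tagging program (General Principle 1), one rough way to propose candidates for this construction is sketched below. It is a heuristic of my own for illustration, not something decided at the meeting: a short parenthesized token counts as an abbreviation only if its characters occur, in order, in the words just before it. The names is_subsequence and find_abbreviation_pairs and the max_words limit are hypothetical.

```python
import re


def is_subsequence(short, long):
    """True if the characters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def find_abbreviation_pairs(text, max_words=5):
    """Return [(term_span, abbr_span)] with (start, end) offsets, end exclusive.

    For each "(TOKEN)", the term is the shortest run of preceding words that
    contains the token's characters in order, ignoring case.
    """
    pairs = []
    for m in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", text):
        abbr = m.group(1).lower()
        words = list(re.finditer(r"\S+", text[:m.start()]))
        for n in range(1, min(max_words, len(words)) + 1):
            start = words[-n].start()
            term = text[start:m.start()].rstrip()
            if is_subsequence(abbr, term.lower()):
                pairs.append(((start, start + len(term)), (m.start(1), m.end(1))))
                break
    return pairs


text = "bovine somatotropin (bST) was given"
for term_span, abbr_span in find_abbreviation_pairs(text):
    print(text[slice(*term_span)], "<->", text[slice(*abbr_span)])
```

The two spans come back as separate entities, which a caller could then mark as coreferent. A human annotator would still need to confirm each candidate.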

Phrase forming a name

2. the phosphorylated form of chemical-name
                              -------------? (see below)
   ----------------------------------------

The whole phrase constitutes a name, analogous to

3. the oxides of copper                   4. cuprous oxide
                 ------? (see below)         -------------
   --------------------

But #3 and #4 are not coreferent (General Principle 4).

Do we want the tagging software to recognize a chemical name within a chemical name? Do we care? (General Principles 1 and 5)

Solutions, etc.

5. a .5M solution of chemical-name
                     -------------

A solution is not a chemical. The same should hold for dilutions and so on. Can we get a(n incomplete) list of such terms?

Right node raising

6. COX-1 ____ but not COX-2 mRNA
   ----------         ----------

The word mRNA is omitted, but understood, after COX-1. Penn Treebanking inserts an empty "right node raising" (RNR) node in that spot, linked to the string mRNA that is in the text after COX-2. We would like to be able to do that in the entity tagging. This requires an enhancement to the tool.

Left node raising

7. CYP1A and ___2D
   -----     -----
 

This is similar to #6, but the actual string in the text (CYP) is to the left of the omission. I don't think Penn Treebanking has left node raising (LNR). Requires enhancement to tool.

8. CYP1A ______ and ___2D enzymes
   ------------     -------------

#7 is actually a simplification. #8 is how we saw this in the text, with LNR of CYP embedded within RNR of enzymes.
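
One possible way to record node raising in entity tagging is sketched here. It is a hypothetical representation, not the tool's format and not Penn Treebank's: each entity is a list of pieces, and a piece flagged as shared borrows its text from a span that physically sits elsewhere in the sentence. The names Piece, shared, glue, and expand are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    start: int            # character offset where the piece's text begins
    end: int              # offset just past the piece's text
    shared: bool = False  # True: fills a gap with text located elsewhere
    glue: str = ""        # separator written before this piece on expansion


def expand(text, pieces):
    """Spell out the full name an entity stands for."""
    return "".join(p.glue + text[p.start:p.end] for p in pieces)


text = "CYP1A and 2D enzymes"
cyp_start = text.index("CYP")
enz_start = text.index("enzymes")
two_d_start = text.index("2D")

# "CYP1A ______": the RNR gap is filled by "enzymes" from the right.
entity_1 = [Piece(0, 5), Piece(enz_start, len(text), shared=True, glue=" ")]
# "___2D enzymes": the LNR gap is filled by "CYP" from the left.
entity_2 = [Piece(cyp_start, cyp_start + 3, shared=True),
            Piece(two_d_start, len(text))]

print(expand(text, entity_1))
print(expand(text, entity_2))
```

Keeping the shared flag on the piece, rather than copying the string, preserves the link back to the text that actually occurs in the abstract.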

Determinative condition in parentheses

9. Major testosterone metabolites formed in vitro were 2beta-(CYP3A),
6beta- (CYP3A, CYPIA), and 16beta- (CYP2B) hydroxytestosterone and
androstenedione (CYP2B, CYP2C11).

The expression in blue not only has RNR; each of the left-hand components also has a parenthesized trailer, an agent or condition that brought about this particular event: CYP3A induced the formation of 2beta-hydroxytestosterone, CYP3A and CYPIA (probably a typo for CYP1A) induced the formation of 6beta-hydroxytestosterone, etc. Not to be confused with the abbreviation construction!

Class + instance(s)

10. mGluR agonists L-glutamate and quisqualate
    -----
    -------------- -----------     -----------

mGluR is a "chemical", and so are agonists that act on it. The agonists are a class, to which L-glutamate and quisqualate belong. There is no coreference here (General Principle 4).

We would like to be able to tag this construction, conceptually like this:

       class + instance construction
==========================================
   class         instance        instance
-------------- -----------     -----------
mGluR agonists L-glutamate and quisqualate

This requires an enhancement of the tool.
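
As a rough illustration of what such an enhancement might store, here is a hypothetical record for example #10. The type name ClassInstance, its field names, and the helper span_of are assumptions for this sketch, not part of the tool.

```python
from typing import List, NamedTuple, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


class ClassInstance(NamedTuple):
    class_span: Span            # the class, e.g. "mGluR agonists"
    instance_spans: List[Span]  # its listed members; none coreferent with the class


def span_of(text, phrase):
    """Offsets of the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return (start, start + len(phrase))


text = "mGluR agonists L-glutamate and quisqualate"
construction = ClassInstance(
    class_span=span_of(text, "mGluR agonists"),
    instance_spans=[span_of(text, "L-glutamate"), span_of(text, "quisqualate")],
)
nested_entity = span_of(text, "mGluR")  # entity within the class name (Principle 5)

print(construction)
print(nested_entity)
```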

Chemical + nonchemical

11. EDTA assay
    ----

"Chemical" names are often used as modifiers. If the head of the phrase -- the thing modified -- isn't a "chemical" in our sense, as in #11, the head doesn't get tagged, and neither does the phrase as a whole.

Chemical - deverbal adjective + X

12. aspirin-triggered 15-epi-lipoxin A(4)
    -------           -------------------
    -------------------------------------

We often see a chemical name followed by a hyphen and an adjective formed from a verb: triggered, activated (or activating), and so on.

Measurement

These can be pretty complex, and when rendered in ASCII they can look even worse:

13. For l-methamphetamine, the apparent K(m1) and K(m2) were
        -----------------
    1.07 +/- 0.01 and 350 +/- 2.7 micro M, and V(max1) and V(max2)
    were 4.70 +/- 0.01 and 8.9 +/- 0.02 nmol min(-1) mg protein(-1),
    respectively.                                       -------

The only "chemical" references here are l-methamphetamine and protein. The (-1) after protein is an exponent: mg protein(-1) means "per milligram of protein", as shown below. Not to be confused with ions.

8.9 ± 0.02 nmol / (min·mg protein)
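
A tagging program would need some way to tell these exponents from ion charges. The sketch below rests on an assumption about how the abstracts render superscripts in ASCII, which we did not verify against the data: an exponent is a minus sign followed by digits, as in min(-1), while a charge is optional digits followed by a sign, as in the illustrative Ca(2+), which is not taken from our abstracts. The function name classify_parenthesized is hypothetical.

```python
import re


def classify_parenthesized(token):
    """Label the trailing "(...)" of a token, or return None if it has none.

    "exponent": "-" then digits, e.g. min(-1)
    "charge":   optional digits then "+" or "-", e.g. Ca(2+)  (assumed form)
    "other":    anything else, e.g. the subscripts in A(4) or K(m1)
    """
    m = re.search(r"\(([^()]*)\)$", token)
    if m is None:
        return None
    inner = m.group(1)
    if re.fullmatch(r"-\d+", inner):
        return "exponent"
    if re.fullmatch(r"\d*[+-]", inner):
        return "charge"
    return "other"


for token in ["min(-1)", "protein(-1)", "Ca(2+)", "A(4)", "K(m1)"]:
    print(token, classify_parenthesized(token))
```

Unsigned digits such as A(4) stay in the "other" bucket, because the same notation serves for subscripts.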