On Thursday, 2003-04-03, some members of the ITR/E project met for a "ChemFest", an all-day session to develop at least a draft spec for chemical named entity tagging, even one as simple as defining a category of "chemical entity", while recognizing that there will be fuzzy edges through which we may want to draw a clean line.
We referred to 40 MedLine abstracts supplied by Andy Schein, and later to 10 on protein-DNA complexes that Alex brought.
Here's what we decided, insofar as my notes and memory are correct. Questions and unresolved issues are in red. Additions and corrections are invited. -- Mark Mandel 2003-04-03.
[2003-06-18] Now that the CYP450 Inhibition Annotation is using this definition page, I'll be adding to it some decisions we have made, tagged [cyp].
A (massively!) non-exhaustive listing of some of the terms we decided
are and aren't in the class that we're calling "chemicals".
Ask the human annotators to judge based on context.
These are syntactic constructions that require decisions, provide
useful information, or are otherwise important. Underlining shows
tagged units.
Some of these constructions carry information that we want to be able to
tag during the entity tagging, which requires enhancing the tagging tool.
I've added that in the comments, but I may have missed some.
The whole phrase constitutes a name, analogous to
3. the oxides of copper
                 ------? (see below)
   --------------------
4. cuprous oxide
   -------------
But #3 and #4 are not coreferent
(General Principle 4).
Do we want the tagging software
to recognize a chemical name within a chemical name? Do we care?
(General Principles 1 and 5)
5. a .5M solution of chemical-name
                     -------------
A solution is not a chemical. The same should hold for dilutions
and so on. Can we get a(n incomplete) list of such terms?
6. COX-1 ____ but not COX-2 mRNA
   ----------         ----------
The word mRNA is omitted, but understood, after COX-1. Penn Treebanking
inserts an empty "right node raising" (RNR) node in that spot, linked to
the string mRNA that is in the text after COX-2. We would like to be able
to do that in the entity tagging. This requires an enhancement to the tool.
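One way the tool enhancement might represent this is sketched below. It is only an illustration, not a description of the actual tagging tool: the names Span, Piece and resolve are hypothetical, and character offsets are assumed as the way of pointing into the text. An entity is a list of pieces; an empty piece carries no text of its own but links to the span whose string it shares.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """A stretch of text given as character offsets [start, end)."""
    start: int
    end: int


@dataclass(frozen=True)
class Piece:
    """One component of a tagged entity (hypothetical structure).

    A piece with span=None is an empty node; shared_from then points
    at the span elsewhere in the text that supplies the missing string.
    """
    span: Optional[Span]
    shared_from: Optional[Span] = None


def resolve(text: str, pieces: list) -> str:
    """Spell out the full entity, filling empty nodes from their linked span."""
    words = []
    for piece in pieces:
        source = piece.span if piece.span is not None else piece.shared_from
        words.append(text[source.start:source.end])
    return " ".join(words)


text = "COX-1 but not COX-2 mRNA"
mrna = Span(text.index("mRNA"), len(text))
cox2_start = text.index("COX-2")

# COX-1 is followed by an empty node linked to the later string "mRNA".
cox1_entity = [Piece(Span(0, 5)), Piece(None, shared_from=mrna)]
cox2_entity = [Piece(Span(cox2_start, cox2_start + 5)), Piece(mrna)]

print(resolve(text, cox1_entity))
print(resolve(text, cox2_entity))
```

The link is direction-neutral, so the same structure would in principle cover an omission whose source string lies to its left.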
This is similar to #6, but the actual string in the text (CYP) is to the left of the
omission. I don't think Penn Treebanking has left node raising (LNR).
Requires enhancement to tool.
8. CYP1A ______ and ___2D enzymes
   ------------     -------------
#7 is actually a simplification. #8 is how we saw this in the text,
with LNR of CYP embedded within RNR of enzymes.
9. Major testosterone metabolites formed in vitro were 2beta-(CYP3A),
   6beta- (CYP3A, CYPIA), and 16beta- (CYP2B) hydroxytestosterone and
   androstenedione (CYP2B, CYP2C11).
The expression in blue not only has RNR; each of the left-hand
components also has a parenthesized trailer, an agent or condition that
brought about this particular event: CYP3A induced the formation of
2beta-hydroxytestosterone, CYP3A and CYPIA (probably a typo for CYP1A)
induced the formation of 6beta-hydroxytestosterone, etc. Not to be
confused with the abbreviation construction!
10. mGluR agonists L-glutamate and quisqualate
    -----          -----------     -----------
    --------------
mGluR is a "chemical", and so are agonists that act on it. The agonists are a class, to which L-glutamate and quisqualate belong. There is no coreference here (General Principle 4).
We would like to be able to tag this construction, conceptually
like this:
This requires an enhancement of the
tool.
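As a rough sketch of what such an enhancement could record (not the tool's actual data model; the Entity class and the membership_pairs helper are hypothetical names), class membership can be kept as its own link type, separate from coreference:

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A tagged "chemical" unit (hypothetical structure).

    members lists entities that belong to this one as a class;
    it says nothing about coreference, which would be a separate link.
    """
    text: str
    members: list = field(default_factory=list)


def membership_pairs(cls_entity: Entity) -> list:
    """Return (member, class) text pairs for one class entity."""
    return [(m.text, cls_entity.text) for m in cls_entity.members]


agonists = Entity(
    "mGluR agonists",
    members=[Entity("L-glutamate"), Entity("quisqualate")],
)

for member, cls_name in membership_pairs(agonists):
    print(f"{member} is a member of the class '{cls_name}'")
```

Keeping membership apart from coreference matches the point made above: L-glutamate and quisqualate belong to the class but do not corefer with it.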
11. EDTA assay
    ----
"Chemical" names are often used as modifiers. If the head of the
phrase -- the thing modified -- isn't a "chemical" in our sense, as in
#11, it doesn't get tagged, and neither does the phrase as a
whole.
12. aspirin-triggered 15-epi-lipoxin A(4)
    -------           -------------------
    -------------------------------------
We often see a chemical name followed by a hyphen and an adjective formed
from a verb: triggered, activated (or activating), and so on.
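The three tagged units in #12 nest inside one another. A minimal sketch of how a tool might check that a set of tags is properly nested (each pair either disjoint or one inside the other) follows; the function name nesting_ok and the use of character offsets are assumptions made for illustration.

```python
def nesting_ok(spans):
    """Return True if every pair of spans is disjoint or one contains the other.

    Spans are (start, end) character offsets with end exclusive.
    """
    for i, (a0, a1) in enumerate(spans):
        for b0, b1 in spans[i + 1:]:
            overlap = a0 < b1 and b0 < a1
            contains = (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1)
            if overlap and not contains:
                return False
    return True


phrase = "aspirin-triggered 15-epi-lipoxin A(4)"
# The modifier, the head name, and the whole phrase.
spans = [(0, 7), (18, 37), (0, len(phrase))]

for start, end in spans:
    print(phrase[start:end])
print(nesting_ok(spans))
```

Two tags that cross, such as (0, 10) and (5, 20), would fail the check.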
These can be pretty complex, and when rendered in ASCII they can look even
worse:
13. For l-methamphetamine, the apparent K(m1) and K(m2) were
        -----------------
    1.07 +/- 0.01 and 350 +/- 2.7 micro M, and V(max1) and V(max2)
    were 4.70 +/- 0.01 and 8.9 +/- 0.02 nmol min(-1) mg protein(-1),
                                                        -------
    respectively.
The only "chemical" references here are l-methamphetamine and protein. The (-1) after protein is an exponent: mg protein(-1) means "per
milligram of protein", as shown below. Not to be confused with ions.
8.9 ± 0.02 nmol / (min · mg protein)
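A pre-check for telling the two parenthesized forms apart might look like the sketch below. It rests on an assumption that this page does not itself state: that in ASCII renderings an exponent is written sign first, as in min(-1), while an ionic charge is written sign last, as in Ca(2+). The pattern names and the classify function are hypothetical.

```python
import re

# Assumed convention: exponent = sign then digits, e.g. min(-1)
EXPONENT = re.compile(r"\(([+-]\d+)\)")
# Assumed convention: ion charge = optional digits then sign, e.g. Ca(2+), Cl(-)
ION_CHARGE = re.compile(r"\((\d*[+-])\)")


def classify(token: str) -> str:
    """Label a token by the kind of parenthesized suffix it carries."""
    if EXPONENT.search(token):
        return "exponent"
    if ION_CHARGE.search(token):
        return "ion charge"
    return "plain"


for token in ["min(-1)", "protein(-1)", "Ca(2+)", "Cl(-)", "K(m1)", "A(4)"]:
    print(token, "->", classify(token))
```

Subscripts such as K(m1) and A(4) match neither pattern and are left as plain; a human annotator would still make the final call from context.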