POS Guidelines
(working on duplications)


This file was compiled by Mark Manocchio, Sarah Stippich, and Melissa Demian from the email archive and their meeting notes; many thanks to the three of them! I have added comments and continue to update it.

See also the Cheat Sheet for POS labels.

SEQUENCE OF TASKS {2004-06-20}

The needs of entity annotation sometimes conflict with the results of automatic tokenization, even with tokenizers and POS taggers trained on files that we have already annotated. Much of the original and added content of this document is an attempt to address that problem by showing you how to manually tokenize certain constructions and notations as needed for entity annotation. This matters because changing the tokenization often deletes or changes the POS tags, and the entity annotators don't have the training to fix the POS tagging afterward.

The original sequence of annotation tasks in this project was
Pretagging -> POS -> Entity -> Treebanking
and the pretaggers applied the POS tagger. In April 2004, we switched the order of POS and entity annotation, and running the POS tagger is now part of the POS annotators' task.

Now the entity annotators change the tokenization as necessary for themselves. Instead of learning to anticipate their tokenization needs, you have to work within the boundaries of the entity tags that they have already assigned. You can see these tags by selecting the appropriate Annotation within WordFreak: Oncology or CYP450.

In files that had been pretagged before we changed the sequence of tasks (and a few that were done since then), the entity annotators' tokenization changes could have affected the POS tagging, so there may be tokens with no POS tags, and even pieces of text that are not tokenized at all. In later files, the pretagging process includes tokenization but not POS tagging, and you will run the POS tagger as the first step. You will still have to look out for untokenized text, but any incorrect POS tags that you see will have been put there by the automatic tagger.

BASICS AND GENERALITIES

Running the tagger

If the file is not already POS-tagged,

  1. select the Bio POS tagger from WordFreak's Tagger Menu:
    Tagger | Set Tagger | Bio POS
  2. run the tagger: either
    Tagger | Tag
    or
    Ctrl-T
    or
    click the icon of a document with an orange lightning bolt

Nesting of units [2004-08-19]

Every (non-whitespace) character in the file has to be in a Token, every Token has to be in a Sentence or a Section, and every Sentence or Section has to be in a Paragraph. (If you think of the larger units as the parents of the next-smaller ones, there shouldn't be any orphans.) There can be no overlapping: the "child" must be completely contained within the "parent".

Every non-whitespace character must be in a Token. Within the non-linguistic sections (journal name, author name(s), research institution name, PMID number), correct tokenization is not an issue; that is, every non-whitespace character has to be in a token, but the boundaries between the tokens are not our concern. This is discussed in more detail in the Pretagging Guide under Biomedical Text and Otherwise and Non-biomedical material embedded in text.
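
The nesting rules can be checked mechanically. The following Python sketch is not part of WordFreak; it assumes that each unit is available as a pair of character offsets (start, end) with the end exclusive, and all of the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def contains(parent: Span, child: Span) -> bool:
    """True if child lies completely inside parent (no overlap across an edge)."""
    return parent[0] <= child[0] and child[1] <= parent[1]


def orphans(children: List[Span], parents: List[Span]) -> List[Span]:
    """Return the children that are not fully contained in any parent."""
    return [c for c in children if not any(contains(p, c) for p in parents)]


def untokenized_chars(text: str, tokens: List[Span]) -> List[int]:
    """Return offsets of non-whitespace characters that no token covers."""
    covered = set()
    for start, end in tokens:
        covered.update(range(start, end))
    return [i for i, ch in enumerate(text) if not ch.isspace() and i not in covered]


text = "P-450 2D is shown."
tokens = [(0, 5), (6, 8), (9, 11), (12, 17)]  # the final "." is left out on purpose
sentences = [(0, 18)]
print(orphans(tokens, sentences))       # tokens that sit outside every sentence
print(untokenized_chars(text, tokens))  # characters that sit outside every token
```

The same orphans() call can be reused one level up, with Sentences or Sections as the children and Paragraphs as the parents.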

Parts of speech and tokens [2004-08-20] *

Part of speech tags are applied to tokens. Within WordFreak and this project, every token is exactly one POS and every POS is exactly one token. We use separate programs to divide the text into tokens and to apply POS tags to those tokens before you review them. In WordFreak POS annotation, a token that has not received a POS tag or that has lost its POS tag is tagged as "token".

Entity boundaries: {2004-07-26} *

An entity may contain any number of tokens, but a token must not contain more than one entity. This rule is different from the other nesting rules because not all the text is contained in entities; indeed, most of it is not. WordFreak will not let you specify a token that crosses a sentence or section boundary, but it does not (yet) enforce this rule for entities, so you must enforce it on yourselves:

Don't create a token whose span crosses either edge of an entity. If a POS or token tag is marked as being assigned by an annotator rather than a tagger, as shown in the status line at the bottom of the main WordFreak window, look at the entity annotation (Oncology or CYP450 in the WordFreak menu) before changing it.

Chained entities [2004-09-15] *

A token can end up crossing an entity boundary in several kinds of situations, but chaining is the most likely cause, as in the following (nonsense) example:

ENTITY ANNOTATION:       *
      stereo-   and   isometric alleles
      ------             X-------------   (chain)  "stereometric alleles"
                      ----------------- (solid string) "isometric alleles" 
                         * 
                         *  The JJ span crosses the left-hand
                         *  boundary of the second link in the
                         *  chained entity string.
POS ANNOTATION:          *
      stereo-   and   isometric alleles
      --AFX-    -CC   ---X-JJ-- --NNS--
            ^            *
           HYPH          *
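
The problem shown in the diagram above can be detected by comparing spans. This is a minimal Python sketch, assuming tokens and entity links are given as (start, end) character offsets with the end exclusive; the function names are hypothetical, and the offsets are computed for the example phrase written with single spaces.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def crosses_entity_edge(token: Span, entity: Span) -> bool:
    """True if the token overlaps the entity but is not wholly inside it."""
    overlaps = token[0] < entity[1] and entity[0] < token[1]
    inside = entity[0] <= token[0] and token[1] <= entity[1]
    return overlaps and not inside


def offending_tokens(tokens: List[Span], entities: List[Span]) -> List[Span]:
    """Return every token that crosses an edge of at least one entity span."""
    return [t for t in tokens if any(crosses_entity_edge(t, e) for e in entities)]


text = "stereo- and isometric alleles"
# stereo / - / and / isometric / alleles
tokens = [(0, 6), (6, 7), (8, 11), (12, 21), (22, 29)]
# The two links of the chain: "stereo" and "metric alleles"
chain_links = [(0, 6), (15, 29)]

for start, end in offending_tokens(tokens, chain_links):
    print(text[start:end])
```

Against the solid-string entity "isometric alleles", span (12, 29), the same token is fine, because it lies wholly inside the entity.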

Parentheses in entities

An annotator posted this question:

[AUC0-24(P)]
I wanted to tag this as LRB NN RRB, where the NN in that sandwich is the AUC0-24(P). I looked at the CYP annotation just to make sure, and the entity tagger had tagged everything but the last square bracket as a single entity! Can that be right?

No, it can't.

The answer I gave for this particular case is:

Go into the entity annotation for CYP450, select each of those entities, and use the shrink-left button in the chooser window to pull the entity's left boundary off the left bracket. Then POS-annotate those strings of text.

In general, assume that parentheses, and brackets, are supposed to be balanced. Also assume that in general an entity can be contained in parentheses or brackets, but should not include matching parentheses or brackets at its beginning and end. (But be sure to check for chained entities.)

In the following examples, underlining shows what the entity annotator has marked as a single entity. Validly included material is shown in blue, and invalidly included parentheses are shown in red. Everything said here about parentheses also applies to [square brackets], which we have seen in the biomedical texts, and potentially also to {curly braces} and <angle brackets>, which we have not seen yet. [2004-08-19]

The following are reasonable entity forms:

  1. (abcd)         [entity: abcd, without the parentheses]
  2. abc(def)ghi
  3. (a)bcdef
  4. abc(d)
and even
  5. (ab)cde(fgh)
but not this, which includes the balanced parentheses that enclose the text:
  6. (abcd)         [entity: (abcd), with the parentheses]
or any of these, all with unbalanced parentheses (but look out for chains):
  7. (abcdef
  8. abcdef)
  9. (ab)cdef)
  10. (abcd(fg)

In example 1 the parentheses contain the entity string but are not part of it: that's fine. In example 2 they are entirely contained within the expression, so you wouldn't even pay attention to them. In examples 3 and 4 the beginning parenthesis is balanced by a matching parenthesis within the expression, so it has to be included to balance its mate.

Example 5 looks as though it includes enclosing parentheses, but each of those marginal parentheses is balanced by a matching parenthesis within the expression. Since the "(" at the left end of the string matches a ")" embedded in the string, they are both part of the entity, and similarly for the ")" at the right end of the string.

In example 6 the text is enclosed in parentheses, but the parentheses themselves are included in the entity expression. This is almost certainly incorrect.

Examples 7-10 have unbalanced parentheses and should be assumed to be incorrect as entities. In each of these cases, presumably, the unmatched marginal parenthesis belongs to the sentence, or possibly to a larger piece of fruit salad that this entity is embedded in. If you see something like these, correct it as described above and report it to the list, with the offending text and the file's source ID and PMID.

Parentheses in chained entities {2004-07-26} *

Before correcting an unbalanced parenthesis in an entity annotation, be sure to check for chains. When an entity annotator uses chaining on a split coordination, individual links may include unmatched parentheses that are balanced, and therefore legitimate, within the entity string as a whole. For example:

text:    1-(1-Benzofuran-2 and 3-yl)-2-mesitylethanone
chain 1: 1-(1-Benzofuran-2      -yl)-2-mesitylethanone
chain 2: 1-(1-Benzofuran-      3-yl)-2-mesitylethanone
Such a situation may be apparent from the text, as in this made-up example (although at least chain 1 is a real chemical name), or it may not show up until you look at the entity annotation view. (Since each of these links is itself fruit salad, you would tag each of them as NN, not AFX.)
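
The parenthesis rules in this section, including the exception for chains, can be sketched in Python as follows. This is only an illustration under stated assumptions: the entity is passed in as the exact string the entity annotator marked, a chain is passed in as the list of its links in order, and all function names are hypothetical. A result of False means "look at it and probably report it", not "certainly wrong".

```python
# Brackets covered by the sketch. Angle brackets are left out here as an
# assumption, because "<" and ">" also occur as comparison signs.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def is_balanced(text):
    """True if every bracket in text is matched and properly nested."""
    expected = []
    for ch in text:
        if ch in PAIRS:
            expected.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not expected or expected.pop() != ch:
                return False
    return not expected


def has_enclosing_pair(text):
    """True if the opening bracket at position 0 is closed by the last character.

    Only meaningful for text that is_balanced() accepts.
    """
    if len(text) < 2 or text[0] not in PAIRS:
        return False
    depth = 0
    for i, ch in enumerate(text):
        if ch in PAIRS:
            depth += 1
        elif ch in CLOSERS:
            depth -= 1
        if depth == 0:
            return i == len(text) - 1
    return False


def entity_looks_ok(text):
    """Balanced brackets, and no matching pair wrapped around the whole string."""
    return is_balanced(text) and not has_enclosing_pair(text)


def chain_looks_ok(links):
    """Judge a chained entity by the string its links spell out together."""
    return entity_looks_ok("".join(links))


for s in ["abc(def)ghi", "(ab)cde(fgh)", "(abcd)", "(abcd(fg)"]:
    print(s, entity_looks_ok(s))
print(chain_looks_ok(["1-(1-Benzofuran-2", "-yl)-2-mesitylethanone"]))
```

Checking the joined links rather than each link is the design choice that keeps a legitimately unbalanced link from being flagged.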

False break

Once in a while, a line or space break will show up in files. Sometimes a very long chemical name is longer than the line length of the text, and a line break comes into the middle of it -- equivalent to a space. You can usually recognize this, but DON'T try to correct it or to tag the parts as a single piece; you might guess wrong. Tag them both as NN, and add the comment "false break?" to both. For example,
a new nonsteroidal aromatase inhibitor, R 76 713 (6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H- benzotriazole)...
should be tagged as:
  R             /NN
  76            /CD
  713           /CD
  (             /-LRB-
  6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H-
                /NN [comment: false break?]
  benzotriazole /NN [comment: false break?]
  )             /-RRB-

{9/24}

False line breaks can also occur in regular text. These are like
the false breaks we've seen in very long chemical terms, but in
paragraphs instead: sometimes a blank line gets embedded in the middle
 
of a paragraph of continuous text, as I just did here on purpose. In
this case, include the empty line in the paragraph tag, and make a
"false break" comment in the Chooser window.

Ungrammatical expressions and typos

These can sometimes be due to translation or non-native English speakers writing the abstracts. Tag them as they appear in the text, then make a comment in the Chooser window. For example:

Vinyl chloride (VC) is a know animal and human carcinogen associated with liver angiosarcomas...
Tag "know" as know/VB (even though it should be "known/VBN") and make a comment.

ANNOTATION GUIDELINES

Abbreviations

[2004-07-29] Abbreviations and initials should be tagged as if they were spelled out. (Santorini, section 5.4, p. 32) For example:

Note that many of these can also occur without periods: "a mean +/- SD of 54.2 +/- 29.2 pmol/min/mg" (See also FW.) [2004-08-20]

Abbreviations with variable POS [2004-08-20]

Some biomedical abbreviations can stand for either the adjectival or the adverbial form of the word:

(These can also occur without periods: "When hCG (5 IU) was administered sc and the follicles were isolated 3 h later...".) To tag these correctly you must look at the context: In the last of these examples you can't use the usually reliable technique of reading it out loud to hear which sounds right, the adjective or the adverb, because the syntax is notational rather than English. The text in parentheses, though, modifies the substance being injected, not the act of injection -- compare "as chloride", "in saline solution" -- so we use JJ there.

Unit abbreviation attached to number [2004-08-31]

Sometimes in a measurement the symbol for the unit is attached directly to the number of units. Break these up, tagging the number as CD and the unit symbol as NN (or possibly NNS) (archive):

Singular vs. plural, and plurals with apostrophe ("'s")

All abbreviations for measurements, such as "mm" (millimeter(s)), "nM" (nanomolar), and "kDa" (kilodalton(s)), are singular (NN), except for the few like "ins." or "lbs." that are explicitly plural (NNS). {7/24}

We have seen a few instances of abbreviations pluralized with "'s":

Tag the entire string NNS. If we ever come across any multiword entity names pluralized in this way, break at white space as we have been doing all along. [2004-08-20]

Parentheses in abbreviations

[2004-07-29] Many terms referring to measured or calculated values are symbolized with abbreviations that include parentheses. Tag such symbols as NN; do not split them. Of course, if the entity annotators have tagged an entity within such a symbol you will have to split it.

The list includes, but is not limited to:

You will see these much more often in the CYP files than in the oncology files, because they are part of what the CYP researchers are looking for. Many of them also occur without the parentheses. The ones on the list above should not include tagged entities, but other similar symbols may do so, for example, a symbol referring to the concentration of a particular compound.

Affixes

Do not split off a prefix when it is connected directly, without a hyphen (or, conceivably, other punctuation). Examples: nonfunctioning, antimalarial, precancerous, etc.

AFX is for English affixes, such as non-, anti-, pro-, as well as components of (bio)chemical words, like azo-, azoxy-, and hydro-. Medical components, too. Here's a Q&A from the archives: {2004-04-10}

{changed 2004-04-10}

Biomedical Conventions *

Amino acid substitutions

There is a standard format for representing amino acid and nucleotide substitutions, consisting of either one letter or three letters, then one or more digits, then again either one or three letters (the same number as in the first part). The three-letter amino acid symbols are usually, but not always, cased as capital, small, small.

The oncology entity taggers will have split these up into letter sections and number sections. POS-tag them as NN CD NN.

Examples:

    "Ser726Pro": Ser/NN 726/CD Pro/NN
    "S276P":     S/NN   276/CD P/NN
    "G35A":      G/NN    35/CD A/NN
Amino acid symbols:
The twenty standard amino acids and some other amino acids and symbols [IUPAC]:
Full name                   3-letter code - 1-letter code (if any)
alanine                     Ala - A  
arginine                    Arg - R 
asparagine                  Asn - N  
aspartic acid               Asp - D  
cysteine                    Cys - C  
gamma-carboxyglutamate      Gla
glutamate or glutamine      Glx
glutamine                   Gln - Q 
glutamic acid               Glu - E 
glycine                     Gly - G  
histidine                   His - H  
homoserine                  Hse
hydroxylysine               Hyl
hydroxyproline              Hyp
isoleucine                  Ile - I 
leucine                     Leu - L 
lysine                      Lys - K 
methionine                  Met - M 
ornithine                   Orn  
phenylalanine               Phe - F 
proline                     Pro - P 
pyroglutamic acid           Pyr
sarcosine                   Sar
serine                      Ser - S 
threonine                   Thr - T  
tryptophan                  Trp - W
tyrosine                    Tyr - Y 
valine                      Val - V  

(unspecified amino acid)    Xaa
any                         ***
gap of indeterminate length ---
translation stop            TGA
translation stop            TAG
translation stop            TAA

Examples:

    Ser276Pro
    S276P
Nucleotide symbols:

Example:

    G35A
    A56T
{10/15}
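
The NN CD NN split for substitutions can be sketched in Python as follows. This is a minimal illustration, not the project's tokenizer: the function name is hypothetical, and the pattern only checks the shape described above (one or three letters, digits, then the same number of letters). It does not check the letters against the symbol tables, so other strings of the same shape would also match.

```python
import re

# One capital letter or three letters, then digits, then again one or three letters.
SUBSTITUTION = re.compile(r"^([A-Za-z]{3}|[A-Z])(\d+)([A-Za-z]{3}|[A-Z])$")


def tag_substitution(text):
    """Return [(piece, tag), ...] for a substitution, or None if the shape is wrong."""
    match = SUBSTITUTION.match(text)
    if match is None:
        return None
    before, position, after = match.groups()
    if len(before) != len(after):
        # Both symbols must use the same code: one-letter or three-letter.
        return None
    return [(before, "NN"), (position, "CD"), (after, "NN")]


for example in ["Ser726Pro", "S276P", "G35A", "A56T"]:
    print(example, tag_substitution(example))
```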
Strings of nucleotides

Often we see strings of the letters G, C, A, and T, representing strings of nucleotides on one strand of DNA (or RNA, with U instead of T): "AGTTCA". Don't split these up. The tokenizer will usually make one token of such a string; leave it that way and tag the whole string as a single noun. {11/20}

Gene variations

One of the many notations for genetic variation looks like this:

  del(8)(q22)
Oncology entity annotators should split out the non-punctuation components of these as entities. The correct POS annotation is nothing surprising: the number is a CD and the other things are NN:
del       /NN
(         /-LRB-
8         /CD
)         /-RRB-
(         /-LRB-
q22       /NN
)         /-RRB-
In particular, 'p' and 'q' refer to the two arms of a chromosome (short and long, respectively) and are usually followed by a region and band number, as in "q22"; the chromosome number is given separately, here as "8". {10/23}

Species names

A Linnaean name follows the format "Genus species". The genus part of the name is always capitalized in this format, the species part never, as in "Homo sapiens". They should both be tagged NNP, capitalized or not. For example:

Common names of species (rat, rabbit, etc.) are NN, with the exception of "human", which is JJ unless it's used as a noun (see JJ or NN). {10/8}

Mutation names

Since there are so many names for mutations and they manifest themselves as different parts of speech, they are to be tagged as they appear in the text. For example: "Wingless/Wnt" should be tagged

Wingless  /JJ 
/         /SYM
Wnt       /NN
{10/1*}

NN as default tag

Since we don't have the domain experience to know the meaning of everything that comes up in the biomedical files, we use NN as a default tag for many unfamiliar terms.

Hyphenated numbers (not referring to a range) should be tagged NN because they refer to a chemical name. For example:

  Mab 1-68-11
    Mab      /NN
    1-68-11  /NN
Similarly, digit/letter combinations should all be tagged NN. For example:
  P-450 2D
    P-450      /NN
    2D         /NN

  subclass 5a
    subclass   /NN
    5a         /NN

  karyotype 45,XY,-7
    karyotype  /NN
    45,XY,-7   /NN
{10/8, 9/16}

Base pair substitutions*

 G->C
 G-->C
 G----C
 G:C
 G--C
Each of the above should be tagged NN SYM NN. However, "G-C" should be tagged as
	G/NN  -/HYPH  C/NN
because "G-C" could mean either "G-->C" or "C-->G". {8/25}
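
The rule above can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name. It assumes the bases are written as the single letters A, C, G, T, or U, and that the connector is one of the forms listed above.

```python
import re

# Base, connector, base. Connectors: "->", "-->", a run of two or more
# hyphens, a colon, or a single hyphen.
BASE_PAIR = re.compile(r"^([ACGTU])(-+>|-{2,}|:|-)([ACGTU])$")


def tag_base_pair(text):
    """Return [(piece, tag), ...] for a base pair expression, or None."""
    match = BASE_PAIR.match(text)
    if match is None:
        return None
    first, connector, second = match.groups()
    # A lone hyphen does not show the direction of the change, so it is HYPH.
    tag = "HYPH" if connector == "-" else "SYM"
    return [(first, "NN"), (connector, tag), (second, "NN")]


for example in ["G->C", "G-->C", "G----C", "G:C", "G--C", "G-C"]:
    print(example, tag_base_pair(example))
```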

Collocations and Phrases

Collocations are specific arrangements of words or words habitually found together. They tend to function as a semantic unit (together they have a meaning in the same way a single word has a meaning). Jo Wright has compiled a list from Santorini's guide and other sources.

Complex Chemical Words ("fruit salad")


When you have a chemical term with internal punctuation, such as

   2,3,7,8-tetrachlorodibenzo-p-dioxin
treat the entire string as a single token.

Fruit salad and POS [2004-08-30]

"Fruit salad" is our nickname for chemical terms like this that include strings that aren't normally found in words, such as numbers, commas, hyphens, parentheses, and Greek letter names. Fruit salad is usually an NN, but other parts of speech can also be "fruit salad":

Fruit salad and tokenization

DO adjust the tokenization at the ends, if necessary. For example:

  TCDD (2,3,7,8-tetrachlorodibenzo-p-dioxin)
If the '(' is tokenized together with the '2', you have to separate them, because the digit is part of the chemical name but the parenthesis is not. Tag as
    TCDD                                /NN
    (                                   /-LRB-
    2,3,7,8-tetrachlorodibenzo-p-dioxin /NN
    )                                   /-RRB-
Here the parentheses clearly mark off a synonym for the abbreviation.

Non-substance fruit salad

Fruit salad is not restricted to the names of substances. For example, a biochemical term referring to a process can also be fruit salad (NN or NNS):

  N-oxidation
    N-oxidation   /NN
{7/31}

FW (foreign word)

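Latin words and phrases such as "in vivo", "in vitro", "corpora lutea", and "a priori", and Latin abbreviations such as "etc.", "e.g.", and "i.e.", are all FW. ("in" in these phrases is considered FW, not IN: in/FW vivo/FW).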

The abbreviations, when printed without a space, will be considered a single FW, following Santorini p. 32: "Abbreviations and initials should be tagged as if they were spelled out." (See also Abbreviations.)

	in/FW       vivo/FW
	in/FW       vitro/FW
	corpora/FW  lutea/FW
	a/FW        priori/FW
	etc.        /FW
	e.g.        /FW 
	eg          /FW 
	               but   e./FW  g./FW  (if with a space)
	i.e.        /FW  
	               but   i./FW  e./FW  (if with a space)

Hyphens

The ASCII character '-' is used in many different ways in the biomedical texts.
  1. An English hyphen, used to combine two or more words (a part of more-or-less normal punctuation):
    Tag each word in the hyphenated expression, and use HYPH for the hyphen.

    In hyphenates like "influenza-like" (an adjective meaning "resembling influenza"), tag "-like" as JJ, not AFX: "influenza/NN -/HYPH like/JJ" (Also mentioned at Guidelines for Specific Words and Terms.)

    See also AFFIXES.
     

  2. An English hyphen next to a nonword:
    Tag the whole expression as a single word, with whatever label is appropriate. See "follow-ups". Fortunately, these are rare. [2004-08-20]
     
  3. An English hyphen in a spelled-out number:
    Don't divide on these.

    See also spelled-out numbers {2004-06-08}
     

  4. An English hyphen used to avoid a confusing spelling:
    Don't divide on these either:

    This is a potentially unclear area. What about, say, "re-alignment"? Is the hyphen there just to avoid the bogus "real" you'd see if it were "realignment"? Don't sweat the small stuff, and this is small stuff: such cases will be rare. Do whatever seems right to you at first thought and move on. {2004-03-31}
     

  5. A fruit salad hyphen:
    Don't divide at the hyphen: leave it as part of the larger token.
     
  6. A negative sign in a scientific notation exponent:
    Don't split it out.
  7. A negative sign not in a scientific notation exponent:
    Tag as SYM.
  8. A subtraction sign:
    Tag as SYM.
  9. A range indicator:
    Tag as HYPH.
      200-230
          200     /CD
          -       /HYPH
          230     /CD
    
      10(-6)-10(-4)
          10(-6)  /CD
          -       /HYPH
          10(-4)  /CD
     
  10. An em dash -- that is, one or two hyphen characters, usually but not always with space on one or both sides, used to mark a break or pause in a sentence or to set off a parenthetical insertion, as in the sentence you are now reading -- is tagged ':' (colon). So you might see the start of this paragraph written as any of (boldface emphasis added):

    NOTE: There is another, less frequent use of "--" as a "second-order hyphen", which we treat differently. [This paragraph changed 11/5, 11/19, 2004-07-21.] {10/15}
       

  11. Sometimes "--" is used as a "second-order hyphen" to connect terms that themselves are hyphenated, or at least one of which is hyphenated:
    Our data provide further evidence for the role of the cytochrome P-450--cytochrome P-450 reductase system in the biotransformation of GTN to an activator (presumably nitric oxide) of guanylyl cyclase.
    The author seems to be linking "cytochrome P-450" with "cytochrome P-450 reductase", and using a double hyphen to show that "P-450-cytochrome" is not meant as a unit. This construction is helpful but nonstandard as far as I know, and may be hard to distinguish from the kind of use shown under Base pair substitutions, which we tag as SYM. So we'll punt the special case and tag such second-order hyphens as SYM.

    NOTE: To be distinguished from the em dash referred to just above!

{2004-03-31}
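
The range case (item 9 above) can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name. It only splits a string once you have already decided that the hyphen is a range indicator; it cannot tell a range from a hyphenated number that is part of a name. A parenthesized exponent, including its negative sign, stays inside the number (item 6).

```python
import re

# A number: digits, an optional decimal part, and an optional parenthesized
# exponent such as "(-6)", which is kept inside the CD token.
NUMBER = r"\d+(?:\.\d+)?(?:\(-?\d+\))?"
RANGE = re.compile(rf"^({NUMBER})-({NUMBER})$")


def tag_range(text):
    """Return CD HYPH CD for a numeric range, or None if the shape is wrong."""
    match = RANGE.match(text)
    if match is None:
        return None
    low, high = match.groups()
    return [(low, "CD"), ("-", "HYPH"), (high, "CD")]


for example in ["200-230", "10(-6)-10(-4)", "1-68-11"]:
    print(example, tag_range(example))
```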
 

Hyphenation Special Cases

  1. Two biochemical entities joined as one hyphenated term:
    Follow the entity annotators' lead:
    p21-ras
      p21 /NN            (Gene/gene-RNA)
      -   /HYPH
      ras /NN            (Gene/gene-RNA)
    
    PCR-SSTP
      PCR  /NN           (Substance)
      -    /HYPH
      SSTP /NN           (Substance)
    but not
    N-oxidation
      N          /NN     (not tagged as entity)
      -          /HYPH
      oxidation  /NN     (not tagged as entity)
    (see Fruit Salad) [2004-07-29]
    {10/8, 10/15}

     
  2. List with suspended hyphenations:
    Split and tag all components and hyphens:
    alpha- and beta-testosterone
      alpha       /SYM
      -           /HYPH
      and         /CC
      beta        /SYM
      -           /HYPH
      testosterone/NN
    
    Ha-, Ki-, and N-ras
      Ha          /NN
      -           /HYPH
      ,           /,
      Ki          /NN
      -           /HYPH
      ,           /,
      and         /CC
      N           /NN
      -           /HYPH
      ras         /NN
    {10/15}
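
The splitting step for suspended hyphenations can be sketched in Python as follows. This is a minimal illustration with hypothetical names. It assigns only the tags that follow mechanically (HYPH, the comma tag, and CC for "and" and "or", the latter being an assumption not shown in the examples above) and leaves every other tag as None for the annotator. It must not be applied to fruit salad, where the hyphens stay inside the token.

```python
import re

# A piece is a run of characters other than whitespace, comma, and hyphen,
# or a single hyphen, or a single comma.
PIECE = re.compile(r"[^\s,-]+|-|,")

# Tags that do not depend on the meaning of the word.
FIXED_TAGS = {"-": "HYPH", ",": ",", "and": "CC", "or": "CC"}


def split_suspended(text):
    """Split a suspended hyphenation; the tag is None where judgment is needed."""
    return [(piece, FIXED_TAGS.get(piece)) for piece in PIECE.findall(text)]


for example in ["alpha- and beta-testosterone", "Ha-, Ki-, and N-ras"]:
    print([piece for piece, _ in split_suspended(example)])
```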
     
 

The IN tag

Words like "though", "although", and "whereas" are subordinating conjunctions, /IN. {8/6}

"That", which is typically IN or WDT, can also be a DT. The word "that" is a /DT in the following sentence:

  Rates were an order of magnitude higher than that for control microsomes.

JJ or NN

We have decided to tag certain words as JJ unless we are forced to tag them otherwise (cases of forcing are pluralization, following a determiner with no [other] noun, etc.). For example:

All such words are JJ where possible. Where might it not be possible? For example: Colors are usually tagged JJ, but "blue-shift" presented a problematic case. It means "a shift towards blue/NN", as opposed to "a shift that is blue/JJ". So, "blue-shift" (and "red-shift") is tagged NN HYPH NN. {10/1}

Proper Names

JJ, NN, or NNP

We have very few proper nouns in these files, mostly names of individual persons or organizations ("as Jones has shown"). Proper names of persons, organizations, and places should be tagged as NNP.* Names of drugs or other substances that are capitalized (because they are trademarks) are just plain NN to us; so are gene names and symbols, abbreviations for diseases, and other pieces of biochemical jargon.

But: {2004-06-12}

* However, if the proper name happens to be an adjective, then it must be tagged JJ:

"EC" (=Enzyme Commission) before a dotted-quad enzyme identification code, as an organization name, should be tagged NNP:

  EC 2.6.1.1
    EC        /NNP
    2.6.1.1   /NN
{3/23/04}

Proper name plus "'s" or "'" [2004-08-09]

The possessive form of a proper name should be tagged according to Santorini, with the POS tag on the possessive suffix, even if it is being used as a noun:

disorders like depression and Alzheimer's  
    Alzheimer   /NNP
    's          /POS
the Curies' discoveries
    Curies      /NNPS
    '           /POS

JJ or VBN/VBG

When in doubt as to whether a word is a JJ or a participle (VBN or VBG), favor the participle. Follow this decision tree (thanks to Ann Bies):

  1. Does the word have a truly verbal form?
    No --> tag it JJ
    Yes --> keep going down the list
closing ol here while i figure out what's wrong 050306
B

When you have a chemical term with internal punctuation, such as

   2,3,7,8-tetrachlorodibenzo-p-dioxin
treat the entire string as a single token.

"Fruit salad" is our nickname for chemical terms like this that include strings that aren't normally found in words, such as numbers, commas, hyphens, parentheses, and Greek letter names. Fruit salad is usually an NN, but other kinds of speech can also be "fruit salad":

A biochemical term referring to a process can also be fruit salad (NN or NNS):

  N-oxidation
    N-oxidation   /NN

DO adjust the tokenization at the ends, if necessary. For example:

  TCDD (2,3,7,8-tetrachlorodibenzo-p-dioxin)
If the '(' is tokenized together with the '2', you have to separate them, because the digit is part of the chemical name but the parenthesis is not. Tag as
    TCDD                                /NN
    (                                   /-LBR-
    2,3,7,8-tetrachlorodibenzo-p-dioxin /NN
    )                                   /-RBR-
Here the parentheses clearly mark off a synonym for the abbreviation. {7/31}
2

FW (foreign word)

are all FW. ("in" in these phrases is considered FW, not IN: in/FW vivo/FW).

The abbreviations, when printed without a space, will be considered a single FW, following Santorini p. 32: "Abbreviations and initials should be tagged as if they were spelled out." (See also Abbreviations.)

	in/FW       vivo/FW
	in/FW       vitro/FW
	corpora/FW  lutea/FW
	a/FW        priori/FW
	etc.        /FW
	e.g.        /FW 
	eg          /FW 
	               but   e./FW  g./FW  (if with a space)
	i.e.        /FW  
	               but   i./FW  e./FW  (if with a space)

Hyphens

The ASCII character '-' is used in many different ways in the biomedical texts.
  1. An English hyphen, used to combine two or more words (a part of more-or-less normal punctuation):
    Tag each word in the hyphenated expression, and use HYPH for the hyphen.

    In hyphenates like "influenza-like" (an adjective meaning "resembling influenza"), tag "-like" as JJ, not AFX: "influenza/NN -/HYPH like/JJ" (Also mentioned at Guidelines for Specific Words and Terms.)

    See also AFFIXES.
     

  2. An English hyphen next to a nonword:
    Tag the whole expression as a single word, with whatever label is appropriate. See "follow-ups". Fortunately, these are rare. [2004-08-20]
     
  3. An English hyphen in a spelled-out number:
    Don't divide on these.

    See also spelled-out numbers {2004-06-08}
     

  4. An English hyphen used to avoid a confusing spelling:
    Don't divide on these either:

    This is a potentially unclear area. What about, say, "re-alignment"? Is the hyphen there just to avoid the bogus "real" you'd see if it were "realignment"? Don't sweat the small stuff, and this is small stuff: such cases will be rare. Do whatever seems right to you at first thought and move on. {2004-03-31}
     

  5. A fruit salad hyphen:
    Don't divide at the hyphen: leave it as part of the larger token.
     
  6. A negative sign in a scientific notation exponent:
    Don't split it out.
  7. A negative sign not in a scientific notation exponent:
    Tag as SYM.
  8. A subtraction sign:
    Tag as SYM.
  9. A range indicator:
    Tag as HYPH.
      200-230
          200     /CD
          -       /HYPH
          230     /CD
    
      10(-6)-10(-4)
          10(-6)  /CD
          -       /HYPH
          10(-4)  /CD
     
  10. An em dash -- that is, one or two hyphen characters, usually but not always with space on one or both sides, used to mark a break or pause in a sentence or to set off a parenthetical insertion, as in the sentence you are now reading -- is tagged ':' (colon). So you might see the start of this paragraph written as any of (boldface emphasis added):

    NOTE: There is another, less frequent use of "--" as a "second-order hyphen", which we treat differently. [This paragraph changed 11/5, 11/19, 2004-07-21.] {10/15}
       

  11. Sometimes "--" is used as a "second-order hyphen" to connect terms that themselves are hyphenated, or at least one of which is hyphenated:
    Our data provide further evidence for the role of the cytochrome P-450--cytochrome P-450 reductase system in the biotransformation of GTN to an activator (presumably nitric oxide) of guanylyl cyclase.
    The author seems to be linking "cytochrome P-450" with "cytochrome P-450 reductase", and using a double hyphen to show that "P-450-cytochrome" is not meant as a unit. This construction is helpful but nonstandard as far as I know, and may be hard to distinguish from the kind of use shown under Base pair changes, which we tag as SYM. So we'll punt the special case and tag such second-order hyphens as SYM.

    NOTE: To be distinguished from the em dash referred to just above!

{2004-03-31}
 

Hyphenation Special Cases

  1. Two biochemical entities joined as one hyphenated term:
    Follow the entity annotators' lead:
    p21-ras
      p21 /NN            (Gene/gene-RNA)
      -   /HYPH
      ras /NN            (Gene/gene-RNA)
    
    PCR-SSTP
      PCR  /NN           (Substance)
      -    /HYPH
      SSTP /NN           (Substance)
    but not
    N-oxidation
      N          /NN     (not tagged as entity)
      -          /HYPH
      oxidation  /NN     (not tagged as entity)
    (see Fruit Salad) [2004-07-29]
    {10/8, 10/15}

     
  2. List with suspended hyphenations:
    Split and tag all components and hyphens:
    alpha- and beta-testosterone
      alpha       /SYM
      -           /HYPH
      and         /CC
      beta        /SYM
      -           /HYPH
      testosterone /NN
    
    Ha-, Ki-, and N-ras
      Ha          /NN
      -           /HYPH
      ,           /,
      Ki          /NN
      -           /HYPH
      ,           /,
      and         /CC
      N           /NN
      -           /HYPH
      ras         /NN
    {10/15}
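The splitting step in the examples above can be sketched in Python. This is a minimal illustration and not part of our tool chain: the name split_hyphenated is hypothetical, and the function only separates words, hyphens, and commas and tags the punctuation. The tags on the words themselves (SYM, NN, CC) still have to come from the annotator. It would also wrongly split numeric strings such as "2,3,4", which we keep whole.

```python
import re

# Hypothetical helper table: tags for the two punctuation tokens we split off.
PUNCT_TAGS = {"-": "HYPH", ",": ","}

def split_hyphenated(text):
    """Split text into words, hyphens, and commas, dropping whitespace.

    Returns (token, tag) pairs; tag is None for words, which the
    annotator must still tag by hand.
    """
    tokens = re.findall(r"[^\s,-]+|-|,", text)
    return [(tok, PUNCT_TAGS.get(tok)) for tok in tokens]

for token, tag in split_hyphenated("Ha-, Ki-, and N-ras"):
    print(f"{token:<6}/{tag or '?'}")
```

Every hyphen comes out as its own HYPH token, including the suspended ones before the commas, which is the point of the guideline.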
     
 

The IN tag

Words like "though", "although", and "whereas" are subordinating conjunctions, /IN. {8/6}

"That", which is typically IN or WDT, can also be a DT. The word "that" is a /DT in the following sentence:

  Rates were an order of magnitude higher than that for control microsomes.

JJ or NN

We have decided to tag certain words as JJ unless we are forced to tag them otherwise (cases of forcing are pluralization, following a determiner with no [other] noun, etc.). For example:

All such words are JJ where possible. Where might it not be possible? For example: Colors are usually tagged JJ, but "blue-shift" presented a problematic case. It means "a shift towards blue/NN", as opposed to "a shift that is blue/JJ". So, "blue-shift" (and "red-shift") is tagged NN HYPH NN. {10/1}

Proper Names

JJ, NN, or NNP

We have very few proper nouns in these files, mostly names of individual persons or organizations ("as Jones has shown"). Proper names of persons, organizations, and places should be tagged as NNP.* Names of drugs or other substances that are capitalized (because they are trademarks) are just plain NN to us; so are gene names and symbols, abbreviations for diseases, and other pieces of biochemical jargon.

But: {2004-06-12}

* However, if the proper name happens to be an adjective, then it must be tagged JJ:

"EC" (=Enzyme Commission) before a dotted-quad enzyme identification code, as an organization name, should be tagged NNP:

  EC 2.6.1.1
    EC        /NNP
    2.6.1.1   /NN
{3/23/04}

Proper name plus 's or ' [2004-08-09]

The possessive form of a proper name should be tagged according to Santorini, with the POS tag on the possessive suffix, even if it is being used as a noun:

disorders like depression and Alzheimer's  
    Alzheimer   /NNP
    's          /POS
the Curies' discoveries
    Curies      /NNPS
    '           /POS

JJ or VBN/VBG

When in doubt as to whether a word is a JJ or a participle (VBN or VBG), favor the participle. Follow this decision tree (thanks to Ann Bies):

  1. Does the word have a truly verbal form?
    No --> tag it JJ
    Yes --> keep going down the list

  2. Is the verbal meaning in play in this sentence?
    No --> tag it JJ
    Yes --> tag it VBN or keep going

  3. Are there other contextual factors IN THE SAME SENTENCE that force the adjectival or verbal meaning?
    No --> go with the answer from 2.
    Yes --> tag according to those factors
    a. "a very surprised look":
      the presence of a degree adverb forces JJ
    b. "she remained surprised":
      become, feel, look, remain, sound, seem, appear (in the sense of "seem") strongly push for JJ [2004-07-22]
    c. "a cancer - associated gene":
      the presence of the verbal complement (= "a gene associated with cancer") forces VBN
    d. "he remains guided by these principles":
      the presence of the "by"-phrase logical subject forces VBN

I know that this is not quite what is in the POS tagging manual, but it is a version that allows for the VBN/VBG bias. It's also essentially what the treebankers are using at the moment (though we can alter that if need be).

I would expect 1. and 2. to take care of most of the potential ambiguity that we see. I expect most of the weirdness to be in 3., but I think (perhaps too optimistically) that the really ambiguous stuff is in 3b, a construction that doesn't seem to come up all that often in these texts. I half want to say, just stop at 2. and forget about the stuff in 3., by the way. {11/21}
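
For those who find it easier to read as code, steps 1 to 3 of the decision tree can be written out as below. This is only a sketch: the function and argument names are made up for illustration, and the three inputs are the annotator's own judgments, which the code does not try to compute.

```python
from typing import Optional

def jj_or_participle(has_verbal_form: bool,
                     verbal_meaning_in_play: bool,
                     context_forces: Optional[str] = None,
                     participle_tag: str = "VBN") -> str:
    """Return "JJ" or a participle tag, following steps 1-3 of the tree.

    context_forces is None when nothing else in the sentence decides the
    matter; otherwise it is the tag that the context forces ("JJ", "VBN",
    or "VBG"). participle_tag is the participle to fall back on.
    """
    # Step 1: no truly verbal form -> JJ.
    if not has_verbal_form:
        return "JJ"
    # Step 2: verbal meaning not in play in this sentence -> JJ.
    if not verbal_meaning_in_play:
        return "JJ"
    # Step 3: contextual factors in the same sentence override step 2.
    if context_forces is not None:
        return context_forces
    # Otherwise go with the answer from step 2: the participle.
    return participle_tag

# "a very surprised look": the degree adverb forces JJ.
print(jj_or_participle(True, True, context_forces="JJ"))
# "he remains guided by these principles": the "by"-phrase forces VBN.
print(jj_or_participle(True, True, context_forces="VBN"))
```

Note that the default outcome when steps 1 and 2 are both "yes" is the participle, which is the bias described above.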



NN or NNS

  1. Parenthesized pluralizers in words like "form(s)":
    If the context gives you evidence for the number (singular or plural) that the author intended, use it. Otherwise use singular (NN).
     
  2. "Data" and similar mass/count words:
    The standard is to go by number agreement when it shows. Tag as NN unless context forces you to use NNS. See also abbreviations.
{2004-07-21}

Numbers

Punctuation in numbers

  1. 3.14159
    186,000

    Don't split on decimal points or numerical commas: each of these is a single CD.
     
  2. 10(-9)M
    The number here is expressed in scientific notation: 10^-9 M, "ten to the minus ninth mole [or molar]". Tag as:
    	10(-9) / CD
    	M      / NN
  3. 200-230
    This is almost certainly a range of numbers, with the "-" used in the same way as in ordinary text ("I have a meeting, 1:00-2:00 every Thursday.") Tag as:
    	200 / CD
    	-   / HYPH
    	230 / CD
  4. 10(-6)-10(-4)
    This is a range whose limits are given in scientific notation. Combining the previous two guidelines, tag as:
    	10(-6) / CD
    	-      / HYPH
    	10(-4) / CD
    Theoretically it COULD also be a subtraction -- 1/1,000,000 - 1/10,000 -- but we see little if any arithmetic in these abstracts. If in doubt, look at the context. {7/24}
     
  5. 1.6-fold
    A CD followed by "-fold" is tagged depending on how it is used in the sentence:
    we found a 1.6-fold increase (adjectival)
    	1.6  /CD
    	-    /HYPH
    	fold /JJ
    
    reactivity increased 1.6-fold (adverbial)
    	1.6  /CD
    	-    /HYPH
    	fold /RB

    If there is no hyphen, as in "fourfold", then do not split it up, but the same POS guidelines apply: [2004-07-22]

    we found a fourfold increase (adjectival)
    	fourfold /JJ
    
    reactivity increased fourfold (adverbial)
    	fourfold /RB
    {9/3}
     
  6. +.05
    -3
    48+
    61-

    When a number which is not an exponent in scientific notation is preceded or followed by a plus or minus sign, tag the sign separately as a SYM:
    	-3
    	  -    /SYM
    	  3    /CD
    
    	48+
    	  48   /CD
    	  +    /SYM
    {9/4}
     
  7. 2,3,4- 8,9- and 14,15-hydroxylation
    This refers to three different kinds of hydroxylation, each one categorized by one of the sets of numbers. You would expect to find a list like this divided by commas, at least after the first conjunct (2,3,4-), but it isn't; maybe the authors thought that adding commas would make the situation even more confusing.

    Tag each string of numbers, with the commas between them, as a single NN, and give the hyphens their own HYPH tags:

    2,3,4-  8,9-  and  14,15-hydroxylation
    	2,3,4                /NN
    	-                    /HYPH
    	8,9                  /NN
    	-                    /HYPH
    	and                  /CC
    	14,15                /NN
    	-                    /HYPH
    	hydroxylation        /NN 

    This is comparable to what we would do in a similar situation not involving fruit salad:

    left- and right-handed spirals
    	left          /JJ
    	-             /HYPH
    	and           /CC
    	right         /JJ
    	-             /HYPH
    	handed        /JJ
    	spirals       /NNS
    
    English-, Spanish-, and Punjabi-speakers
    	English       /NNP
    	-             /HYPH
    	,             /,
    	Spanish       /NNP
    	-             /HYPH
    	,             /,
    	and           /CC
    	Punjabi       /NNP
    	-             /HYPH
    	speakers      /NNS
    (The name of a language is a proper noun: Santorini, p.13.) {2004-06-09}
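
Rules 1 through 6 above are regular enough to sketch as pattern matching. The Python below is an illustration under stated assumptions, not a description of our tokenizer: the names NUM and tag_numeric are hypothetical, the choice between JJ and RB for "fold" must be supplied by the annotator, and the comma-joined position numbers of rule 7 (tagged NN) are deliberately not handled. A unit stuck to the number, as in "10(-9)M", is also left to the annotator.

```python
import re

# One number: digits with optional commas and decimal point (or a bare
# decimal like ".05"), optionally followed by a parenthesized exponent.
NUM = r"(?:\d[\d,]*(?:\.\d+)?|\.\d+)(?:\([-+]?\d+\))?"

def tag_numeric(text, fold_tag="JJ"):
    """Return (token, tag) pairs for a numeric string, or None if no rule fits."""
    # Rules 3 and 4: a range, possibly in scientific notation.
    m = re.fullmatch(rf"({NUM})-({NUM})", text)
    if m:
        return [(m.group(1), "CD"), ("-", "HYPH"), (m.group(2), "CD")]
    # Rule 5: number plus "-fold"; fold_tag is JJ or RB depending on use.
    m = re.fullmatch(rf"({NUM})-fold", text)
    if m:
        return [(m.group(1), "CD"), ("-", "HYPH"), ("fold", fold_tag)]
    # Rule 6: a leading or trailing plus or minus sign is a separate SYM.
    m = re.fullmatch(rf"([-+])({NUM})", text)
    if m:
        return [(m.group(1), "SYM"), (m.group(2), "CD")]
    m = re.fullmatch(rf"({NUM})([-+])", text)
    if m:
        return [(m.group(1), "CD"), (m.group(2), "SYM")]
    # Rules 1 and 2: a plain or scientific-notation number is a single CD.
    if re.fullmatch(NUM, text):
        return [(text, "CD")]
    return None

for example in ["186,000", "10(-6)-10(-4)", "1.6-fold", "-3", "48+"]:
    print(example, tag_numeric(example))
```

The range pattern is tried before the sign patterns so that the hyphen in "200-230" is read as HYPH rather than as a minus sign, matching the preference stated under rule 4.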

Roman numerals

Tag Roman numerals as CD. The Santorini guide takes notice of Roman numerals only in proper names, such as "World War I" or "Pope John XXIII". Biomedical texts have very few proper names, but a number of expressions like "Hepa I" and "subgroup II". These numerals would be NN according to Santorini, but we decided earlier to treat them as CD. {10/8}

Spelled-out numbers

Standard English style dictates that a number at the beginning of a sentence should be spelled out: "Fifteen samples were analyzed" versus "We analyzed 15 samples". Scientific writing does not always observe this rule, but we label a cardinal number as a CD wherever it is in the sentence, whether it is spelled out or written in digits. Even if a spelled-out number is hyphenated or consists of more than one word and incorporates spaces, tag it as a single CD: "forty-five", "one hundred twenty". See also under hyphenation. {2004-06-08}

Number words

Tag "a dozen" as a single token, CD, rather than as DT NN. Do the same with "two dozen" or "seven dozen" or "eight-and-a-half dozen" if you ever see them, and "a hundred" or "two hundred".

But not these:

We are edging into slippery territory here. "Two" and "fourteen" are definitely CDs, and "some" is definitely not a CD, but there is a grey area between them, and we want to establish our boundaries.

There are two kinds of distinction to be made here. One is between exact and fuzzy numbers. "A dozen" means exactly 12, even though the author may be using it approximately. "A hundred", spelled out, suggests an approximation, whereas "100" suggests a precise number, especially in scientific writing, and "one hundred" feels intermediate. Nevertheless, "a hundred" means literally the same as "100", "a pair" refers to 2 items, and "four score" or "fourscore" = 80.

The second issue is syntactic. Our canonical type of CD, an integer written out in digits like "42" or "5280" or "1", can be followed immediately by the noun it quantifies:

So can other forms of numbers that we see in this literature:

The same is true for numbers that are spelled out rather than written in digits:

six years
	six           /CD
	years         /NNS

fifty ways 
	fifty         /CD
	ways          /NNS

(lower) forty-eight states
	forty-eight   /CD
	states        /NNS

So far, so good, but nothing new. Now things begin to get interesting.

"A hundred" and "a dozen" have the same syntax as these if we treat the article as part of the number:

a hundred crates
	a hundred  /CD
	crates     /NNS

a dozen years
	a dozen    /CD
	years      /NNS
We are making a jump by treating "a" as part of the number in these cases rather than as a determiner, but, as we're about to see, syntax gives us a good reason to jump no further. (As a matter of fact, this jump is historically a jump backward. The determiner "a" is a phonologically and semantically weakened version of [the old English form of] "one". You can see a relic of that in the form "an", which is now used only before vowel sounds but was formerly used wherever we now say "a".)

Now the big question. What about "dozens" and "a couple" and "a pair"?:

Their syntax is different from that of canonical, core CDs: unlike spelled-out "forty-two" or "a hundred", they must be followed by "of". So even though "a pair" and (usually) "a couple" refer to precisely 2, we will exclude them from the class of CDs because of their syntax:

	     dozens/NNS of/IN reasons/NNS
	a/DT couple/NN  of/IN friends/NNS
	a/DT   pair/NN  of/IN peaches/NNS

So, to sum up, here are the criteria for words that might or might not be tagged as CD:

  1. Does the expression, possibly including adjacent words ("a dozen", "five dozen") mean an exact number? If not, don't use CD.
     
  2. If it does mean an exact number: Is it, or can it be, followed immediately by the noun it quantifies? If not (the only cases I can think of require "of"), it's not a CD.
     
  3. If it means an exact number and can be followed immediately by the noun it quantifies, tag it as CD.

NOTE: "Couple" is sometimes used without "of", and is sometimes used to mean something like "several". The latter may be hard to determine, and the former is hardly likely to show up in our texts. So even though "me and a couple friends" might or might not satisfy the above criteria, always tag "couple" as NN. {2004-07-22}
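
The three criteria amount to requiring two things at once, which a short Python sketch makes explicit. The function name is_cd and the table JUDGMENTS are hypothetical; the yes/no values in the table are the judgments argued for in the discussion above, entered by hand, and the code does not derive them.

```python
def is_cd(means_exact_number: bool, can_directly_precede_noun: bool) -> bool:
    """An expression is CD only if it passes both criteria 1 and 2."""
    return means_exact_number and can_directly_precede_noun

# (means an exact number?, can be followed immediately by its noun?)
JUDGMENTS = {
    "forty-eight": (True, True),
    "a dozen":     (True, True),
    "a hundred":   (True, True),
    "dozens":      (False, False),  # fuzzy, and needs "of"
    "a pair":      (True, False),   # exactly 2, but needs "of"
    "a couple":    (True, False),   # always NN for us; see the NOTE
}

for expression, (exact, direct) in JUDGMENTS.items():
    verdict = "CD" if is_cd(exact, direct) else "not CD"
    print(f"{expression:<12} {verdict}")
```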

"Half"

Santorini (p. 25) recognizes "half" as a JJ, NN, or PDT:

  a half /JJ   point
  half   /NN   of the time
  half   /PDT   his time
  half   /PDT the time
and "one-half" as JJ or RB:
  one-half /JJ  cup
           cf. a full /JJ   cup
  one-half /RB the amount 
           cf. twice  /RB   the amount
               double /RB   the amount

The same logic should apply in expressions like "half as susceptible":

  half        /RB
  as          /IN  (p. 23)
  susceptible /JJ
{2/24/04}

Tag "half-life" as

    half   /JJ
    -      /HYPH
    life   /NN
[2004-07-29]

SYM

Greek letters ("alpha", "beta", "gamma", etc.)

Always tag Greek letters as SYM. (Note that Latin letters such as "X", "G", etc. are tagged as NN.) {2004-06-09}

    THE GREEK ALPHABET

    alpha   iota    rho
    beta    kappa   sigma
    gamma   lambda  tau
    delta   mu      upsilon
    epsilon nu      phi
    zeta    xi      chi
    eta     omicron psi
    theta   pi      omega
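
If you want to check a file for spelled-out Greek letters mechanically, the rule reduces to a table lookup. The Python below is a sketch with hypothetical names (GREEK_LETTERS, tag_letter_name). It assumes the letter names appear in lower case, as in the table above, so that capitalized abbreviations that happen to look like a letter name are left alone, and it returns None for anything that is neither a Greek letter name nor a single Latin letter.

```python
# The 24 letter names from the table above.
GREEK_LETTERS = {
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi",
    "rho", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
}

def tag_letter_name(token):
    """Return "SYM" for a Greek letter name, "NN" for a single Latin letter."""
    if token in GREEK_LETTERS:
        return "SYM"
    if len(token) == 1 and token.isascii() and token.isalpha():
        return "NN"
    return None  # not covered by this guideline

for token in ["beta", "X", "testosterone"]:
    print(token, tag_letter_name(token))
```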

Mangled arrows with "greater than" or "&gt;"

  "The mutations were 3 GGT- greater than GAT transitions in codon 12..."
This is supposed to be:
  "The mutations were 3 GGT->GAT transitions in codon 12..."
"- greater than" is intended to be "->"; it's been mangled by some series of operations between the text supplied by the publisher and the version available on Medline. We see various bizarre representations of the "<" and ">" symbols, sometimes alone but often in arrow combinations. They should all be tagged SYM.

All of these mean "gly->val", as do other forms with more or less spacing:

	gly - greater than val
	gly - &gt; val
	gly - > val
	gly -> val
	gly - gt val
	gly - &#62; val     {2004-06-18}
In each case the entire string between "gly" and "val" -- the hyphen and the misrepresentation of ">" -- should be made into a single token and tagged SYM. For example:
	gly                /NN 
	- greater than     /SYM 
	val                /NN
{8/31}
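
The merging of a mangled arrow into one SYM token can be sketched with a regular expression. This Python is illustrative only: the names ARROW and merge_arrows are hypothetical, the pattern covers just the six variants listed above, and standalone "<" or ">" symbols are not handled. Words outside the arrow get a tag of None because the annotator still has to tag them.

```python
import re

# A hyphen, optional space, then one of the observed spellings of ">".
ARROW = re.compile(r"-\s*(?:greater than|&gt;|&#62;|gt\b|>)")

def merge_arrows(text):
    """Tokenize on whitespace, keeping each mangled arrow as one SYM token."""
    tokens = []
    pos = 0
    for match in ARROW.finditer(text):
        # Ordinary words before the arrow, split on whitespace.
        tokens.extend((word, None) for word in text[pos:match.start()].split())
        # The whole arrow string, internal spaces included, is one token.
        tokens.append((match.group(0), "SYM"))
        pos = match.end()
    tokens.extend((word, None) for word in text[pos:].split())
    return tokens

for variant in ["gly - greater than val", "gly - &gt; val", "gly -> val"]:
    print(merge_arrows(variant))
```

An ordinary hyphen that is not followed by one of these spellings, as in "GGT-GAT", does not match and is left for the usual hyphen rules.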

Specific symbols [2004-08-20]

+/-
The "plus or minus" sign, "±", generally appears in our documents as "+/-". Tag this string as a single SYM.
(R)
This can represent the registered trademark symbol "®" when following an NNP. Tag it as SYM in this use. E.g.:
"Non volatile compounds of other beverages such as white wine, grape juice or Xtra Old Cognac(R) displayed lower inhibitory effect..." (PMID 11701226)

RB or RP

Santorini's advice for distinguishing adverbs from particles (p.21) doesn't always help with a word like "up-regulate" or "down-regulation". While she says "it is important to realize that the idiomaticity of a particular collocation is not a diagnostic for the distinction", in this case it helps to note that the meaning of the word isn't idiomatic but compositional, that is, predictable from the meanings of the parts. Unlike common particle verbs like "give up" or "turn in", the common meanings of "up" and "down" as 'to a higher value' and 'to a lower value' combine with the meaning of "regulate" to produce the meanings of these words. Even when the verb has been made into a noun, as in "down-regulation", "down" should be RB.

up-regulate
	up         /RB
	-          /HYPH
	regulate   /VB

down-regulation
	down       /RB
	-          /HYPH
	regulation /NN 
{2004-03-31}

Guidelines for Specific Words and Terms

chi square
Tag "square" as JJ here: chi/SYM square/JJ (archive 2004-01-29)
the following
Where "the following" is used as a noun phrase referring to a list ("we have demonstrated the following:"), tag "following" as JJ. (Speaking for Treebanking, Ann Bies agrees.) (archive 04-03-16)
follow-ups
Our general rule of tagging each word and hyphen in non-technical hyphenated expressions separately works well enough for most of the English vocabulary ("state-of-the-art") and English constructions that incorporate technical terms ("glycogen-increasing"). But there's always the occasional troublemaker. The verb "follow up", as in "I'm going to follow up on that problem", is VB RP (Santorini p. 21), and we tag the noun or adjective accordingly, VB HYPH RP ("Wait for the follow-up", "Here's the follow-up study"). But in the plural noun, "ups" is not a word at all, so how should we tag it? We solve this dilemma by ducking out of it and tagging the entire word as NN. Deal similarly with any other English hyphenate that includes a component that is not a word.
half-life
Tag "half" as JJ here: half/JJ -/HYPH life/NN (See "half".) [2004-08-09]
-like
JJ, not AFX. See hyphens.
Soret peak
/NNP /NN [2004-07-29]
-terminal
JJ, not NN. "C-terminal" means "relating to the C terminus". See hyphens.

CHANGE NOTES

(Generally excluding additions dated with change tags.)

(The following two changes were uploaded at the same time as the overall reformatting, on 2004-07-22:)  

'/' tagged as SYM, not HYPH: 2004-07-21. In the earlier text, the tag on '/' was HYPH. (That goes back to the meeting on October 1, 2003, with the old tag of '#' for the hyphen.) SYM is clearly correct, and that is how we have been tagging it: the entire annotation database contains only one case of '/' tagged as HYPH and two with the '#' tag.

Base pair substitutions: 2004-07-21. This section was labeled "transversions", a term which refers to only a subset of the cases in which this notation is used.

2004-07-22. Reformatted to facilitate updating.

New examples in Entity boundaries and new subsection there about parentheses in chained entity names.

2004-07-29. Moved suffixal "-like" from Guidelines for Specific Words and Terms to Hyphens, with cross-reference.

2004-07-29. Moved discussion of prefixal "+" and "-" on a number from SYM to Punctuation in Numbers.

2004-08-09. Reformatted Biomedical Conventions. Added boldface on "how to do this" tags.

2004-08-20. Yesterday and today: clarified the definitions of a number of entity types that we have been taking for granted and amplified the discussion of the relationship between tokens and entity mentions.

2004-09-15. Added "chained entities" section based on meeting notes from September 14, 2004.



2005-05-17