This file was compiled by Mark Manocchio, Sarah Stippich, and Melissa Demian from the email archive and their meeting notes; many thanks to the three of them! I have added comments and continue to update it.
See also the Cheat Sheet for POS labels.
The needs of entity annotation sometimes conflict with the results of automatic tokenization, even with tokenizers and POS taggers trained on files that we have already annotated. Much of the original and added content of this document is an attempt to address that problem by showing you how to manually tokenize certain constructions and notations as needed for entity annotation, because changing the tokenization often deletes or changes the POS tags, and the entity annotators don't have the training to fix the POS tagging after that.
The original sequence of annotation tasks in this project was
Pretagging -> POS -> Entity -> Treebanking
and the pretaggers applied the POS tagger. In April 2004, we switched the order of POS and entity annotation, and running the POS tagger is now part of the POS annotators' task.
Now the entity annotators change the tokenization as necessary for themselves. And now, instead of learning to anticipate their tokenization needs, you have to work within the boundaries of the entity tags that they have already assigned. You can see these tags by selecting the appropriate Annotation within WordFreak: Oncology or CYP450.
In files that had been pretagged before we changed the sequence of tasks (and a few that were done since then), such changes could have affected the POS tagging: there may be tokens with no POS tags, and even pieces of text that are not tokenized at all. In later files, the pretagging process includes tokenization but not POS tagging, and you will run the POS tagger as the first step. You will still have to look out for untokenized text, but any incorrect POS tags that you see will have been put there by the automatic tagger.
If the file is not already POS-tagged,
Every (non-whitespace) character in the file has to be in a Token, every Token has to be in a Sentence or a Section, and every Sentence or Section has to be in a Paragraph. (If you think of the larger units as the parents of the next-smaller ones, there shouldn't be any orphans.) There can be no overlapping: the "child" must be completely contained within the "parent".
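The nesting rules above can be sketched as a small check. This is only an illustration, not part of WordFreak: the span representation (half-open character offsets) and the function names orphans and untokenized are assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets: [start, end)


def orphans(children: List[Span], parents: List[Span]) -> List[Span]:
    """Return every child span that is not completely contained in some parent."""
    return [
        (c_start, c_end)
        for c_start, c_end in children
        if not any(p_start <= c_start and c_end <= p_end for p_start, p_end in parents)
    ]


def untokenized(text: str, tokens: List[Span]) -> List[int]:
    """Return the offsets of non-whitespace characters that are in no token."""
    covered = set()
    for start, end in tokens:
        covered.update(range(start, end))
    return [i for i, ch in enumerate(text) if not ch.isspace() and i not in covered]


text = "in vivo"
tokens = [(0, 2), (3, 7)]
sentences = [(0, 7)]
paragraphs = [(0, 7)]
print(orphans(tokens, sentences))      # tokens outside any sentence
print(orphans(sentences, paragraphs))  # sentences outside any paragraph
print(untokenized(text, tokens))       # characters outside any token
```

A file that satisfies the rules gives three empty lists. A child that only partly overlaps its parent counts as an orphan here, which matches the "no overlapping" requirement.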
Every non-whitespace character must be in a Token. Within the non-linguistic sections (journal name, author name(s), research institution name, PMID number), correct tokenization is not an issue; that is, every non-whitespace character has to be in a token, but the boundaries between the tokens are not our concern. This is discussed in more detail in the Pretagging Guide under Biomedical Text and Otherwise and Non-biomedical material embedded in text.
Part of speech tags are applied to tokens. Within WordFreak and this project, every token is exactly one POS and every POS is exactly one token. We use separate programs to divide the text into tokens and to apply POS tags to those tokens before you review them. In WordFreak POS annotation, a token that has not received a POS tag or that has lost its POS tag is tagged as "token".
An entity may contain any number of tokens, but a token must not contain more than one entity. This rule is different from the other nesting rules because not all the text is contained in entities; indeed, most of it is not. WordFreak will not let you specify a token that crosses a sentence or section boundary, but it does not (yet) enforce this rule for entities, so you must enforce it on yourselves:
Don't create a token whose span crosses either edge of an entity. If a POS or token tag is marked as being assigned by an annotator rather than a tagger, as shown in the status line at the bottom of the main WordFreak window, look at the entity annotation (Oncology or CYP450 in the WordFreak menu) before changing it.
This may come up in several kinds of situation, but chaining is the most likely cause of such a problem, as in the following (nonsense) example:
ENTITY ANNOTATION:
               *
stereo- and isometric alleles
------         X-------------   (chain) "stereometric alleles"
            -----------------   (solid string) "isometric alleles"
               *
               *  The JJ span crosses the left-hand
               *  boundary of the second link in the
               *  chained entity string.
POS ANNOTATION:
               *
stereo- and isometric alleles
--AFX-  -CC ---X-JJ-- --NNS--
      ^        *
     HYPH      *
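To make "crossing an edge" concrete, here is a minimal sketch using the nonsense example above. The half-open offsets and the function name crosses_edge are assumptions for illustration; WordFreak itself does not offer such a check for entities.

```python
def crosses_edge(token, entity):
    """True if the token overlaps the entity without lying entirely inside it."""
    t_start, t_end = token
    e_start, e_end = entity
    overlaps = t_start < e_end and e_start < t_end
    inside = e_start <= t_start and t_end <= e_end
    return overlaps and not inside


text = "stereo- and isometric alleles"
second_link = (15, 29)   # "metric alleles", the second link of the chain
jj_token = (12, 21)      # "isometric" as one JJ token

print(text[second_link[0]:second_link[1]])
print(crosses_edge(jj_token, second_link))  # the problem case in the diagram
print(crosses_edge((12, 15), second_link))  # "iso" as its own token
print(crosses_edge((15, 21), second_link))  # "metric" as its own token
```

Splitting "isometric" at the entity edge gives two tokens, neither of which crosses the boundary of the chained link.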
An annotator posted this question:
[AUC0-24(P)]
I wanted to tag this as LRB NN RRB, where the NN in that sandwich is the AUC0-24(P). I looked at the CYP annotation just to make sure, and the entity tagger had tagged everything but the last square bracket as a single entity! Can that be right?
No, it can't.
The answer I gave for this particular case is:
Go into the entity annotation for CYP 450, select each of those entities, and use the shrink-left button in the chooser window to pull the entity's left boundary off the left bracket. Then POS-annotate those strings of text.
In general, assume that parentheses and brackets are supposed to be balanced. Also assume that in general an entity can be contained in parentheses or brackets, but should not include matching parentheses or brackets at its beginning and end. (But be sure to check for chained entities.)
In the following examples, underlining shows what the entity annotator has marked as a single entity. Validly included material is shown in blue, and invalidly included parentheses are shown in red. Everything said here about parentheses also applies to [square brackets], which we have seen in the biomedical texts, and potentially also to {curly braces} and <angle brackets>, which we have not seen yet. [2004-08-19]
The following are reasonable entity forms:
In example 1 the parentheses contain the entity string but are not part of it: that's fine. In example 2 they are entirely contained within the expression, so you wouldn't even pay attention to them. In examples 3 and 4 the beginning parenthesis is balanced by a matching parenthesis within the expression, so it has to be included to balance its mate.
Example 5 looks as though it includes enclosing parentheses, but each of those marginal parentheses is balanced by a matching parenthesis within the expression. Since the "(" at the left end of the string matches a ")" embedded in the string, they are both part of the entity, and similarly for the ")" at the right end of the string.
In example 6 the text is enclosed in parentheses, but the parentheses themselves are included in the entity expression. This is almost certainly incorrect.
Examples 7-10 have unbalanced parentheses and should be assumed to be incorrect as entities. In each of these cases, presumably, the unmatched marginal parenthesis belongs to the sentence, or possibly to a larger piece of fruit salad that this entity is embedded in. If you see something like these, correct it as described above and report it to the list, with the offending text and the file's source ID and PMID.
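The two tests described above (are the brackets balanced, and is the whole string wrapped in its own matching pair) can be sketched as follows. The function names and the strings other than "[AUC0-24(P)" are made up for illustration, and only round, square, and curly brackets are handled.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def is_balanced(s):
    """True if every bracket in s is closed by its own mate, in order."""
    expected = []  # stack of closers we are still waiting for
    for ch in s:
        if ch in PAIRS:
            expected.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not expected or expected.pop() != ch:
                return False
    return not expected


def wrapped_in_own_pair(s):
    """True if the first character is a bracket whose mate is the last character.

    Only meaningful when is_balanced(s) is already True.
    """
    if len(s) < 2 or s[0] not in PAIRS:
        return False
    depth = 0
    for i, ch in enumerate(s):
        if ch in PAIRS:
            depth += 1
        elif ch in CLOSERS:
            depth -= 1
        if depth == 0:
            return i == len(s) - 1
    return False


def check_entity(s):
    """Classify an entity string by its bracket structure."""
    if not is_balanced(s):
        return "unbalanced"
    if wrapped_in_own_pair(s):
        return "enclosing pair included"
    return "ok"


for entity in ["[AUC0-24(P)", "AUC0-24(P)", "(AUC0-24)", "(a)-b-(c)"]:
    print(entity, "->", check_entity(entity))
```

An "unbalanced" result is a prompt to look at the entity annotation, not a verdict: a single link of a chained entity can be legitimately unbalanced.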
Before correcting an unbalanced parenthesis in an entity annotation, be sure to check for chains. When an entity annotator uses chaining on a split coordination, individual links may include unmatched parentheses that are balanced, and therefore legitimate, within the entity string as a whole. For example:
text:    1-(1-Benzofuran-2 and 3-yl)-2-mesitylethanone
chain 1: 1-(1-Benzofuran-2      -yl)-2-mesitylethanone
chain 2: 1-(1-Benzofuran-      3-yl)-2-mesitylethanone
Such a situation may be apparent from the text, as in this made-up example (although at least chain 1 is a real chemical name), or it may not show up until you look at the entity annotation view. (Since each of these links is itself fruit salad, you would tag each of them as NN, not AFX.)
a new nonsteroidal aromatase inhibitor, R 76 713 (6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H- benzotriazole)...
should be tagged as:
R /NN
76 /CD
713 /CD
( /-LRB-
6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H- /NN [comment: false break?]
benzotriazole /NN [comment: false break?]
) /-RRB-
False line breaks can also occur in regular text. These are like
the false breaks we've seen in very long chemical terms, but in
paragraphs instead: sometimes a blank line gets embedded in the middle
of a paragraph of continuous text, as I just did here on purpose. In
this case, include the empty line in the paragraph tag, and make a
"false break" comment in the Chooser window.
These can sometimes be due to translation or non-native English speakers writing the abstracts. Tag them as they appear in the text, then make a comment in the Chooser window. For example:
Vinyl chloride (VC) is a know animal and human carcinogen associated with liver angiosarcomas...
Tag "know" as know/VB (even though it should be "known/VBN") and make a comment.
[2004-07-29] Abbreviations and initials should be tagged as if they were spelled out. (Santorini, section 5.4, p. 32) For example:
Some biomedical abbreviations can stand for either the adjectival or the adverbial form of the word:
Sometimes in a measurement the symbol for the unit is attached directly to the number of units. Break these up, tagging the number as CD and the unit symbol as NN (or possibly NNS) (archive):
72hr (="72 hours")
72 /CD
hr /NN
72hrs
72 /CD
hrs /NNS
0.5(-6)M (="0.5 x 10^-6 Mol"; see here)
0.5(-6) /CD
M /NN
All abbreviations for measurements, such as "mm" (millimeter(s)), "nM" (nanomole(s)), and "kDa" (kilodalton(s)), are singular (NN), except for the few like "ins." or "lbs." that are explicitly plural (NNS). {7/24}
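As a rough sketch of the splitting rule for a number with an attached unit symbol: the lookup table UNIT_TAGS and the function name split_measurement are hypothetical, and the table holds only symbols mentioned in this section. Restricting the split to known unit symbols keeps designators such as "2D" or "5a" from being broken up by mistake.

```python
import re

# Hypothetical lookup of unit symbols and their tags; extend as needed.
UNIT_TAGS = {
    "hr": "NN", "hrs": "NNS", "M": "NN", "mm": "NN",
    "nM": "NN", "kDa": "NN", "ins.": "NNS", "lbs.": "NNS",
}

# A number (optionally with a decimal part and a parenthesized exponent)
# followed directly by letters, with an optional final period.
_NUMBER_UNIT = re.compile(r"^(\d+(?:\.\d+)?(?:\(-?\d+\))?)([A-Za-z]+\.?)$")


def split_measurement(token):
    """Split e.g. '72hr' into [('72', 'CD'), ('hr', 'NN')], or return None."""
    m = _NUMBER_UNIT.match(token)
    if m is None or m.group(2) not in UNIT_TAGS:
        return None
    number, unit = m.groups()
    return [(number, "CD"), (unit, UNIT_TAGS[unit])]


for token in ["72hr", "72hrs", "0.5(-6)M", "2D"]:
    print(token, "->", split_measurement(token))
```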
We have seen a few instances of abbreviations pluralized with "'s":
[2004-07-29] Many terms referring to measured or calculated values are symbolized with abbreviations that include parentheses. Tag such symbols as NN; do not split them. Of course, if the entity annotators have tagged an entity within such a symbol you will have to split it.
The list includes, but is not limited to:
You will see these much more often in the CYP files than in the
oncology files, because they are part of what the CYP researchers are
looking for. Many of them also occur without the parentheses. The ones
on the list above should not include tagged entities, but other
similar symbols may do so, for example, a symbol referring to the
concentration of a particular compound.
Do not split off a prefix when it is connected directly, without a hyphen (or conceivably other punctuation). Examples: nonfunctioning, antimalarial, precancerous, etc.
AFX is for English affixes, such as non-, anti-, pro-, as well as components of (bio)chemical words, like azo-, azoxy-, and hydro-. Medical components, too. Here's a Q&A from the archives: {2004-04-10}
There is a standard format for representing amino acid and nucleotide substitutions, consisting of either one letter or three letters, then one or more digits, then again either one or three letters (the same number as in the first part). The three-letter amino acid symbols are usually, but not always, cased as capital, small, small.
The oncology entity taggers will have split these up into letter sections and number sections. POS-tag them as NN CD NN.
Examples:
"Ser726Pro": Ser/NN 726/CD Pro/NN
"S276P": S/NN 276/CD P/NN
"G35A": G/NN 35/CD A/NN
Full name                      3-letter code   1-letter code
alanine                        Ala             A
arginine                       Arg             R
asparagine                     Asn             N
aspartic acid                  Asp             D
cysteine                       Cys             C
gamma-carboxyglutamate         Gla
glutamate or glutamine         Glx
glutamine                      Gln             Q
glutamic acid                  Glu             E
glycine                        Gly             G
histidine                      His             H
homoserine                     Hse
hydroxylysine                  Hyl
hydroxyproline                 Hyp
isoleucine                     Ile             I
leucine                        Leu             L
lysine                         Lys             K
methionine                     Met             M
ornithine                      Orn
phenylalanine                  Phe             F
proline                        Pro             P
pyroglutamic acid              Pyr
sarcosine                      Sar
serine                         Ser             S
threonine                      Thr             T
tryptophan                     Trp             W
tyrosine                       Tyr             Y
valine                         Val             V
(unspecified amino acid)       Xaa
any                            ***
gap of indeterminate length    ---
translation stop               TGA
translation stop               TAG
translation stop               TAA
Examples:
Ser276Pro
S276P
Example:
G35A
A56T
{10/15}
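The substitution format described in this section (letters, digits, letters, with the same number of letters on each side) can be sketched as a pattern. The function name tag_substitution is an assumption for illustration. The pattern alone cannot tell a substitution from an unrelated string of the same shape (for example "H2O"), so treat a match as a candidate and check the context.

```python
import re

# Either three letters or one letter, then digits, then three letters or one.
_SUBSTITUTION = re.compile(r"^([A-Za-z]{3}|[A-Za-z])(\d+)([A-Za-z]{3}|[A-Za-z])$")


def tag_substitution(s):
    """Return NN CD NN tagging for strings like 'Ser726Pro', else None."""
    m = _SUBSTITUTION.match(s)
    if m is None:
        return None
    before, position, after = m.groups()
    if len(before) != len(after):  # 3-letter and 1-letter codes are not mixed
        return None
    return [(before, "NN"), (position, "CD"), (after, "NN")]


for s in ["Ser726Pro", "S276P", "G35A", "Ser726P"]:
    print(s, "->", tag_substitution(s))
```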
Often we see strings of the letters G, C, A, and T, representing strings of nucleotides on one strand of DNA (or RNA, with U instead of T): "AGTTCA". Don't split these up. The tokenizer will usually make one token of it; leave it that way and tag the whole string as a single noun. {11/20}
One of the many notations for genetic variation looks like this:
del(8)(q22)
Oncology entity annotators should split out the non-punctuation components of these as entities. The correct POS annotation is nothing surprising: the number is a CD and the other things are NN:
del /NN
( /-LRB-
8 /CD
) /-RRB-
( /-LRB-
q22 /NN
) /-RRB-
In particular, 'p' and 'q' refer to the two arms of a chromosome and are often followed by the number of the chromosome. {10/23}
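The tagging of this notation is mechanical enough to sketch. The function name tag_rearrangement is hypothetical, and the sketch assumes the string contains only round parentheses and the pieces between them.

```python
import re


def tag_rearrangement(s):
    """Split a string like 'del(8)(q22)' at parentheses and tag each piece."""
    tagged = []
    for piece in re.findall(r"[()]|[^()\s]+", s):
        if piece == "(":
            tag = "-LRB-"
        elif piece == ")":
            tag = "-RRB-"
        elif piece.isdigit():
            tag = "CD"   # a bare number
        else:
            tag = "NN"   # everything else, e.g. 'del' or 'q22'
        tagged.append((piece, tag))
    return tagged


for piece, tag in tag_rearrangement("del(8)(q22)"):
    print(piece, "/" + tag)
```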
A Linnaean name follows the format "Genus species". The genus part of the name is always capitalized in this format, the species part never, as in "Homo sapiens". They should both be tagged NNP, capitalized or not. For example:
Drosophila melanogaster
Drosophila /NNP
melanogaster /NNP
E. coli
E. /NNP *
coli /NNP
* Note that the period is part of the first token.
Species names (rat, rabbit, etc.) are NN, with the exception of "human", which is JJ unless it's used as a noun (see JJ or NN). {10/8}
Since there are so many names for mutations and they manifest themselves as different parts of speech, they are to be tagged as they appear in the text. For example: "Wingless/Wnt" should be tagged
Wingless /JJ
/ /SYM
Wnt /NN
{10/1*}
Since we don't have the domain experience to know the meaning of everything that comes up in the biomedical files, we use NN as a default tag for many unfamiliar terms.
Hyphenated numbers (not referring to a range) should be tagged NN because they refer to a chemical name. For example:
Mab 1-68-11
Mab /NN
1-68-11 /NN
Similarly, digit/letter combinations should all be tagged NN. For example:
P-450 2D
P-450 /NN
2D /NN
subclass 5a
subclass /NN
5a /NN
karyotype 45,XY,-7
karyotype /NN
45,XY,-7 /NN
{10/8, 9/16}
G->C
G-->C
G----C
G:C
G--C
Each of the above should be tagged NN SYM NN. However, "G-C" should be tagged as
G /NN
- /HYPH
C /NN
because "G-C" could mean either "G-->C" or "C-->G". {8/25}
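A minimal sketch of the rule for base pair changes: any of the connectors listed above is SYM, except the bare single hyphen, which is HYPH. The function name tag_base_change and the restriction to single bases (A, C, G, T, U) are assumptions for the example.

```python
import re

# base, connector, base; the alternatives are tried left to right, so the
# single hyphen is only used when nothing longer fits.
_CHANGE = re.compile(r"^([ACGTU])(-+>|-{2,}|:|-)([ACGTU])$")


def tag_base_change(s):
    """Tag 'G->C' and its variants as NN SYM NN, but 'G-C' as NN HYPH NN."""
    m = _CHANGE.match(s)
    if m is None:
        return None
    left, connector, right = m.groups()
    connector_tag = "HYPH" if connector == "-" else "SYM"
    return [(left, "NN"), (connector, connector_tag), (right, "NN")]


for s in ["G->C", "G-->C", "G----C", "G:C", "G--C", "G-C"]:
    print(s, "->", tag_base_change(s))
```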
Collocations are specific arrangements of words or words habitually found together. They tend to function as a semantic unit (together they have a meaning in the same way a single word has a meaning). Jo Wright has compiled a list from Santorini's guide and other sources.
When you have a chemical term with internal punctuation, such as
2,3,7,8-tetrachlorodibenzo-p-dioxin
treat the entire string as a single token.
"Fruit salad" is our nickname for chemical terms like this that include strings that aren't normally found in words, such as numbers, commas, hyphens, parentheses, and Greek letter names. Fruit salad is usually an NN, but other parts of speech can also be "fruit salad":
N-(3,5-dichlorophenyl)-2-hydroxysuccinamic acid
N-(3,5-dichlorophenyl)-2-hydroxysuccinamic /JJ
acid /NN
The "-ic" ending makes the word an adjective.
the 2[N]-methylated compounds
2[N]-methylated /VBN
Since both of the parts are extremely technical words, we should consider it fruit salad and not break it up. According to Santorini (pp. 15-17) and our own adaptation of her adjective vs. participle rules we would call this JJ, because we have no evidence of a verb "to 2[N]-methylate". But we do have indisputable VBNs of the same form:
- which is N-demethylated by ...
- was N-demethylated by ...
- being O-demethylated to ...
So we have to conclude that, like so many other parts of this technical vocabulary, "2[N]-methylate" is a verb constructed according to productive rules, and "2[N]-methylated" in this sentence is its VBN. (archive)
DO adjust the tokenization at the ends, if necessary. For example:
TCDD (2,3,7,8-tetrachlorodibenzo-p-dioxin)
If the '(' is tokenized together with the '2', you have to separate them, because the digit is part of the chemical name but the parenthesis is not. Tag as
TCDD /NN
( /-LRB-
2,3,7,8-tetrachlorodibenzo-p-dioxin /NN
) /-RRB-
Here the parentheses clearly mark off a synonym for the
abbreviation.
Fruit salad is not restricted to the names of substances. For example, a biochemical term referring to a process can also be fruit salad (NN or NNS):
N-oxidation
N-oxidation /NN
{7/31}
The abbreviations, when printed without a space, will be considered a single FW, following Santorini p. 32: "Abbreviations and initials should be tagged as if they were spelled out." (See also Abbreviations.)
in/FW vivo/FW
in/FW vitro/FW
corpora/FW lutea/FW
a/FW priori/FW
etc. /FW
e.g. /FW
eg /FW
but e./FW g./FW (if with a space)
i.e. /FW
but i./FW e./FW (if with a space)
rat-specific
rat /NN
- /HYPH
specific /JJ
influenza-like
influenza /NN
- /HYPH
like /JJ (not /AFX; archive 04-01-29)
concentration-response
concentration /NN
- /HYPH
response /NN
In hyphenates like "influenza-like" (an adjective meaning "resembling influenza"), tag "-like" as JJ, not AFX: "influenza/NN -/HYPH like/JJ" (Also mentioned at Guidelines for Specific Words and Terms.)
See also AFFIXES.
See also spelled-out numbers
{2004-06-08}
This is a potentially unclear area. What about, say, "re-alignment"?
Is the hyphen there just to avoid the bogus "real" you'd see if it were
"realignment"? Don't sweat the small stuff, and this is small stuff:
such cases will be rare. Do whatever seems right to you at first thought
and move on. {2004-03-31}
k = n - r
k /NN
= /SYM
n /NN
- /SYM
r /NN
200-230
200 /CD
- /HYPH
230 /CD
10(-6)-10(-4)
10(-6) /CD
- /HYPH
10(-4) /CD
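The range examples above follow one pattern, number, hyphen, number, which can be sketched like this. The function name tag_range is hypothetical. A string with more than two hyphen-separated parts, such as the chemical designator "1-68-11", deliberately does not match and stays a single NN; a two-part designator would still match, so context decides.

```python
import re

# A number, optionally with a decimal part and a parenthesized exponent.
_NUM = r"\d+(?:\.\d+)?(?:\(-?\d+\))?"
_RANGE = re.compile(rf"^({_NUM})-({_NUM})$")


def tag_range(s):
    """Tag 'low-high' as CD HYPH CD, or return None if s is not of that shape."""
    m = _RANGE.match(s)
    if m is None:
        return None
    return [(m.group(1), "CD"), ("-", "HYPH"), (m.group(2), "CD")]


for s in ["200-230", "10(-6)-10(-4)", "1-68-11"]:
    print(s, "->", tag_range(s))
```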
NOTE: There is another, less frequent use of "--" as a "second-order hyphen", which we treat differently.
[This paragraph changed 11/5, 11/19, 2004-07-21.]
{10/15}
Our data provide further evidence for the role of the cytochrome P-450--cytochrome P-450 reductase system in the biotransformation of GTN to an activator (presumably nitric oxide) of guanylyl cyclase.
The author seems to be linking "cytochrome P-450" with "cytochrome P-450 reductase", and using a double hyphen to show that "P-450-cytochrome" is not meant as a unit. This construction is helpful but nonstandard as far as I know, and may be hard to distinguish from the kind of use shown under Base pair changes, which we tag as SYM. So we'll punt the special case and tag such second-order hyphens as SYM.
NOTE: To be distinguished from the em dash referred to just above!
p21-ras
p21 /NN (Gene/gene-RNA)
- /HYPH
ras /NN (Gene/gene-RNA)
PCR-SSTP
PCR /NN (Substance)
- /HYPH
SSTP /NN (Substance)
but not
N-oxidation
N /NN (not tagged as entity)
- /HYPH
oxidation /NN (not tagged as entity)
(see Fruit Salad) [2004-07-29]
{10/8, 10/15}
alpha- and beta-testosterone
alpha /SYM
- /HYPH
and /CC
beta /SYM
- /HYPH
testosterone /NN
Ha-, Ki-, and N-ras
Ha /NN
- /HYPH
, /,
Ki /NN
- /HYPH
, /,
and /CC
N /NN
- /HYPH
ras /NN
{10/15}
Words like "though", "although", and "whereas" are subordinating conjunctions, /IN. {8/6}
"That", which is typically IN or WDT, can also be a DT. The word "that" is a /DT in the following sentence:
Rates were an order of magnitude higher than that for control microsomes.
We have decided to tag certain words as JJ unless we are forced to tag them otherwise (cases of forcing are pluralization, following a determiner with no [other] noun, etc.). For example:
We have very few proper nouns in these files, mostly names of individual persons or organizations ("as Jones has shown"). Proper names of persons, organizations, and places should be tagged as NNP.* Names of drugs or other substances that are capitalized (because they are trademarks) are just plain NN to us; so are gene names and symbols, abbreviations for diseases, and other pieces of biochemical jargon.
* However, if the proper name happens to be an adjective, then it must be tagged JJ:
"EC" (=Enzyme Commission) before a dotted-quad enzyme identification code, as an organization name, should be tagged NNP:
EC 2.6.1.1
EC /NNP
2.6.1.1 /NN
{3/23/04}
The possessive form of a proper name should be tagged according to Santorini, with the POS tag on the possessive suffix, even if it is being used as a noun:
disorders like depression and Alzheimer's
Alzheimer /NNP
's /POS
the Curies' discoveries
Curies /NNPS
' /POS
When in doubt as to whether a word is a JJ or a participle (VBN or
VBG), favor the participle.
Follow this decision tree (thanks to Ann Bies):
I know that this is not quite what is in the POS tagging manual, but it is a version that allows for the VBN/VBG bias. It's also essentially what the treebankers are using at the moment (though we can alter that if need be).
I would expect 1. and 2. to take care of most of the potential ambiguity that we see. I expect most of the weirdness to be in 3., but I think (perhaps too optimistically) that the really ambiguous stuff is in 3b, a construction that doesn't seem to come up all that often in these texts. I half want to say, just stop at 2. and forget about the stuff in 3., by the way. {11/21}
10(-9) /CD M /NN
200 /CD - /HYPH 230 /CD
10(-6) /CD - /HYPH 10(-4) /CD
Theoretically it COULD also be a subtraction (1/1,000,000 - 1/10,000), but we see little if any arithmetic in these abstracts. If in doubt, look at the context. {7/24}
we found a 1.6-fold increase (adjectival)
1.6 /CD - /HYPH fold /JJ
reactivity increased 1.6-fold (adverbial)
1.6 /CD - /HYPH fold /RB
If there is no hyphen, as in "fourfold", then do not split it up, but the same POS guidelines apply: [2004-07-22]
we found a fourfold increase (adjectival)
fourfold /JJ
reactivity increased fourfold (adverbial)
fourfold /RB
{9/3}
-3
- /SYM 3 /CD
48+
48 /CD + /SYM
{9/4}
Tag each string of numbers, with the commas between them, as a single NN, and give the hyphens their own HYPH tags:
2,3,4- 8,9- and 14,15-hydroxylation
2,3,4 /NN - /HYPH 8,9 /NN - /HYPH and /CC 14,15 /NN - /HYPH hydroxylation /NN
This is comparable to what we would do in a similar situation not involving fruit salad:
left- and right-handed spirals
left /JJ - /HYPH and /CC right /JJ - /HYPH handed /JJ spirals /NNS
English-, Spanish-, and Punjabi-speakers
English /NNP - /HYPH , /, Spanish /NNP - /HYPH , /, and /CC Punjabi /NNP - /HYPH speakers /NNS
(The name of a language is a proper noun: Santorini, p.13.) {2004-06-09}
Tag Roman numerals as CD. The Santorini guide takes notice of Roman numerals only in proper names, such as "World War I" or "Pope John XXIII". Biomedical texts have very few proper names, but a number of expressions like "Hepa I" and "subgroup II". These numerals would be NN according to Santorini, but we decided earlier to treat them as CD. {10/8}
Standard English style dictates that a number at the beginning of a sentence should be spelled out: "Fifteen samples were analyzed" versus "We analyzed 15 samples". Scientific writing does not always observe this rule, but we label a cardinal number as a CD wherever it is in the sentence, whether it is spelled out or written in digits. Even if a spelled-out number is hyphenated or consists of more than one word and incorporates spaces, tag it as a single CD: "forty-five", "one hundred twenty". See also under hyphenation. {2004-06-08}
Tag "a dozen" as a single token, CD, rather than as DT NN. Do the same with "two dozen" or "seven dozen" or "eight-and-a-half dozen" if you ever see them, and "a hundred" or "two hundred".
But not these:
We are edging into slippery territory here. "Two" and "fourteen" are definitely CDs, and "some" is definitely not a CD, but there is a grey area between them, and we want to establish our boundaries.
There are two kinds of distinction to be made here. One is between exact and fuzzy numbers. "A dozen" means exactly 12, even though the author may be using it approximately. "A hundred", spelled out, suggests an approximation, whereas "100" suggests a precise number, especially in scientific writing, and "one hundred" feels intermediate. Nevertheless, "a hundred" means literally the same as "100", "a pair" refers to 2 items, and "four score" or "fourscore" = 80.
The second issue is syntactic. Our canonical type of CD, an
integer written out in digits like "42" or "5280" or "1", can be
followed immediately by the noun it quantifies:
So can other forms of numbers that we see in this literature:
The same is true for numbers that are spelled out rather than written in digits:
six years
six /CD years /NNS
fifty ways
fifty /CD ways /NNS
(lower) forty-eight states
forty-eight /CD states /NNS
So far, so good, but nothing new. Now things begin to get interesting.
"A hundred" and "a dozen" have the same syntax as these if we treat the article as part of the number:
a hundred crates
a hundred /CD crates /NNS
a dozen years
a dozen /CD years /NNS
We are making a jump by treating "a" as part of the number in these cases rather than as a determiner, but, as we're about to see, syntax gives us a good reason to jump no further. (As a matter of fact, this jump is historically a jump backward. The determiner "a" is a phonologically and semantically weakened version of [the old English form of] "one". You can see a relic of that in the form "an", which is now used only before vowel sounds but was formerly used wherever we now say "a".)
Now the big question. What about "dozens" and "a couple" and "a
pair"?:
Their syntax is different from that of canonical, core CDs: unlike spelled-out "forty-two" or "a hundred", they must be followed by "of". So even though "a pair" and (usually) "a couple" refer to precisely 2, we will exclude them from the class of CDs because of their syntax:
dozens/NNS of/IN reasons/NNS
a/DT couple/NN of/IN friends/NNS
a/DT pair/NN of/IN peaches/NNS
So, to sum up, here are the criteria for words that might or might not be tagged as CD:
NOTE: "Couple" is sometimes used without "of", and is sometimes used to mean something like "several". The latter may be hard to determine, and the former is hardly likely to show up in our texts. So even though "me and a couple friends" might or might not satisfy the above criteria, always tag "couple" as NN. {2004-07-22}
Santorini (p. 25) recognizes "half" as a JJ, NN, or PDT:
a half /JJ point
half /NN of the time
half /PDT his time
half /PDT the time
and "one-half" as JJ or RB:
one-half /JJ cup
cf. a full /JJ cup
one-half /RB the amount
cf. twice /RB the amount
double /RB the amount
The same logic should apply in expressions like "half as susceptible":
half /RB as /IN (p. 23) susceptible /JJ
{2/24/04}
Tag "half-life" as
half /JJ
- /HYPH
life /NN
[2004-07-29]
Always tag Greek letters as SYM. (Note that Latin letters such as "X", "G", etc. are tagged as NN.) {2004-06-09}
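As a sketch of this rule, a lookup over the spelled-out letter names (the same 24 names listed in the alphabet table in this section) is enough. The function name tag_letter is an assumption, and the sketch only covers letters that stand alone as tokens.

```python
GREEK_LETTER_NAMES = {
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi",
    "rho", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
}


def tag_letter(token):
    """SYM for a spelled-out Greek letter, NN for a single Latin letter, else None."""
    if token.lower() in GREEK_LETTER_NAMES:
        return "SYM"
    if len(token) == 1 and token.isascii() and token.isalpha():
        return "NN"
    return None


for token in ["alpha", "Beta", "X", "G", "ras"]:
    print(token, "->", tag_letter(token))
```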
THE GREEK ALPHABET
alpha iota rho
beta kappa sigma
gamma lambda tau
delta mu upsilon
epsilon nu phi
zeta xi chi
eta omicron psi
theta pi omega
"The mutations were 3 GGT- greater than GAT transitions in codon 12..."
This is supposed to be:
"The mutations were 3 GGT->GAT transitions in codon 12..."
"- greater than" is intended to be "->"; it's been mangled by some series of operations between the text supplied by the publisher and the version available on Medline. We see various bizarre representations of the "<" and ">" symbols, sometimes alone but often in arrow combinations. They should all be tagged SYM.
All of these mean "gly->val", as do other forms with more or less spacing:
gly - greater than val
gly - > val
gly - > val
gly -> val
gly - gt val
gly - > val {2004-06-18}
In each case the entire string between "gly" and "val" (the hyphen and the misrepresentation of ">") should be made into a single token and tagged SYM. For example:
gly /NN - greater than /SYM val /NN
{8/31}
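A minimal sketch of making the mangled arrow a single SYM token follows. The function name tag_mangled_arrow is hypothetical, the pattern covers only the variants listed above, and the sketch assumes the material on each side of the arrow is a single noun.

```python
import re

# A hyphen, optional spacing, then one of the representations of ">".
_MANGLED_ARROW = re.compile(r"-\s*(?:greater than\b|gt\b|>)")


def tag_mangled_arrow(s):
    """Tag 'gly - greater than val' as NN SYM NN, keeping the arrow as one token."""
    m = _MANGLED_ARROW.search(s)
    if m is None:
        return None
    left = s[:m.start()].strip()
    right = s[m.end():].strip()
    return [(left, "NN"), (m.group(0), "SYM"), (right, "NN")]


for s in ["gly - greater than val", "gly - > val", "gly -> val", "gly - gt val"]:
    print(s, "=>", tag_mangled_arrow(s))
```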
Santorini's advice for distinguishing adverbs from particles (p.21) doesn't always help with a word like "up-regulate" or "down-regulation". While she says "it is important to realize that the idiomaticity of a particular collocation is not a diagnostic for the distinction", in this case it helps to note that the meaning of the word isn't idiomatic but compositional, that is, predictable from the meanings of the parts. Unlike common particle verbs like "give up" or "turn in", the common meanings of "up" and "down" as 'to a higher value' and 'to a lower value' combine with the meaning of "regulate" to produce the meanings of these words. Even when the verb has been made into a noun, as in "down-regulation", "down" should be RB.
up-regulate
up /RB - /HYPH regulate /VB
down-regulation
down /RB - /HYPH regulation /NN
{2004-03-31}
(Generally excluding additions dated with change tags.)
(The following two changes were uploaded at the same time as the overall
reformatting, on 2004-07-22:)
'/' tagged as SYM, not HYPH:
2004-07-21. In the earlier text, the tag on '/' was HYPH. That goes
back to the meeting on October 1, 2003 (with the old tag of '#' for
the hyphen). SYM is clearly correct, and that is how we have been
tagging it: the entire annotation database contains only one case of
'/' tagged as HYPH and two with the '#' tag.
Base pair substitutions:
2004-07-21. This section was labeled "transversions", a term which
refers to only a subset of the cases in which this notation is
used.
2004-07-22. Reformatted to facilitate updating.
New examples in Entity boundaries and new
subsection there about parentheses in
chained entity names.
2004-07-29. Moved suffixal "-like" from Guidelines for Specific
Words and Terms to Hyphens, with cross-reference.
2004-07-29. Moved discussion of prefixal "+" and "-" on a number
from SYM to Punctuation in
Numbers.
2004-08-09. Reformatted Biomedical Conventions.
Added boldface on "how to do this" tags.
2004-08-20. Yesterday and today: clarified the definitions of a
number of entity types that we have been taking for granted and
amplified the discussion of the relationship between tokens and entity
mentions.
2004-09-15. Added "chained entities" section based on meeting
notes from September 14, 2004.
2005-05-17