The Gene-Entity category includes genes as well as their downstream products such as transcripts and proteins, in addition to the more general groups of gene and protein families, super-families, and so forth.
Note that the category name "Gene-Entity" is not a completely accurate description of the members of this class, since the category includes things other than genes. However, most things in this class are genes, and everything is either a gene or gene-derived (transcripts and proteins). The diagram that follows attempts to illustrate this point and provides some examples.
There are two ways to think about genes.
1. Genes as conceptual entities. (This is what we want to capture.)
Genes refer to segments of the genome which have been identified with a specific function or product (for example, the gene for eye color in a fly or a membrane receptor in humans). Although they are "things", they really represent abstract concepts. We can talk about the gene "K-Ras", but we are really referring to an abstract concept – an "ideal form" of the K-Ras gene, which has known attributes. We can’t point to K-Ras; we can only point to instances of K-Ras. Each of these instances (a specific manifestation of the gene as described in #2 below) has the attributes and characteristics of the abstract concept of K-Ras, but the different instances of K-Ras may vary slightly from one another. (This parallels the concept of "species". We all have an intuitive grasp of the species concept, and can tell most species apart, such as a grizzly bear from a polar bear. However, when we visit the zoo we encounter instances of a species -- individual bears -- and not the concept itself.) Although this may seem pedantic, there is an important reason for making this distinction, which we’ll describe below.
Let’s consider some examples based upon this logic:
- For genes: c-kit, CD117, and alpha-smooth muscle actin
- A non-biology example: a 2003 Ferrari Modena. This is an abstract concept for a specific type of car. However, you can’t point to an abstract 2003 Ferrari Modena, you can only point to specific instances which may vary, even if slightly, between one another.
- K-Ras as investigated in Bob. This can be a tricky example since it would appear as though we are talking about a specific instance of K-Ras. But remember, in nearly all cases, genes are paired in humans (sometimes there are even more copies). So in reality, we are still talking about the concept of K-Ras. It only becomes specific when we refer to an actual "readable" sequence that is unique to Bob. This same logic applies to populations and families, as well as individuals.
It is this abstract notion of gene (as well as protein and transcript) that we are trying to capture in this project. Therefore, we would also like to include the "higher-order" classes to which genes and proteins belong. The gene "SMK1" is a member of the group MAP Kinase; therefore MAP Kinase would also be tagged as a Gene-Entity. MAP Kinase is in turn a member of the group Kinase; therefore Kinase would also be tagged as a Gene-Entity. Sets of genes such as gene families, gene super-families, protein families and the like are all conceptual representations of genes (with certain characteristics) and thus should all be tagged as gene-entities. (This too mirrors the organization of animals, where species are grouped into genera, such that two species within the same genus have more in common than either does with a species in another genus; genes in gene families are similarly grouped. The genera are organized within families, families within orders and so forth.)
Let’s review what is and is not included in our definition of gene-entity so far. What we want to include are conceptual representations of genes (and proteins etc.). This includes genes, gene families, gene super-families, etc., as well as their protein and transcript counterparts. In short, this is everything between the most general term "gene", which really has no informational content, and specific instances of a gene as described in #2 below.
2. Specific manifestations of a gene. (This is not what we want to capture.)
Since a gene refers to a region of chromatin with an associated function, then in any individual there must be a specific sequence of nucleotides that one can point to which is an instance of this gene with a defined phenotype. That is, in any person there is a specific manifestation of the gene; this is the specific sequence of nucleotides that one can "read". We are no longer referring to an abstract notion but a specific instance of the concept with possibly its own, unique, characteristics. For genes, this specific sequence of nucleotides -- a physical instance of the gene -- is referred to as an allele.
Let’s look at a specific example. In Mendel’s peas, we are all familiar with the gene for seed color. Seed color (like eye color, or K-Ras) is a general abstract concept: a region on the chromosome that is associated with a phenotype (in this case seed color). Now, if we look at peas we notice that there are two different seed colors possible, yellow seeds and green seeds. Each of these seed colors is the result of a specific allele for seed color, and each allele is a different sequence of nucleotides. That is, if we look in the region of the genome associated with the seed color gene, we will notice that the specific sequence of nucleotides differs between yellow and green seeds (these are specific and recognizable manifestations of the gene). Sometimes the differences between individuals (and alleles) are so subtle that no distinguishable phenotype exists, as in the case of "silent" substitutions.
For our non-biology example, this would be equivalent to "Mark’s 2003 Ferrari Modena". Remember, a "2003 Ferrari Modena" is an abstract concept referring to a particular type of car, whereas "Mark’s 2003 Ferrari Modena" refers to a specific instance to which we can point and refer. Therefore, "Mark’s 2003 Ferrari Modena" has a specific suite of characteristics (dings, dents, VIN, etc.), which make it uniquely identifiable and distinguishable from "Yang’s 2003 Ferrari Modena".
At this point we need to return to the example "Bob’s K-Ras gene". Superficially it appears as though we are talking about a specific instance of K-Ras, namely K-Ras in Bob. Therefore, is it a conceptual entity (in which case we would tag it) or a specific instance (in which case we would ignore it)? Since Bob (and any other human) has paired chromosomes and thus two copies of nearly all genes (with some exceptions in men), "Bob’s K-Ras gene" refers to a conceptual entity and NOT a specific instance of K-Ras. This is analogous to the situation we discussed above with seed color in peas; here, "Bob’s K-Ras" simply refers to the gene and not the specific alleles he may house. (One tangent needs a brief mention here. Why aren’t we concerned with all of the millions of copies of K-Ras scattered about Bob’s cells? The answer is very simple. Since Bob started as a single fertilized zygote, we make the assumption that every cell is identical in its genetic profile. More importantly, we will only care about a subset of cells with an altered genetic profile if it results in a tumor -- the very thing we are trying to capture in the project. Therefore, if a subset of cells has an alteration which results in malignancy, this will be captured by first noting the gene, then the variation, and finally the resultant malignancy.)
Since we are not interested in specific instances of a gene (or protein or transcript), then we really are not interested in the parts of a gene either. Specific codons, exons, introns, nucleotide sequences, binding regions, functional domains etc. should NOT be tagged as gene entities. Why? Simple: because they do not correspond to the conceptual idea of a gene and thus do not inherit the "ideal characteristics" and "properties" of the gene.
So, why are we worrying about this distinction between the concept of a gene and specific manifestations? The answer is related to the overall project goals. Remember, the long-term focus of this project is to capture the relation "a gene-entity has some variation which results in malignancy". This is another way of saying we are only interested in linking together those genomic elements which, when mutated or changed, result in disease. Therefore, the only time we would ever be interested in specific instances of the same gene (i.e., differences between people, or the results of mutation) is when it is related to a malignancy, in which case that difference will be captured by the variation-entity. Again, differences in the same gene are only interesting when they relate to malignancy; furthermore, those differences are already being captured in the variation-entity class.
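The tagging rule described above can be summarized in a short sketch. This is only an illustration of the logic, not a tool used in the project: the names MentionKind, is_gene_entity, PARENT_CLASS and concept_chain are hypothetical, and the only data in it is the SMK1 example from the text. It assumes the annotator has already decided what kind of thing a mention is.

```python
from enum import Enum


class MentionKind(Enum):
    """Hypothetical categories an annotator might assign to a mention."""
    GENE = "gene"
    PROTEIN = "protein"
    TRANSCRIPT = "transcript"
    FAMILY = "family"      # gene/protein families, super-families, etc.
    ALLELE = "allele"      # a specific, readable sequence (an instance)
    PART = "part"          # codon, exon, intron, domain, binding region


# Conceptual representations are tagged; instances and parts are not.
CONCEPTUAL_KINDS = {
    MentionKind.GENE,
    MentionKind.PROTEIN,
    MentionKind.TRANSCRIPT,
    MentionKind.FAMILY,
}


def is_gene_entity(kind):
    """Return True if a mention of this kind should be tagged."""
    return kind in CONCEPTUAL_KINDS


# Higher-order classes from the SMK1 example in the text.
PARENT_CLASS = {
    "SMK1": "MAP Kinase",
    "MAP Kinase": "Kinase",
}


def concept_chain(name):
    """Return the name followed by every higher-order class above it.

    Each item in the chain is itself a conceptual entity, so each
    would be tagged as a Gene-Entity when it appears in text.
    """
    chain = [name]
    while chain[-1] in PARENT_CLASS:
        chain.append(PARENT_CLASS[chain[-1]])
    return chain
```

For example, concept_chain("SMK1") walks from SMK1 up through MAP Kinase to Kinase, all of which are taggable, while is_gene_entity(MentionKind.ALLELE) is False because an allele is a specific manifestation that is left to the variation-entity class.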
Genes, in general, are given two different "official" names. Both naming systems will be encountered in the text, and both should be labeled as gene-entities.
1. A symbolic representation.
Most genes have a symbolic representation (for example, SMK1). This name is often a combination of letters and numbers which serves as an acronym for the longer descriptive name (described below in #2). Although there is a recognized convention for determining official gene symbols, it is rarely adhered to. Some people will use uppercase letters, whereas others will use lowercase letters. This inconsistency conflates gene and protein nomenclature, which often uses the same symbol in different forms, making it nearly impossible to differentiate genes and proteins solely by the symbol in text. For our purposes here that does not matter, since genes and proteins are both gene-entities (in fact, this was part of our logic for grouping them together!).
2. A descriptive summary.
All genes have a descriptive name which summarizes their function (more accurately, the function of the gene’s protein). For example, the gene SMK1 is also referred to as "sporulation MAP Kinase 1".
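Because the symbol and the descriptive name refer to the same gene-entity, and because capitalization is used inconsistently, a lookup that ignores case and extra spacing treats both forms alike. The following is a minimal sketch under that assumption; the table KNOWN_NAMES and the function lookup_gene_entity are hypothetical, and the single entry comes from the SMK1 example above.

```python
# Hypothetical lookup table keyed by lowercased names. Both the symbol
# and the descriptive name map to the same gene-entity.
KNOWN_NAMES = {
    "smk1": "SMK1",
    "sporulation map kinase 1": "SMK1",
}


def lookup_gene_entity(mention):
    """Return the gene-entity for a mention, or None if it is not listed.

    Case and repeated whitespace are ignored, since authors do not
    follow one capitalization convention for genes versus proteins.
    """
    key = " ".join(mention.lower().split())
    return KNOWN_NAMES.get(key)
```

A mention that is not in the table returns None, which matches the caveat that no list of names is complete: an unlisted gene may still need to be tagged.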
The following resources will be of use in recognizing gene names and symbols. However, they are not complete, and you will most likely encounter genes that need to be tagged but are not listed in them. If you have any questions, feel free to contact Yang Jin (jin@genome.chop.edu).
2003-04-08; reformatted 2003-06-30