Resources for Biomedical Terminology and Ontology
This web page is no longer being maintained.
Please send all comments to Mark Mandel (mamandel@ldc.upenn.edu).
- Most entries have one or more comments. The source of each
comment is identified in parentheses before the comment (lowercase
initials for members of this project). An ID of "site" means the
comment is taken from the site itself. I have taken liberties in
quoting and editing, often silently, so "site" comments are not
guaranteed to be verbatim or even all taken from the same page of the
site.
- Some sites have an interface to vocabulary that we may find
useful. I have been testing them with "lactase" for the enzyme
sites. They are listed below.
- In general, entries are listed alphabetically within categories,
but the categorization doesn't claim to be precise. Projects have
software; bibliography may have or include papers. In particular, the
boundary between ontologies and databases is fuzzy. Entries within a
subcategory are indented.
- Where an entry is alphabetized by a non-initial part of its name
(usually something preceded by "University of") that word is in
boldface. In some subsections local alphabetization gives way to
internal structure without warning. This should not be a problem.
- A few entries have their name in parentheses; these may be
marginal for our purposes, per comments. I don't promise that
unparenthesized entries are not marginal! This is a superficial
survey.
- The date that the entry on this page was last updated may appear
at the end of a comment, usually the site comment, in the format
(02-09-27). This example stands for Sept. 27, 2002; undated
entries were added before that date.
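The date stamps described above can be read mechanically. The following is a minimal sketch in Python, assuming stamps appear either with a two-digit year, as in (02-09-27), or with a four-digit year, as in a few later entries such as (2007-09-19). The function name `parse_stamps` and the rule that every two-digit year belongs to the 2000s are assumptions made for illustration, not something this page specifies.

```python
import re
from datetime import date

# Matches (02-09-27) as well as (2007-09-19); the year has 2 or 4 digits.
STAMP = re.compile(r"\((\d{2}|\d{4})-(\d{2})-(\d{2})\)")

def parse_stamps(text):
    """Return every parenthesized date stamp in text as a datetime.date."""
    stamps = []
    for year, month, day in STAMP.findall(text):
        y = int(year)
        if len(year) == 2:
            y += 2000  # assumption: two-digit years are all 20YY
        stamps.append(date(y, int(month), int(day)))
    return stamps

comment = "(site:) a database of interactions. (02-09-27) (2007-09-19)"
print(parse_stamps(comment))
```

A stamp that is not a valid calendar date raises `ValueError` from `date`, which seems preferable to silently accepting a mistyped entry.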
Table of Contents
Vocabulary interfaces
These links point to the sites' entries on this page.
See also Theory.
- BioMed Central
http://www.biomedcentral.com/info/about/datamining/
(site:) BioMed Central has so far published 25228 articles of
peer-reviewed biomedical research, all of which are covered by our
open access license agreement which allows free distribution and
re-use of the full-text article, including the highly structured XML
version. As a result, BioMed Central's research article corpus is
ideally suited for use by text mining researchers.
(07-06-26)
- GENIA Project
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
(mam:)
at Univ. of Tokyo's Tsujii Lab
(site:) seeks to automatically
extract useful information from texts written by scientists to help
overcome the problems caused by information overload. We intend that
while the methods are customized for application in the micro-biology
domain, the basic methods should be generalisable to knowledge
acquisition in other scientific and engineering domains. We are
currently working on the key task of extracting event information
about protein interactions. ...We are now developing a parser,
ontology, thesaurus and domain dictionaries as well as supervised
learning models.
(email from site 02-12-27:)
We have released the GENIA corpus version 3.0, consisting of 2000
MEDLINE abstracts, and annotated with technical terms and their
semantic classes, for download.
In this version, the terms inside other terms are also marked up. We
also have corrected sentence boundary errors.
Please see:
(03-01-28)
- PIR, Protein Information Resource
http://pir.georgetown.edu/pirwww/
(site:)
a division of the National Biomedical Research Foundation (NBRF) which is affiliated with Georgetown University Medical Center.
Databases:
- PIR-PSD (PIR-International Protein Sequence Database)
- iProClass Protein Classification Database
- PIR-NREF (PIR Non-Redundant Reference Sequence Database)
(02-10-01)
- Stanford Medical Informatics
http://www.smi.stanford.edu/
(mam:)
Stanford Univ., Russ Altman. They run RiboWEB.
(02-10-27)
(site:) an interdisciplinary
academic and research group within the Department of Medicine in the
Stanford University School of Medicine. SMI brings together scientists
who create and validate models of how knowledge and data are used
within biomedicine. Our research staff and students study new methods
for acquiring, representing, processing, and managing knowledge and
data within health care and the biomedical sciences.
- Textpresso: An Ontology-Based
Information Retrieval and Extraction System for Biological Literature
http://www.textpresso.org/
(mam:)
based on Wormbase. Also for N. crassa, and prototype sites for
Neuroscience and Fly literature.
(05-01-06)
(paper linked from site:)
a new text-mining system for scientific literature whose
capabilities go far beyond those of a simple keyword search
engine. Textpresso's two major elements are full text articles, and
categories of terms forming an ontology: classes of biological
concepts (e.g., gene) and classes that relate two objects
(association) or describe one (biological process). This ontology
(currently 33 categories) is used to mark up the whole corpus. The
user can search a sentence or document for one or a combination of
these tags and/or keywords and formulate semantic queries. Full text
access increases recall of biological data types from 45% to
95%. Identifies sentences nearly as well as expert curators. Currently
focuses on Caenorhabditis elegans literature, with 3,800 full
text articles and 16,000 abstracts. The lexicon contains 14,500
entries, each of which includes all versions of a specific word or
phrase, and all categories of the Gene Ontology database.
- TRESTLE Project, Text Retrieval, Extraction
and Summarisation for Large Enterprises
http://nlp.shef.ac.uk/trestle/
(mam:)
Univ. of Sheffield, Rob Gaizauskas et al.
(02-09-30)
(site:) The application of
state-of-the-art techniques for information extraction to a
large-scale, real-world situation, namely some newsfeeds used by
GlaxoSmithKline. From the academic point of view this gives a view on
which techniques stand a chance of real use, and which directions
pragmatic research could go in. From the commercial point of view a
state-of-the-art system gives new utility to existing information,
which is a competitive edge.
- (CTV3: Clinical Terms Version 3)
http://www.nhsia.nhs.uk/terms/pages/default.asp
(mam:)
Run by the Clinical Terminology Service of the British National Health
Service; included in the UMLS Metathesaurus,
according to NCI EVS
(http://ncicb.nci.nih.gov/NCICB/core/EVS/Vocabulary/Thesauri/)
(02-10-08)
(site:)
a set of files containing Read coded clinical concepts (and their
representative clinical terms) in a hierarchical relationship,
together with associated cross-references to the clinical
classifications. Additional information relating to the concepts and
to the terms is also included. The product is usually accompanied by a
number of Value Added Files (VAFs) and by Bonus Files as described
elsewhere on this site.
(02-10-08)
- Edinburgh Mouse Atlas: The Standard Anatomical Nomenclature Database
http://genex.hgu.mrc.ac.uk/Databases/Anatomy/
(interface:)
Two interfaces with the same
entry page:
(02-10-31)
- Current interface:
Non-java, indented-text anatomy -- This leads to a flat set of pages, one for each developmental stage, with an ontologically-indented series of terms in plain (preformatted) text.
- Under development:
Prototype XML - download the XML with a prototype DTD - not yet standard.
(site:) Our aim is to build a generally accepted nomenclature for the components of the mouse embryo, at successive stages from fertilization to adult... to provide a standardized vocabulary that will be a framework for describing patterns of gene expression and other processes occurring during normal and aberrant development... to be used within the Mouse Gene-Expression Information Resource (MGEIR) and as part of the Edinburgh Mouse Atlas. The nomenclature has been implemented as an Object-Oriented Database (using ObjectStore); as well as the names for all tissues, this will include synonyms, groups, lineage (as far as it is known) and supplementary and reference information.... The project is in collaboration with the Department of Anatomy, University of Edinburgh.
Gene Ontology
http://www.geneontology.org
(site:) The goal of the Gene Ontology™ Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
(interface:)
Entry to the ontologies:
- Term Definitions: This text file contains the available definitions for all defined terms within each of the ontologies.
XML format: On a monthly basis, XML format files are generated. Two files are available, one with gene associations and one without. Specific information on the file contents is available from the download page. The November 2002 (gzipped) file sizes total about 45MB.
(02-11-20)
(mam:)
Databank GO in SRS list; 11446 entries
(lu:) my favorite
(as:)
Contains ontologies for Molecular Function, Biological Process and
Cellular Component. --
GO is a database independent of any other. GO itself is not populated
with gene products of any organism, although tools can be built which
allow GO to be displayed as if it were. --
...so I won't find "filled-in" relations here.
iProClass Protein Classification Database
http://pir.georgetown.edu/pirwww/dbinfo/dbinfo.html
(interface:)
Entry to databases. This one seems most useful to us: (02-11-01)
(site:)
comprehensive descriptions of all proteins. Includes family relationships at both global (superfamily/family) and local (domain, motif, site) levels, as well as structural and functional classifications and features of proteins. The database is extended from ProClass, a protein family database that organizes proteins based on PIR superfamilies and PROSITE motifs. More than 809,000 non-redundant PIR-PSD, SwissProt and TrEMBL proteins
MeSH: National Library of Medicine's Medical Subject Headings
http://www.nlm.nih.gov/mesh/meshhome.html
(interface:)
(02-11-02)
(site:) MeSH is the National
Library of Medicine's controlled vocabulary thesaurus... provided
free to those who license MEDLINE and to those
obtaining MeSH via electronic means. (from Fact Sheet at http://www.nlm.nih.gov/pubs/factsheets/mesh.html)
NCI EVS, National Cancer Institute Enterprise Vocabulary Services
http://ncicb.nci.nih.gov/NCICB/core/EVS
(mam:)
Basically internal to the NCI. They funnel externally created
vocabularies to NCI users, and the thesaurus/ontology they create is
integrated into UMLS.
(pw:)
The next best thing to the NCI
Metathesaurus, publicly available controlled vocabularies for many
of the cancer-specific datasets we would be interested in including.
(site:)
A set of services and resources that address NCI's needs for
controlled vocabulary. The NCI Thesaurus, a biomedical thesaurus
created specifically to meet the needs of the NCI, is available as a
stand alone vocabulary and as the NCI Metathesaurus, in which a
version of NCI Thesaurus is integrated with the UMLS Metathesaurus. In
addition, the NCI EVS provides NCI with MedDRA, SNOMED, ICD, MeSH, and many other standard controlled
vocabularies.
-- Some of this database content is ACCESS LIMITED. ["Guest" access is
available.]
(02-11-07)
SUMO, Suggested Upper Merged Ontology
http://www.ontologyportal.org
(e-mail from site:)
Although there isn't a great deal of biomedical
content specifically, there is a small ontology of Virus
information, which extends SUMO.
(04-12-18)
(site:)
SUMO and its domain ontologies form the largest formal public ontology
in existence today. Being used for research and applications in
search, linguistics and reasoning. The only formal ontology that has
been mapped to all of the WordNet lexicon. Written in the SUO-KIF
language; free and owned by the IEEE. The ontologies that extend SUMO
are available under GNU General Public License.
(05-01-06)
UMLS, Unified Medical Language System
http://www.nlm.nih.gov/research/umls/
(interface:)
Web based interaction tools and a programmer interface.
Web, download, and CD-ROM access information
here.
-- (as:) We will have UMLS installed locally shortly.
(02-11-02)
(lu:) a very large and slightly ugly (but useful) ontology
(mam:)
For a list of the sources in UMLS (2002), see here.
This list is organized by English vs. foreign language and licensing restriction category, and is derived from http://www.nlm.nih.gov/research/umls/license.html; the restriction categories are defined there.
It is a text file and may not word-wrap in your browser, so you may want to download it.
- AIDSDRUGS Structures
http://chem.sis.nlm.nih.gov/aidsdrg4.html
(interface:)
The page is a table of about 325 drugs. The first column, "Name", has lots of synonyms, a typical entry being "Dextran sulfate ; Dextran sulfate sodium ; Dextran sulfuric acid ester sodium salt ; Asuro ; Colyonal ; Dexulate ; Dextrarine ; MDS ; PF51".
(02-11-06)
(site:)
a table of substances referenced in the NLM AIDSDRUGS file.
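The "Name" column described above packs all of a drug's synonyms into one semicolon-delimited string. The following is a minimal sketch in Python of splitting such a cell into individual names, using the sample entry quoted in the interface comment. The function name `split_synonyms` is hypothetical, and the sketch assumes that no synonym itself contains a semicolon.

```python
def split_synonyms(name_cell):
    """Split a semicolon-delimited Name cell into a list of trimmed synonyms."""
    return [part.strip() for part in name_cell.split(";") if part.strip()]

# Sample cell copied from the interface comment above.
cell = ("Dextran sulfate ; Dextran sulfate sodium ; "
        "Dextran sulfuric acid ester sodium salt ; Asuro ; Colyonal ; "
        "Dexulate ; Dextrarine ; MDS ; PF51")

synonyms = split_synonyms(cell)
for name in synonyms:
    print(name)
```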
- BIND, The Biomolecular Interaction Network Database
http://www.bind.ca/
(site:)
A database designed to store full descriptions of interactions, molecular complexes and pathways. Development has led to the incorporation of virtually all components of molecular mechanisms including interactions between any two molecules composed of proteins, nucleic acids and small molecules. Stats: interactions 6171; complexes 851; pathways 8.
(02-09-27)
- Non-Redundant B.subtilis database
http://pbil.univ-lyon1.fr/nrsub/nrsub.html
(site:)
This server provides access to the complete genome of Bacillus subtilis. Additional data on gene mapping and codon usage have been added, as well as cross-references with the SWISS-PROT, ENZYME and HOBACGEN databases.
(02-10-29)
- CrossFire Beilstein data: organic compounds [commercial]
http://www.mdl.com/products/xfirebeilstein.html
(site:)
The world's largest compilation of chemical facts. Cornerstone database to the organic chemistry literature. Offers scientists comprehensive detail, fast search speeds, and high-quality data indexed from over 180 journals.
(interface:)
MDL CrossFire Web, proprietary Java interface
Penn subscription (per pw):
http://www.library.upenn.edu/webbin5/resources/client-select.cgi?beilstein. --
(mam:) The Windows installation of Beilstein Commander, which seems to include the Gmelin database as well, occupies 31MB, using 48.1MB of disk space on my Windows 2000 machine.
(02-11-20)
(pw:)
There may be significant access obstacles for many of the medically-oriented
vocabulary sources. We may need some advice as to how best to arrange for
access to semi-proprietary sources.
- Blocks
http://www.blocks.fhcrc.org/
(mam:)
Databank BLOCKS in SRS list; 4071 entries
(site:)
A service for biological sequence analysis at the Fred Hutchinson Cancer Research Center in Seattle, Washington, USA. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.
(02-10-01)
- Brenda enzyme information system
http://www.brenda.uni-koeln.de/
(interface:)
Clicking on a search link from the main page initiates a session with a generated URI. Several Web search tools, including a Synonyms search that lists EC number, recommended and systematic names, other synonyms (17 for "lactase", including 7 apparent product names), CAS registry number, and many types of biomedical information with references and links.
Login required, but main page says "Free access for University of Pennsylvania", so we may (speculation!) be able to arrange programmatic access.
(02-11-02)
(site:)
The main collection of enzyme functional data available to the
scientific community. Available free of charge for academic,
non-profit users. The enzymes are classified according to the Enzyme Commission list of enzymes. Some
3500 "different" enzymes are covered. Frequently enzymes with very
different properties are included under the same EC number. Being
developed into a metabolic network information system with links to
Enzyme expression and regulation information.
- (CAS (Chemical Abstracts Service) Registry)
http://www.cas.org/EO/regsys.html
(interface:)
Apparently their target is commercial developers. I can't tell from their
descriptions whether their interfaces would be useful to us at all.
(02-11-02)
(site:) The largest and most current database of chemical substance information in the world containing more than 41 million substance records, including the world's largest collection of Organic compounds, Inorganic compounds, Metals, Alloys, Minerals, Coordination compounds, Organometallics, Elements, Isotopes, Nuclear Particles, Proteins and Nucleic Acids, Polymers, and Nonstructurable materials (UVCBs).
- ChemIDplus
http://chem.sis.nlm.nih.gov/chemidplus/
(site:)
A database of 350,000 chemical records, including over 100,000 with structures. Locator links allow immediate searching of other databases for information about a given chemical.
(interface:) (as:)
When I checked last Spring the National Library of Medicine was preparing a version of the ChemIDPlus database for download later in 2002. I got this information from the same place I downloaded Medline (the nlm download site). -- (mam:) Viewing 3-D structures requires a free plug-in.
(02-11-06)
(mam:)
The site mentions SuperList,
"a set of data in the NLM ChemIDplus file which carries data from selected regulatory or scientific Lists on which the specific substance appears", but it doesn't appear to be directly accessible. Perhaps the downloadable version will include it.
- ChemNet online chemical dictionary
http://www.chemnet.com/dict/
(interface:)
Enter term on main page. "Lactase" yields 14 synonyms in all.
(02-11-02)
(site:) with over 300,000 chemical entries, provides you with a simple interface ... Each entry yielded by your query will identify a chemical name, CAS Registry Number, and chemical synonyms. You may search the dictionary using any of these three fields.
- (DDBJ, DNA Data Bank of Japan)
http://www.ddbj.nig.ac.jp/
(mam:)
Duplicates EMBL and GenBank, if the quote below is accurate, but also provides tools.
(site:)
DDBJ is the sole DNA data bank in Japan, which is officially certified to collect DNA sequences from researchers and to issue the internationally recognized accession number to data submitters. We collect data mainly from Japanese researchers, but of course accept data and issue the accession number to researchers in any other countries. Since we exchange the collected data with EMBL/EBI and GenBank/NCBI on a daily basis, the three data banks share virtually the same data at any given time.
We also provide, worldwide, many tools for retrieval and analysis developed at DDBJ and elsewhere.
(02-09-30)
- E. coli Genome Project
http://www.genome.wisc.edu/
(mam:)
University of Wisconsin at Madison. Research use only; see
http://www.genome.wisc.edu/tools.htm
(site:)
Originally established to complete the sequence of the Escherichia coli K-12 genome. Building upon that foundation, we continue to maintain and update that annotated sequence, and are currently working on the functional characterization of E. coli K-12 genes and their regulation. We have also begun sequence analyses of several pathogenic E. coli strains and other pathogenic Enterobacteriaceae, with the goal of characterizing a gene pool of virulence determinants -- the "pathosphere".
(02-10-28)
- EMBL Nucleotide Sequence Database
http://www.ebi.ac.uk/embl/
(mam:)
Databank EMBL in SRS list; 18324138 entries in 12-Sep-20 release. Comprises Release, Updates, Whole Genomes Shotgun sequences, and a very few Third Party Annotation sequences.
(site:) Europe's primary nucleotide sequence data resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. Produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).
- Enzyme Commission: Nomenclature Committee of the International Union of Biochemistry and Molecular Biology
http://www.chem.qmul.ac.uk/iubmb/enzyme/
(interface:)
This may be too awkward and time-consuming for our annotators to use;
ExPASy, which is "primarily based" on this, seems a much better bet.
The nomenclature
interface returns links to all entries showing any match. Thus, a search for "lactase" in "biochemical nomenclature" returns six hits: the enzyme whose common name is "lactase"; another that is also called "lactase"; two whose names include the word; one with a reference to a document whose title includes it; and a list of all the EC 3.x enzymes, the hydrolases.
(site:)
The complete contents of Enzyme Nomenclature, 1992 (plus subsequent supplements and other changes) are listed in enzyme number order giving just the recommended name. Each entry provides a link to details of that enzyme. (02-10-01)
- EMP Enzymes and Metabolic Pathways database
http://www.empproject.com/
(interface:)
Entry to the nomenclature interface, comprising
Enzyme Nomenclature,
Compound Nomenclature,
Taxonomy, and
Transport Protein Classification. There are two views, Tree and List. In the List view of Enzymes, "Lactase" yields the usual two hits, lactase and beta-galactosidase, with references and so on. The Tree view is an outline-style view into the Enzyme Commission ontology, which the user expands level by level down to the specific hits. Unlike the EC search engine, this site seems to restrict its hits to synonyms, an advantage for us.
(02-11-02)
(site:)
a unique and most comprehensive electronic source of biochemical data. It covers all aspects of enzymology and metabolism and represents the whole factual content of original journal publications.
- EPD, the Eukaryotic Promoter Database
http://www.epd.isb-sib.ch/
(mam:)
Databank EPD in SRS list; 1405 entries.
(SRS page:)
A specialized annotation database of the EMBL Data Library. Provides information about eukaryotic promoters available in the EMBL Data Library; intended to assist experimental researchers, as well as computer analysts, in the investigation of eukaryotic transcription signals. Organized as a hierarchically ordered and documented "functional position set" pointing to transcription initiation sites. All information is directly abstracted from scientific literature and is thus independent of the EMBL sequence entry descriptions. As a consequence, many of the initiation sites referred to in EPD do not appear in corresponding EMBL feature tables.
(02-11-02)
- ExPASy ENZYME database
http://www.expasy.ch/enzyme/
(interface:)
Search from main page by official or alternative name or EC number. "Lactase" produces two hits.
(02-11-02)
(site:) ENZYME is a repository of information relative to the nomenclature of enzymes. It is primarily based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) and it describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided.
- FlyBase Database of the Drosophila Genome
http://flybase.bio.indiana.edu/
(interface:)
Terminology tree for fly body parts, which however I don't expect to be very useful to us.
(02-11-02)
(as:)
I think the molecular function and process might have more examples in
the literature that are missed in the database than the "Expressed in"
and Mutants Affect relation.
(site:)
A database of genetic and molecular data for Drosophila. Includes data on all species from the family Drosophilidae; the primary species represented is Drosophila melanogaster. Copying in whole or part for commercial uses requires written consent. Copying for non-commercial, scientific uses is permitted. Other copyrights pertain to portions of FlyBase from other sources.
- GenBank
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
(interface:)
Nomenclature page links to a number of standards and lists, some of which (e.g.,
Mouse Database: Genetic Markers and Synonyms) are potentially useful to us.
(02-11-02)
(site:)
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Approximately 22,617,000,000 bases in 18,197,000 sequence records as of August 2002. Part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.
(02-09-30)
- HGNC, the Human Gene Nomenclature Database
http://www.genenames.org/
(interface:) (sd:)
Gene symbols; will provide aliases for descriptions and gene symbols.
A copy can be downloaded. Should this be more tightly integrated with
WordFreak?
(02-12-17)
(mam:)
Run by HGNC,
the HUGO Gene Nomenclature Committee.
(HUGO = The Human Genome Organisation.)
(2007-09-19)
(site:)
We have approved over 24,000 human gene symbols and names. Each symbol is unique and we ensure that each gene is only given one approved gene symbol. Search the HGNC database for your gene. (2007-09-19)
- (GenPept Database Genetic Sequence Data Bank)
http://inn.weizmann.ac.il/databanks/genpept.html
(mam:)
This site doesn't seem to have a link to this data bank, just the description.
(02-11-02)
(site:)
Previous releases of GenPept were produced by translating the GenBank flat file release. Beginning with Release 70, GenBank entries include a translation of each valid CDS. This information is associated with each CDS through the addition of a /translation qualifier in the GenBank Feature Table. The GenPept amino acid sequence is simply copied from the value of the /translation qualifier. We continue to produce GenPept because its data format is useful for similarity searching with existing software.
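The copying step described above, taking the amino acid sequence from the value of each /translation qualifier, can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure GenPept actually uses: the function name `extract_translations` is hypothetical, the sample feature-table fragment is made up, and the sketch assumes each qualifier value is enclosed in double quotes and may wrap across several lines.

```python
import re

# A /translation value is double-quoted and may wrap over several lines.
TRANSLATION = re.compile(r'/translation="([^"]*)"')

def extract_translations(flat_file_text):
    """Return each /translation value with line breaks and padding removed."""
    return ["".join(value.split())
            for value in TRANSLATION.findall(flat_file_text)]

# Made-up feature-table fragment, for illustration only.
sample = '''     CDS             1..63
                     /product="hypothetical example protein"
                     /translation="MKTAYIAKQR
                     QISFVKSHFS"
'''

for sequence in extract_translations(sample):
    print(sequence)
```

For real records, an established parser for the GenBank flat-file format would be a safer choice than a regular expression.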
- Crossfire Gmelin database: inorganic and organometallic compounds
http://www.mdl.com/products/xfiregmelin.html
(interface:)
Penn subscription (via pw): http://www.library.upenn.edu/webbin5/resources/client-select.cgi?gmelin
(mam:) The Windows installation of Beilstein Commander, which seems to include the Gmelin database as well, occupies 31MB, using 48.1MB of disk space on my Windows 2000 machine.
(02-11-20)
(site:)
the world's most comprehensive data collection in organometallic and inorganic chemistry, covering literature from the year 1772 to today. Contains 1.6 million compounds, 1.3 million structures, 1.3 million reactions, and 900,000 citations, including titles and abstracts from 1995. Fully searchable by structures, substructures, and reactions. The data are indexed from 61 journals.
- GO Annotation, GOA@EBI
http://www.ebi.ac.uk/GOA/
(mam:)
Databank GOA in SRS list; virtual library whose components, GOASPTR and GOAHUMAN, have 2193016 and 77743 entries respectively. I can't tell from the descriptions how the two components differ.
(site:) In the GOA project, the GO vocabulary will be applied to a non-redundant set of proteins described in the SwissProt, TrEMBL and Ensembl databases that collectively provide complete proteomes for Homo sapiens and other organisms. -- In the first stage of this project, GO assignments have been applied to a data set representing the human proteome by a combination of electronic mappings and manual curation.
(EBI info page:)
Provides associations of terms from the Gene Ontology (GO) with entries in protein sequence databases (SWISS-PROT, TrEMBL and Ensembl). The project currently covers all GO annotations that exist in SWISS-PROT and TrEMBL and includes annotation for the SWISS-PROT/TrEMBL/Ensembl non-redundant human proteome set.
- GRID, General Repository for Interaction Datasets
http://biodata.mshri.on.ca/grid/servlet/Index
(site:)
a database of genetic and physical interactions developed at Mount Sinai Hospital, Toronto, Ontario. Interaction data from many sources, including several genome/proteome-wide studies, the MIPS database, and BIND.
(as:)
field entries come from the Gene Ontology
(02-09-27)
- HOBACGEN: Homologous Bacterial Genes Database
http://pbil.univ-lyon1.fr/databases/hobacgen.html
(site:)
All the protein sequences of bacteria organized into families. Particularly useful for comparative genomics, phylogeny and molecular evolution studies on bacteria. Contains all sequences of bacteria (eubacteria and archaea) and yeast taken from SWISS-PROT + TrEMBL and EMBL, with some annotation modifications to incorporate complementary data related to families and protein domains.
(02-10-29)
- IMGT/LIGM-DB
http://www.ebi.ac.uk/imgt/
(mam:)
Databank IMGT/LIGM-DB in SRS list; 62884 entries
(interface:)
Taxonomy search;
links to taxonomies.
(02-11-02)
(site:)
LIGM-DB, a comprehensive database of Immunoglobulins and T cell receptors from human and other vertebrates, with translation for fully annotated sequences. The IMGT server (Montpellier, France: http://imgt.cines.fr/) provides, via an easy to use and friendly interface, a common access to all Immunogenetics data, including the IMGT Repertoire based on the IMGT Scientific chart.
- IMGT/HLA
http://www.ebi.ac.uk/imgt/hla/
(mam:)
Databank IMGTHLA in SRS list; 1546 entries
(site:)
Part of IMGT project; provides a specialist sequence database for sequences of the human major histocompatibility complex (HLA). This includes all official sequences for the WHO HLA Nomenclature Committee For Factors of the HLA System.
- IPI, International Protein Index
http://www.ebi.ac.uk/IPI/IPIhelp.html
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/
(mam:)
Databank IPI in SRS list; 78584 entries.
A cross-reference guide for "the main databases that describe the human and mouse proteomes: SWISS-PROT, TrEMBL, RefSeq and Ensembl".
(02-11-02)
- InterPro database, Integrated Resource of Protein Domains and Functional Sites
http://www.ebi.ac.uk/interpro/
ftp://ftp.ebi.ac.uk/pub/databases/interpro/
(mam:)
Databank INTERPRO in SRS list; 5875 entries
(SRS:)
An integrated documentation resource for protein families, domains and functional sites. Includes the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects, and now also the SMART database.
[They also reference TIGRFAMS.]
Where applicable, entries contain cross-references to the BLOCKS database, and mappings to the appropriate Gene Ontology terms. The InterProMatches data contains all matches of InterPro member database protein signatures against SwissProt and TrEMBL. InterProScan allows users access to a wider, complementary range of site and domain recognition methods in a single package, providing a useful tool for protein annotation.
- (IUPAC Nomenclature Recommendations)
http://www.chem.qmul.ac.uk/iupac/
(mam:)
The site's actual title is International Union Of Pure And Applied Chemistry: Recommendations on Organic & Biochemical Nomenclature, Symbols & Terminology etc.
Although it looks promising, this site consists of recommendations on how to name chemical entities, and includes actual names mostly as examples.
Here are two examples that I came across by other routes:
(02-11-02)
- (IUPAC Lexicon of lipid nutrition)
http://www.iupac.org/publications/pac/2001/pdf/7304x0685.pdf
(mam:)
Definitions of 750-800 terms. There is a very small (~20 entries) table of synonyms for fatty acids on p. 46.
(02-11-02)
(site:)
Pure and Applied Chemistry Vol. 73, Issue 4 -- Joint Committee of International Union of Nutritional Sciences and IUPAC Commission On Food: Lexicon of lipid nutrition (IUPAC Technical Report), J. Beare-Rogers et al.
- (IUPAC Nomenclature of Organic Chemistry)
http://www.acdlabs.com/iupac/nomenclature/
(mam:)
How to name a compound, not names of compounds per se. What's more, the search engine (Excite!) is dead. Whether you search for "lactase", "benzene", or "hydrogen", you get "Error: No Index to query on."
(02-11-02)
(site:) for certain types of compounds, there is significant disagreement among chemists in different fields as to what should be the preferred nomenclature. In this Guide, the primary aim is to provide directions for arriving at an unambiguous name.
- KEGG, Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
(as:)
Basically it looks like using KEGG if possible requires a little more
thought about how to define the problem. ... Random Thought -- This is another excellent resource for compound names.
(site:)
an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects. Based at Bioinformatics Center, Institute for Chemical Research, Kyoto University
(02-09-27)
- Merck Index
http://www.merck.com/pubs/mindex/online.html
(interface:)
See site comment.
(02-11-11)
(pw:)
For chemical compounds, I don't see the Merck Index available at the Penn
library, although it is available via subscription here.
(site:)
The web accessible version of The Merck Index, Thirteenth Edition (2001). Contains the text and structures of the monographs, the supplementary tables section and the Organic Name Reactions section. This product features powerful text and substructure searching tools for exploring the database.
- MGD, Mouse Genome Database
http://www.informatics.jax.org/
(as:)
Phenotype classification terms are provided by MGD.
... an example where curation of the database includes a natural
language summary (which we won't be getting into yet).
(interface:)
Search via "quick forms" for an identifier, or a fuller search that may yield matches in multiple categories.
(02-11-04)
(site:)
Contains information on mouse genetic markers, molecular segments, phenotypes, comparative mapping data, experimental mapping data, and graphical displays for genetic, physical, and cytogenetic maps.
- MHCBN, Comprehensive Database of MHC Binding and Non-Binding Peptides
http://www.imtech.res.in/raghava/mhcbn/
(mam:)
Databank MHCBN in SRS list; 19777 entries
Unable to access page 02-11-05
(site:)
A curated database of Major Histocompatibility Complex (MHC) Binding, Non-binding peptides and T-cell epitopes.
- MIPS, Munich Information Center for Protein Sequences
http://mips.gsf.de/
(site:)
Main focus is the genome oriented bioinformatics, in particular the systematic analysis of genome information including the development and application of bioinformatics methods in genome annotation, expression analysis and proteomics. MIPS supports and maintains a set of generic databases as well as the systematic comparative analysis of microbial, fungal, and plant genomes. -- MIPS Databases and associated information are protected by copyright. Commercial users should contact Biomax Informatics AG for license rights to use. Only academic and non-commercial users are permitted access to this server without a commercial license. Use of this server for commercial purposes, distribution of MIPS database files to third parties, and the distribution of parts of files or derivative products to any third parties is prohibited.
(02-09-27)
- Mutation Nomenclature
http://www.genomic.unimelb.edu.au/mdi/mutnomen/
(mam:)
Nomenclature for the description of sequence variations: a proposal-in-development for systematically naming variations in genetic material. By the Human Genome Variation Society (formerly the HUGO Mutation Database Initiative).
(03-01-21)
(pw:)
This is a description of the standard form that all events should be
reported with. Unfortunately for us, it is very poorly adhered to in the
literature, and a standard form is not enforced or even recommended by most
major journals. Of course, that may be good news for you.
(site:)
Recently, a nomenclature system has been suggested for the description of sequence variants (mutations, polymorphisms) in DNA and protein sequences. These nomenclature recommendations have now been largely accepted and they stimulated a uniform and unequivocal description of sequence variants. Current rules however do not yet cover all types of variants nor do they cover more complex changes. What is listed here should represent the current consensus of the discussions. These pages should be used as a guide to describe any sequence variant and ultimately evolve into a uniformly accepted standard.
- NDC: National Drug Code Directory
http://www.fda.gov/cder/ndc/database/default.htm
(interface:)
There are electronic data files of currently marketed products. The file layout description describes the files' content.
ZIPTEXT.EXE is a zipped executable file that can be downloaded and, using a PC database management system such as Access, be formatted for query and reporting. (02-11-20)
(mam:)
Useful if we want an index by trade names. No hits for "lactase" as ingredient, 10 for "lactose".
(site:)
The NDC serves as a universal product identifier for human drugs. Limited to prescription drugs and a few selected OTC products that have completed the listing process. Information as reported by the listing firm. Each drug product listed is assigned a unique 10-digit, 3-segment number, known as the National Drug Code (NDC), in one of the configurations 4-4-2, 5-3-2, or 5-4-1: labeler code (firm that manufactures, repacks or distributes a drug product), product code* (strength, dosage form, and formulation for a particular firm), package code* (package sizes).
(For consistency, other Government agencies may display the NDC in an eleven digit format, e.g., 5-4-2 with leading zeros.) --
*Assigned by the firm.
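The relation between the 10-digit and 11-digit forms described above can be sketched in a few lines of Python. This is only an illustration under the assumption that the NDC is written with hyphens between its three segments; the function name and the sample numbers are made up for the example and are not taken from the FDA files.

```python
# Sketch: pad a hyphenated 10-digit NDC (4-4-2, 5-3-2, or 5-4-1)
# to the 11-digit 5-4-2 display format by adding leading zeros.
# The function name and the example codes below are hypothetical.

VALID_LAYOUTS = {(4, 4, 2), (5, 3, 2), (5, 4, 1)}


def ndc_to_11_digit(ndc: str) -> str:
    """Return the 5-4-2 form of a hyphenated 10-digit NDC."""
    segments = ndc.strip().split("-")
    layout = tuple(len(s) for s in segments)
    if layout not in VALID_LAYOUTS or not all(s.isdigit() for s in segments):
        raise ValueError(f"not a 10-digit NDC in a known layout: {ndc!r}")
    labeler, product, package = segments
    # Each segment is padded on the left to its width in the 5-4-2 layout.
    return f"{labeler.zfill(5)}-{product.zfill(4)}-{package.zfill(2)}"


if __name__ == "__main__":
    for code in ("1234-5678-90", "12345-678-90", "12345-6789-0"):
        print(code, "->", ndc_to_11_digit(code))
```

Without the hyphens a 10-digit string is ambiguous among the three layouts, which is why the sketch insists on the segmented form.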
- NCI Metathesaurus
http://ncimeta.nci.nih.gov/indexMetaphrase.html
(interface:)
Browser returns the expected two hits for "lactase", each with its own list of synonyms and ontological links. -- (pw:) It is available to us either as tab-delimited files or XML. It also can be accessed via an API if we need it and someone is
interested in doing so. -- NCI is happy to help to support our project.
(pw:)
This integrates many of the controlled vocabularies we are interested in and
is almost perfect for our cancer project.
This resource has already accomplished what we are
currently trying to do regarding entity classification/hierarchy structure.
To get an idea of the classification scheme, go to:
http://ncimeta.nci.nih.gov/indexMetaphrase.html
and click on "Browse" at right.
-- However, the controlled vocabularies for each entity class are
cancer-specific, so while the structure is applicable for other biomedical
projects, the vocabularies are not. The NCI is sending me more documentation
regarding the resource [02-11-12], which I'll pass on when I receive it. My
understanding is that there will be no licensing issues as long as we are
not providing it as a commercial service and get permission from the
contributing sources-- NCI has offered to help us with that.
(02-11-21)
(site:)
a comprehensive biomedical terminology database, currently containing 850,000 concepts mapped to 1,500,000 terms by over 4,500,000 relationships. Produced by the NCI Center for Bioinformatics Enterprise Vocabulary Service. The public version currently contains all public domain vocabularies from the National Library of Medicine's UMLS Metathesaurus, as well as a growing number of NCI-specific vocabularies developed by the National Cancer Institute
- (Patent listings accessible through the SRS list)
Probably all of limited use to us, at best.
- (Patent_prt, a library of patented protein sequences from the European Patent Office in the public domain)
http://www.es.embnet.org/Services/ftp/databases/embl/patent/??
(mam:)
Databank PATENT_PRT in SRS list; 156622 entries
(02-10-07)
- (Jpo_prt, a library of patented protein sequences from the Japanese Patent Office in the public domain)
(no URL found)
(mam:)
Databank JPO_PRT in SRS list; 26807 entries
(02-10-07)
- (Patent_Dna, a library of patented DNA sequences from the European Patent Office in the public domain)
(no URL found)
(mam:)
Databank PATENT_DNA in SRS list; 625664 entries
(02-10-07)
- (USPO_PRT, a library of patented protein sequences from the American Patent Office in the public domain)
(no URL found)
(mam:)
Databank USPO_PRT in SRS list; 154302 entries
(02-10-08)
- PDB, Protein Data Bank
http://www.rcsb.org/pdb/
(site:)
worldwide repository for the processing and distribution of 3-D biological macromolecular structure data; operated by Rutgers University, the San Diego Supercomputer Center at UCSD, and NIST
(02-10-01)
- Pfam, Protein families database of alignments and HMMs
http://www.sanger.ac.uk/Software/Pfam/index.shtml
(site:)
a collection of protein families and domains containing multiple protein alignments and profile-HMMs of these families. A semi-automatic protein family database, which aims to be comprehensive as well as accurate.
(02-10-01)
- PIR-NREF, PIR Non-Redundant Reference Protein Database
http://pir.georgetown.edu/pirwww/search/pirnref.shtml
(interface:)
Web search for protein name = "lactase" yielded 140 hits, not a useful source for synonyms. But the database (free download) may be amenable to searches more focused on our purposes; or we may find that other sources already give better access to synonymies or ontologies of the same underlying data.
(02-11-05)
(site:)
PIR-NREF current release 1.09, 4-Nov-2002 contains 1,042,859 entries.
Available for free downloading and redistribution from our FTP site in XML format (data file) and FASTA format (sequence file).
Contains all sequences in PIR-PSD, SwissProt, TrEMBL, RefSeq, GenPept, and PDB. Identical sequences from the same source organism (species) reported in different databases are presented as a single NREF entry with protein IDs and names from each underlying database.
(02-09-30)
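The non-redundancy rule in the site comment (identical sequence plus same source organism gives one entry that keeps every database's ID and name) can be sketched as a simple grouping. This is a toy illustration, not PIR's actual procedure; the record layout, the function name, and all IDs and sequences below are invented for the example.

```python
# Sketch: collapse records that share organism and sequence into one entry
# that keeps the (database, ID, name) of every contributing record.
# All field names and sample values here are hypothetical.
from collections import defaultdict
from typing import NamedTuple


class SourceRecord(NamedTuple):
    database: str
    protein_id: str
    name: str
    organism: str
    sequence: str


def merge_nonredundant(records):
    """Group records by (organism, sequence); one group = one merged entry."""
    merged = defaultdict(list)
    for rec in records:
        key = (rec.organism, rec.sequence)
        merged[key].append((rec.database, rec.protein_id, rec.name))
    return dict(merged)


RECORDS = [
    SourceRecord("DB_A", "A0001", "lactase", "Homo sapiens", "MKTLLV"),
    SourceRecord("DB_B", "B9999", "lactase precursor", "Homo sapiens", "MKTLLV"),
    SourceRecord("DB_A", "A0002", "lactase", "Mus musculus", "MKTLLV"),
]

if __name__ == "__main__":
    for (organism, sequence), sources in merge_nonredundant(RECORDS).items():
        print(organism, sequence, sources)
```

A merged entry of this kind is where name variants from different databases end up side by side, which is the part of interest for synonym hunting.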
- PIR-PSD, PIR-International Protein Sequence Database
http://pir.georgetown.edu/pirwww/dbinfo/pirpsd.html
(site:)
the most comprehensive and expertly annotated protein sequence database in the public domain. Primary sources are naturally occurring wild-type sequences from GenBank/EMBL/DDBJ translations, published literature, and direct submissions. Comprehensive coverage across the entire taxonomic range, including sequences from publicly available complete genomes.
(02-09-27)
- PlasmoDB, The Plasmodium genome database
http://plasmodb.org/
http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/map00?taxid=5833
(mam:)
New release and local: The Plasmodium Genome Database Collaborative, The Departments of Biology and Genetics, Center for Bioinformatics and Genomics Institute, University of Pennsylvania;
Dr. David Roos, droos@sas.upenn.edu
(aa:)
[referring to earlier version] I don't see anything here yet that we can use. I will ask whether there is anything they can recommend.
(site:)
PlasmoDB 4.0 provides a greatly enhanced and expanded database, released in conjunction with publication of the complete parasite genome sequence (3 Oct 2002 issue of Nature). Incorporates DNA sequence data and curated annotations from the genome sequencing centers; ... and other genomic-scale information relevant to malaria research. A variety of tools are available. Based on a relational database management system with a rich schema.
(02-10-04)
- PRINTS Protein Fingerprint Database
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
(site:)
a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT/TrEMBL composite. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbours
(02-10-01)
- ProDom Protein Domain Database
http://prodes.toulouse.inra.fr/prodom/doc/prodom.html
(mam:)
Databank PRODOM in SRS list; 271051 entries.
Built from SwissProt and TrEMBL
I haven't found an explanation of what this is about. Of course, these websites are aimed at biologists, who may have no trouble at all understanding them.
(site:)
The Protein Domain database, ProDom, release 99.2, has been constructed by clustering homologous segments derived from non fragmentary sequences from SWISS-PROT 37 + TREMBL + TREMBL updates - April 26, 1999.
The current version of ProDom...
1382 ProDom families were generated automatically using PSI-BLAST with a profile built from the seed alignments of Pfam-A 3.4 families. In addition a new expertise procedure has been introduced to validate some domain boundaries. (02-10-01)
- PROSITE Database of protein families and domains
http://us.expasy.org/prosite/
(mam:)
Databank PROSITE in SRS list; 1568 entries
(site:)
consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs (02-10-01)
(SRS:)
The PROSITE data bank is composed of two ASCII files. PROSITE.DAT contains all the information necessary to programs that will scan sequence(s) with patterns and/or matrices. PROSITE.DOC contains textual information that fully documents each pattern and profile. We strongly urge software developers to build software tools that make use of both files.
- Rebase, The Restriction Enzyme Database
http://www.neb.com/rebase
(mam:)
Databank REBASE in SRS list; 4305 entries.
When I tried to access the website I got a 403 (Forbidden) error message:
"You are not authorized to view this page. You might not have permission to view this directory or page using the credentials you supplied."
(SRS page:)
The Restriction Enzyme Database is a collection of information about restriction enzymes, methylases, the microorganisms from which they have been isolated, recognition sequences, cleavage sites, methylation specificity, the commercial availability of the enzymes, and references - both published and unpublished observations (dating back to 1952).
(02-10-06)
- RefSeq, NCBI Reference Sequence project
http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html;
ftp://ncbi.nlm.nih.gov
(mam:)
Databank REFSEQ in SRS list; 91453 entries
(site:)
The NCBI Reference Sequence project will provide reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins, to provide sequences that can be more easily maintained, and which avoid the redundancy often present in the GenBank database archive. RefSeq projects use distinct processes to provide different types of records. Currently three primary projects:
- Curated RefSeq (regions, transcripts and proteins)
- Genome Annotation (contigs, transcripts, and proteins)
- Complete Genomes (genomes, chromosomes, and proteins)
(site representative:)
Refseq is the nucleic acid reference sequence database produced by the
NCBI using data from the tri-lateral collaboration between EMBL, DDBJ
and GenBank.
(02-09-27)
- RefSeqP, the protein section of RefSeq
ftp://ncbi.nlm.nih.gov/refseq/cumulative/
(site representative:)
[When I first looked at EBI's listings for REFSEQ and REFSEQP the descriptions were identical apart from the URIs and the number of entries of each type.
I asked and received this answer. -- mam]
Refseqp is the protein sequence equivalent to refseq. Refseq contains
reference genome sequences from many sources. In some cases a single
entry can contain a whole chromosome or even an entire genome. These
sequences in turn can contain several thousand coding regions which in
turn translate to protein sequence. It is because of this that there are
many more protein sequences in refseqp than DNA sequences in refseq.
(02-09-27)
- RiboWEB
http://www-smi.stanford.edu/projects/helix/riboweb/kb-pub.html
(site:)
A knowledge base containing templates (or structured representations) for the most common experiments used to study the structure of RNA/Protein complexes. It is currently populated with information (almost 5800 individual observations) from 150 articles about the 30S ribosomal subunit in procaryotes. Produced by members of the Altman lab at Stanford Medical Informatics.
(02-10-27)
- (SGD, Saccharomyces Genome Database)
http://genome-www.stanford.edu/Saccharomyces/
(as:)
Lots of info, but I can't find the sort we would hope to extract from
text. Anybody else have ideas here?
(site:)
A scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast.
(02-09-27)
- SANSS, Structure and Nomenclature Search System database [commercial]
http://www.nisc.com/cis/details/sanss.htm
(mam:)
by CIS (Chem. Info. Sys.), now owned by NISC (Natl Info Svcs Corp.)
(site:) "SANSS is designed to contain an entry for each compound included in the other individual CIS databases. It also provides cross-reference referral capabilities to many other sources of chemical information, enabling you to find additional data that may not be available online through CIS." Lists: CAS Registry Number, Chemical Abstracts Service name, Synonyms and trade names, Molecular formula, Molecular weight, Structural diagram
- SMART, Simple Modular Architecture Research Tool
http://smart.embl-heidelberg.de/
(site:)
allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.
(02-10-01)
- SMILES
http://dtp.nci.nih.gov/docs/3d_database/structural_information/smiles_strings.html
(site:)
This database contains essentially all open structures in the NCI database up until about June, 1995. It includes metal-containing compounds and other 'weird stuff'. It is therefore up to the user to ascertain the usefulness of any of these SMILES strings for the intended purpose.
(02-11-14)
(as:)
the nicest list I have seen so far. The web page says that there are 45,228 compounds. The file I downloaded has 216,089 lines. I guess this means there are repeats or redundant names. Note that this is an NCI database, so the contents may be skewed towards cancer related compounds.
(interface:) (mam:)
You can FTP the database in either of two formats. --
(as:)
Here are some sample lines:
1 2-Methylbenzoquinone-1,4
1 2-Methylquinone
1 2,5-Cyclohexadiene-1,4-dione, 2-methyl- (9CI)
2 Accel TM
2 Altax
2 Benzothiazol-2-yl disulfide
2 Benzothiazole disulfide
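The sample lines suggest that each line pairs a compound number with one of its names, which would explain why the file has many more lines than compounds. Under that assumption (number, whitespace, then the name to end of line), a short Python sketch for collecting the synonyms of each compound might look like this; the function name is made up, and the real file may have a header or other columns that this ignores.

```python
# Sketch: group names by the leading compound number, assuming each line is
# "<number><whitespace><name>" as in the sample lines quoted above.
from collections import defaultdict

SAMPLE = """\
1 2-Methylbenzoquinone-1,4
1 2-Methylquinone
1 2,5-Cyclohexadiene-1,4-dione, 2-methyl- (9CI)
2 Accel TM
2 Altax
2 Benzothiazol-2-yl disulfide
2 Benzothiazole disulfide
"""


def group_synonyms(lines):
    """Map each compound number (as a string) to its list of distinct names."""
    groups = defaultdict(list)
    for line in lines:
        # Split only once so that names containing spaces stay intact.
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        compound_id, name = parts
        if name not in groups[compound_id]:
            groups[compound_id].append(name)
    return dict(groups)


if __name__ == "__main__":
    for compound_id, names in group_synonyms(SAMPLE.splitlines()).items():
        print(compound_id, len(names), names)
```

Counting the keys of the result against the number of input lines would be one way to check the guess about repeated or redundant names.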
- SNOMED, Systematized Nomenclature of Medicine [commercial]
http://www.snomed.org/
(mam:)
Root hierarchies of possible interest to us include
- Body Structures
- Biological Functions
- Living Organisms
- Substances (Chemicals & Drugs)
(site:)
SNOMED version 3.5 allowed clinicians to index clinical findings, including morphologic changes and living organisms, patient signs and symptoms, reason for visit, occupation, and resulting diagnoses. SNOMED's depth of coverage includes a range of diagnostic and therapeutic procedures and treatments that enable one to document the continuum of care.
(02-10-04)
- SwissProt knowledgebase
http://www.expasy.ch/sprot/sprot_details.html
(mam:)
used by ExPASy
(site:) a protein knowledgebase maintained by the Swiss Institute of Bioinformatics and the EMBL Outstation - The European Bioinformatics Institute (EBI). The SWISS-PROT protein knowledgebase consists of sequence entries composed of different line-types, each with their own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria: Annotation, Minimal Redundancy, and Integration with Other Databases.
- Swiss-Prot nomenclature document list
http://us.expasy.org/sprot/sp-docu.html#nomenclature
(mam:)
About a dozen links to sources for nomenclature in specific areas such as "Blood group antigens proteins" and "Nomenclature and index of allergen sequences". Some of them are lists of names, but others are just statements of nomenclature policies or conventions. I particularly recommend
Definitions of the terminology for ambiguities
(02-11-20)
(interface:)
Most of these are small text files.
- TIGRFAMs protein families database
http://www.tigr.org/TIGRFAMs/
(mam:)
The Institute for Genomic Research
(site:)
TIGRFAMs are protein families based on Hidden Markov Models or HMMs. Use this page to see the curated seed alignment for each TIGRFAM, the full alignment of all family members and the cutoff scores for inclusion in each of the TIGRFAMs.
(02-10-01)
- (TrEMBL)
http://us.expasy.org/sprot/
(mam:)
a temporary "buffer" db?
(site:) TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
- Binghamton University Library guide to chemical nomenclature
http://library.lib.binghamton.edu/subjects/chemistry/nomen.html
(mam:) biblio
- EBI, European Bioinformatics Institute SRS
http://srs.ebi.ac.uk
(Bioinformatics 2002 Aug, from MedLine:) SRS has become an integration system for both data retrieval and sequence analysis applications. The EBI SRS server is a primary gateway to major databases in the field of molecular biology produced and supported at EBI as well as European public access point to the MEDLINE database provided by US National Library of Medicine (NLM). It is a reference server for latest developments in data and application integration. The new additions include: concept of virtual databases, integration of XML databases like the Integrated Resource of Protein Domains and Functional Sites (InterPro), Gene Ontology (GO), MEDLINE, Metabolic pathways, etc., user friendly data representation in 'Nice views', SRSQuickSearch bookmarklets.
(mam:)
Their top page lists their databases by category.
(02-10-31)
- Ingenta Biotechnology Resources [commercial]
http://www.ingenta.com/isis/browsing/VisitSubjectResource/ingenta?subject=159
(site's Home:) Since its launch in May 1998, Ingenta has developed and grown to become the leading Web infomediary empowering the exchange of academic and professional content online. Now, with the acquisition of Catchword, Ingenta supplies unsurpassed access to 5,400+ full-text online publications and 26,000+ publications
- (Univ. of Luebeck: Medical Terminology [List and index])
http://www.medinf.mu-luebeck.de/~ingenerf/terminology/
(site:)
This collection of annotated links to internet resources provides a general view of what is accessible and available online in the field of Medical Terminology.
(mam:)
Apparently unavailable:
HTTP 404 - Datei nicht gefunden.
Try again later. (02-11-02)
It includes a page of
Basic Sciences links: Terminology, Ontology, Artificial Intelligence, Knowledge Representation, Computational Linguistics, and Information Retrieval. Their domain of interest seems to be more medical practice than biomedical research.
- U Penn Library Resource Guide: Information on Chemical Nomenclature
http://www.library.upenn.edu/scitech/chemistry/guides/chemnom.html
(mam:) bibliography
- (Pharma-Lexicon)
http://www.pharma-lexicon.com/
(site:)
Look up medical & pharmaceutical acronyms and abbreviations from our database of over 56,000. For our advanced searching system why not become a member? (02-10-04)
(mam:)
Web lookup: Medical Abbreviations (FDA, CJD, ...), Pharmaceutical Companies ["Glaxo" hit 38 subsidiaries(?) with "Glaxo" in their names, but "GSK" drew a blank], Associations, Medical Articles [uses Medline], Drug Search ["Sudafed" got 9 hits with their generic names, mostly like SUDAFED SEVER (a different drug) and including the generic STAVUDINE (by fuzzy matching)], Merck Manual, Medical Books, Public Health Image Search, Clinical Trials, English Dictionary, Thesaurus, Google Search.
- SHIGEN, Shared Information of Genetic Resources
http://www.grs.nig.ac.jp/
(mam:)
Run by the National Institute of Genetics, Japan. (02-10-02)
(site:)
- SRS: see EBI
- Univ. of Waterloo Library Guide to Chemical Nomenclature and Terminology
http://www.lib.uwaterloo.ca/libguides/8-6.html
(mam:) includes bibliography of printed standards and rules
- PubMed/Medline
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
(site:) PubMed, a service of the National Library of Medicine, provides access to over 12 million MEDLINE citations back to the mid-1960's and additional life science journals. PubMed includes links to many sites providing full text articles and other related resources.
- Apollo Genome Browser
http://www.ensembl.org/apollo/
(site:)
Apollo is a collaborative project between the Berkeley Drosophila Genome Project (www.bdgp.org) and Ensembl. The collaboration was set up to create a tool to initially annotate fly but which would also be able to annotate and browse any large eukaryotic genome. There is a sister developers' website at www.fruitfly.org/annot/apollo to download the fly specific apollo annotation tool. All the code is open source and freely downloadable.
- BLAST®, Basic Local Alignment Search Tool
http://www.ncbi.nlm.nih.gov/BLAST/
(site:)
a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
(02-10-01)
- caCORE: The NCICB Cancer Informatics Infrastructure Backbone
http://ncicb.nci.nih.gov/core
(mam:)
Run by the NCI [National Cancer Institute] Center for Bioinformatics
(site:)
The caCORE packages are intended to serve as an infrastructure "backbone" for Life Science informatics development, in support of cancer research. caCORE provides software, programming interfaces, controlled vocabulary, and meta-data standards that enable data-rich, interoperable applications to be developed.
Our aim is to integrate the various APIs to caCORE into a comprehensive programming environment for cancer biomedical informatics. caCORE 1.0 is now available for public use, with a comprehensive Technical Guide.
(02-10-08)
- Daylight Chemical Information Systems, Inc., products page
http://www.daylight.com/products/
(site:) software and developer libraries which can be integrated to build chemical information infrastructures across a broad range of informatics requirements
- (DNASTAR [commercial])
http://www.dnastar.com/cgi-bin/php.cgi?index4.php
(site:)
DNASTAR develops and supports the best sequence analysis software for life scientists in Pharmaceutical, Biotechnology, Academic and Government Organizations worldwide. (02-12-02)
- Ensembl [sic] Genome Browser Project
http://www.ensembl.org/
(site:)
A joint project between EMBL - EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Ensembl is primarily funded by the Wellcome Trust. Access to all the data produced by the project, and to the software used to analyse and present it, is provided free and without constraints. Presents up-to-date sequence data and the best possible automatic annotation for eukaryotic genomes.
-- The Ensembl trace repository provides a permanent archive for single pass DNA sequencing reads. It is exchanging data regularly with the NCBI trace archive.
(mam:)
Species available: Human, mouse, zebrafish, fugu, mosquito, with varying degrees of completeness, accuracy, and data assembly. EBI says they also have fruit fly data and gives an FTP link for it, but I don't see mention of drosophila on Ensembl's own site. There are about 25 species listed for the trace repository.
- NameSort for Excel - sort your chemicals
http://www.signet-uk.com/namesort.html
(mam:) commercial sorting software
(site:) Want to sort your data - no longer split your names into different fields, index your data regularly and easily using NameSort - a flexible utility that ignores case and punctuation. The source of the data is unimportant - upper or lower case, a mixture of different conventions, names input by many people over a period of time.
- $33 - academic/personal use
- $49 - commercial/single user
- $99 - commercial - 5 users/academic full license
- OpenGALEN
http://www.opengalen.org/
(site:) a technology that is designed to represent clinical information in a new way, and is intended to "put the clinical into the clinical workstation". GALEN produces a computer-based multilingual coding system for medicine, using a qualitatively different approach from those used in the past.
(mam:) Not-for-profit, growing out of European Community projects. Includes "OpenKnoME - open source ontological engineering toolset".
- STN commercial database provider, incl. chem & pharm.
http://www.cas.org/stn.html
(mam:) has a page on CAS website
See also Information Extraction Projects.
- Carrara & Guarino 1999.
Formal Ontology and Conceptual Analysis: A Structured Bibliography
http://www.ladseb.pd.cnr.it/infor/ontology/Papers/Ontobiblio/TOC.html
(site:)
Mainly intended for computer scientists working in areas such as AI, Databases, or Computational Linguistics, with a specific interest in Conceptual Analysis issues (the analysis and the formalization of the domain structure). A rather exhaustive bibliography on ontology design, complemented by more general papers on the role of ontological research in various areas of computer science, and an extensive selection of philosophical and linguistic papers.
(02-10-29)
- Enrico Franconi's list of KR/DB Conferences and Journal CFPs
http://www.cs.man.ac.uk/~franconi/ontology.html
(mam:)
A CS professor at the Free University of Bozen-Bolzano, Italy; well, that's what his web site says, even though it's at the University of Manchester, England. Looks like he's working there.
(02-10-06)
- KSL: Knowledge Systems Laboratory, Stanford University
http://www.ksl.stanford.edu/
(mam:)
Runs Ontolingua
(site:)
conducts research in the core AI areas of knowledge representation and reasoning, within the Department of Computer Science at Stanford University.
(02-10-29)
- Ontolingua
http://www.ksl.stanford.edu/software/ontolingua/
(mam:)
Run by KSL.
(site:)
Ontolingua provides a distributed collaborative environment to browse, create, edit, modify, and use ontologies. The server supports over 150 active users, some of whom have provided us with descriptions of their projects. Also found here are a number of additional services such as a Webster gateway, CML model fragment editor, a multiple perspective data structure inspector, equation solving, automatic explanation of device behavior, and so on.
(02-10-29)
- Stanford Univ., Sites relevant to ontologies and knowledge sharing
http://ksl-web.stanford.edu/kst/ontology-sources.html
(site:) Specific Projects, General Resources Pages, and Sources for Implemented Ontologies.
(mam:) Maintained by Richard Fikes; Last modified Tue Jan 16 2001
- Stanford Univ., What is an ontology?
http://ksl-web.stanford.edu/kst/what-is-an-ontology.html
(mam:) definition and theory by Tom Gruber
- Univ. of Texas, Some Ongoing KBS/Ontology Projects and Groups
http://www.cs.utexas.edu/users/mfkb/related.html
(site:) Also: [ Mailing Lists | On-Line Proceedings | Some Conferences and Workshops ]
- Wilbur et al. (n.d.) Analysis of Biomedical Text for Chemical Names: A Comparison of Three Methods
http://skr.nlm.nih.gov/papers/references/chemicals.pdf
(site:)
In order to improve NLM's capabilities in chemical text processing, two approaches to the problem of recognizing chemical nomenclature were explored. One of the statistical methods had an overall classification accuracy of 97%.
(02-10-29)
- NIST ACE program
http://www.itl.nist.gov/iaui/894.01/tests/ace/
(mam:)
See also PropBanking.
(site:)
Objective: to develop automatic content extraction technology to support automatic processing of human language in text form: newswire, broadcast news (text via ASR), and newspaper (text via OCR).
(02-10-10)
- AGTK: Annotation Graph Toolkit
http://agtk.sourceforge.net/
(site:)
Annotation Graphs are a formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems. Based at the LDC.
(03-01-13)
- ACL-02 Workshop on Natural Language Processing in the Biomedical Domain
http://acl.ldc.upenn.edu/acl2002/BIO/index.htm
(mam:)
July 11, 2002, at UPenn
(myl:)
Thanks to Steven Bird, we have this temporary access pending proper incorporation of these materials into the anthology.
(site:)
The aim of this workshop is to focus on challenges in processing biomedical language and to present results in developing techniques for this domain. This domain presents many opportunities for NLP technologies such as information extraction from biomedical texts, document and answer retrieval from large, unstructured text collections (such as the biomedical literature and the World Wide Web), and interaction with users through natural language.
(02-10-21)
- COCOSDA Linguistic Annotation resources
http://www.ldc.upenn.edu/annotation/
(site:)
Tools and formats for creating and managing linguistic annotations. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases. This page began as a set of links to systems for speech annotation, and the coverage of textual annotation is still inadequate.
(mam:)
Maintained by Steven Bird, Mark Liberman, and the LDC
(02-12-03)
- PropBanking and the ACE Project at LDC
- (LDC ACE (Automatic Content Extraction) project)
http://www.ldc.upenn.edu/Projects/ACE
(mam:)
Participant in the NIST ACE program. This page is just an entry to the Phase 1 and Phase 2 pages.
(site:)
Corpus creation to support Automatic Content Extraction. LDC has recently completed entity and relation annotation to support Phase 2 of ACE. We're currently preparing to move into multilingual (English, Chinese, Arabic) ACE annotation.
(02-10-23)
- PropBank, Automatic Content Extraction (ACE) at the University of Pennsylvania
http://www.cis.upenn.edu/~ace/
(mam:)
Includes useful contacts, references, and tools that we use here. PropBank uses text already annotated by Treebank.
(site:)
The PropBank project is creating a corpus of text annotated with information about basic semantic propositions. Predicate-argument relations are being added to the syntactic trees of the Penn Treebank.
Undertaken as part of NIST's ACE program.
(02-10-09)
- (Phase 1 overview)
http://www.ldc.upenn.edu/Projects/ACE/PHASE1/index.html
(mam:)
Updated: 4/26/2000. Of primarily historical interest.
(02-10-23)
(site:)
Pilot Study Information for LDC Annotators
- Phase 2 resource overview
http://www.ldc.upenn.edu/Projects/ACE/PHASE2/index.html
(site:)
Updated: 8/22/2001. Under construction.
(02-10-23)
- Current annotation resources for the LDC ACE project
Annotated by myl, mostly
Obviously the types of entities and relations that we are after in
this project will be quite different, but many of the problems are
similar, and we will need similarly detailed instructions for
annotators.
- PropBank Taggers' Manual
http://www.cis.upenn.edu/~ace/taggermanual.pdf
(site:)
Annotation Manual for Predicate-Argument Structure Taggers, Version 2.1, 23 January 2002, Paul Kingsbury
(02-10-10)
- Treebanking
123 entries
This page built 10:27 am, 2003-05-08. Last edited 2009-11-11.