Cheminformatics, sometimes spelt chemoinformatics, encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information (4). It brings together a number of well-established computational areas that have existed separately for many years. Cheminformatics can be roughly divided into two areas: data creation and data analysis.
Data creation covers the creation (or compound/library registration), storage, and display of chemical information. The technology has a long and well-recorded pedigree and will only be mentioned briefly. For small-molecule database creation and storage, the most commonly used systems are from MDL (5), Daylight (6), Accelrys (7), and Tripos (8). Each has its own proprietary way of storing chemical structures, their activities and properties, and other related information, using relational database (Oracle) technology. The interfaces for viewing and displaying molecular structures in these four systems are either supplied with the commercial system (sometimes in modified form) or designed internally by pharmaceutical/biotech companies to suit internal business practices.
Data analysis covers a broad spectrum of computational chemistry techniques (4), such as chemometrics, statistical methods for molecular similarity or diversity analysis, and QSAR. Typically, chemical structures are represented by a connection table or an equivalent linear notation, such as SMILES strings (9) or SLN (Sybyl Line Notation) (10). Structural keys or fingerprints, which describe the presence or absence of substructures or fragments, are used in 2D/3D structural searches, the most commonly used being MDL's 2D structure MACCS
ANNUAL REPORTS IN MEDICINAL CHEMISTRY—38 ISSN: 0065-7743
© 2003 Elsevier Inc All rights reserved.
keys, Daylight, and Unity fingerprints (7,11,12). Chemical descriptors based on chemical, physical, and topological properties are also commonly used (13). For 3D descriptors, the most commonly used are 3- or 4-point pharmacophores (14), derived either from a set of ligands or from the binding sites of protein structures. These structures might be determined experimentally by X-ray crystallography or nuclear magnetic resonance (NMR), or obtained by homology modeling. There are numerous compound-clustering techniques, which group compounds by molecular similarity (15). Similarity is most commonly measured using the Tanimoto coefficient, although there are other measures (16,17). Molecular search (or compound selection) techniques employ substructure, 2D similarity or 3D pharmacophore searches. For 3D searching when there is no target structural information, 3D QSAR, based on statistical analyses of the molecular fields or pharmacophore features of aligned molecules with similar biological activity, can sometimes identify the 3D interactions of each molecule that are important for ligand binding.
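To make the similarity measure concrete, the following minimal sketch computes the Tanimoto coefficient for two binary fingerprints, each represented simply as the set of bit positions that are switched on. The bit positions are invented for illustration and do not correspond to real MACCS, Daylight, or Unity keys.

```python
from typing import FrozenSet


def tanimoto(fp_a: FrozenSet[int], fp_b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits / (on-bits in A + on-bits in B - shared)."""
    if not fp_a and not fp_b:
        # The coefficient is undefined for two empty fingerprints;
        # returning 0.0 is a convention assumed for this sketch.
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints (illustrative bit positions only).
query = frozenset({1, 4, 7, 9})
candidate = frozenset({1, 4, 8, 9, 12})

print(tanimoto(query, candidate))  # 3 shared bits, 6 bits in the union -> 0.5
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits; a similarity search ranks database compounds by this value against the query.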
Cheminformatics plays an important role in both lead generation and optimization. For example, if the structure of the substrate or an existing ligand is known, 2D/3D similarity searches can be conducted using the template structure to select compounds for testing. Diversity analysis and molecular clustering can be employed to select a representative subset of compounds for biological screening, especially high-throughput screening (HTS), from a large set of compounds, either from a commercial source or from a corporate database. Library design, QSAR, computational chemistry, molecular modeling and diversity analysis all contribute to the design of focused and diverse libraries, split/mix combinatorial chemistry (combi-chem), parallel synthesis, and further virtual screening.
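One simple way to pick a representative subset for screening is a greedy "MaxMin" selection, which repeatedly adds the compound that is least similar to everything already chosen. The text does not name a specific algorithm, so MaxMin is used here only as an illustrative choice; the fingerprints and the starting compound are likewise assumptions made for the sketch.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(fp_a: FrozenSet[int], fp_b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention assumed for this sketch
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def maxmin_pick(fps: Sequence[FrozenSet[int]], n_pick: int, seed_index: int = 0) -> List[int]:
    """Greedily select indices of mutually dissimilar compounds (MaxMin)."""
    picked = [seed_index]
    # Distance (1 - similarity) from every compound to its nearest picked compound.
    min_dist = [1.0 - tanimoto(fps[seed_index], fp) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picked]
        # Choose the candidate farthest from the current selection.
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fps[nxt], fp))
    return picked


# Hypothetical fingerprints forming two loose structural families.
library = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({7, 8, 10}),
    frozenset({1, 2, 3, 4}),
]

print(maxmin_pick(library, 2))  # one compound from each family
```

This greedy scheme needs a full pass over the library for every pick, so for very large collections it is often combined with clustering or applied to a pre-filtered set.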
Figure 1. Number of New Molecular Entities (NME) reported from the FDA for the years 1990-2002.
However, in the past five years, a major emphasis of cheminformatics has been in the area of in silico ADMET (drug Absorption, Distribution, Metabolism, Excretion, and Toxicity). Figure 1 plots the number of new molecular entities (NME) reported by the FDA for the years 1990-2002. The rate of NME introduction jumped in 1996 but has steadily tapered since, which suggests that producing drugs with sufficient therapeutic efficacy, selectivity and ADMET properties to satisfy the regulatory approval framework is becoming more difficult.
Many drug discovery programs fail because the drug candidate fails in human clinical trials, resulting in a great deal of lost time, expenditure and effort. In an analysis conducted here at Inpharmatica on the IDdb3 database (18), of the approximately 25,000 drug candidate records for the past 10 years, 42% are now inactive, and 60% of these inactive compounds failed before entering phase I. A study by Edwards reported a similar observation: data collected from 9 leading pharmaceutical and biotechnology companies during 2001 showed an attrition rate of 57% from hit compounds to preclinical candidates (19).
With increased pressure for better products faster, the concept of "Fast in man" is being actively pursued by the industry, and the ADMET properties of a compound have become important parameters when designing and optimizing lead-like compounds, especially in combi-chem and focused library design. The quest for the most potent receptor binding, formerly the primary goal of medicinal chemists, is no longer the only parameter considered. Because ADMET studies are difficult, this strategy will require a shift from a 'screening'-based approach to a knowledge-based compound selection and modification paradigm. In addition, in vitro high-throughput assays are increasingly employed to approximate the ADMET characteristics of potential leads at earlier stages of development.
Molecular physicochemical properties such as molecular weight, the number of H-bond donors and acceptors, logP, the number of heteroatoms and rotatable bonds, PSA (polar surface area), and the presence of toxic or reactive fragments can all affect ADMET properties. The best-known analysis, the 'rule of 5', was performed by Lipinski and his colleagues at Pfizer (20). Their data mining approach showed that well-absorbed oral drugs typically have a MW of no more than 500, no more than 5 hydrogen bond donors, no more than 10 hydrogen bond acceptors (approximated by the heteroatom count), and a calculated logP of no more than 5. Additional findings have emerged recently that add new rules, such as fewer than 10 rotatable bonds and a ring count of less than 5 (21, 22).
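As an illustration, the rule of 5 can be expressed as a small check over precomputed descriptors. This is a sketch only: the `Descriptors` class and the example values are hypothetical, descriptor calculation itself is assumed to happen elsewhere, and treating a value exactly at a cut-off as acceptable is an assumption about how the boundary is handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Precomputed properties for one compound (hypothetical container)."""
    mol_weight: float
    h_bond_donors: int
    h_bond_acceptors: int
    clogp: float


def rule_of_5_violations(d: Descriptors) -> List[str]:
    """Return the rule-of-5 criteria that the compound exceeds."""
    violations = []
    if d.mol_weight > 500:
        violations.append("MW > 500")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    if d.clogp > 5:
        violations.append("calculated logP > 5")
    return violations


# Invented descriptor values, not taken from any real compound.
small_polar = Descriptors(mol_weight=350.4, h_bond_donors=2, h_bond_acceptors=5, clogp=3.1)
large_greasy = Descriptors(mol_weight=612.7, h_bond_donors=6, h_bond_acceptors=11, clogp=5.8)

print(rule_of_5_violations(small_polar))   # no criteria exceeded
print(rule_of_5_violations(large_greasy))  # all four criteria exceeded
```

Returning the list of violated criteria, rather than a single pass/fail flag, lets a downstream filter decide how many violations to tolerate.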
Many computational methods are used for ADMET prediction. The most common are basic statistics (which is how the rule of 5 was derived), SAR, and machine learning approaches such as genetic algorithms and neural networks.
All these molecular parameters are used in analyzing compound data and building models to predict ADMET properties (23). Examples include absorption in Caco-2 (24) (a human intestinal epithelial cell line derived from a colorectal carcinoma) and MDCK (25) (Madin-Darby canine kidney) cell monolayers; susceptibility to metabolic degradation using liver microsomes or hepatocytes (26); and logBB for the blood-brain barrier, whose purpose is to maintain the homeostasis of the CNS (central nervous system) by separating the brain from the systemic blood circulation (27). The data for analysis usually come from large commercial databases, such as ACD (5) for non-drug-like compounds, and MDDR (5) and WDI (28) for drug-like compounds.
Most pharmaceutical companies have put such concepts and rules into practice, most commonly as filters for library design, warning flags during compound registration, and in compound selection for high-throughput or virtual screening. These filters include the physical and chemical parameters mentioned above (for example, filtering out compounds with MW below 100 or above 700). In addition, toxic or reactive groups or fragments can also be included in the filters. Building such filters sometimes relies entirely on expert knowledge provided by medicinal chemists.
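A registration-time warning flag of the kind described above might look like the following sketch. The MW window of 100 to 700 is the example quoted in the text; the `Compound` class, the fragment names, and the idea that fragments are already annotated on each record are assumptions for illustration (a real system would detect fragments by substructure search), and the flagged list is not taken from any published filter set.

```python
from dataclasses import dataclass, field
from typing import List, Set

MW_MIN, MW_MAX = 100.0, 700.0  # example cut-offs quoted in the text
# Illustrative list only; real lists come from medicinal chemists' expertise.
FLAGGED_FRAGMENTS = {"acyl halide", "aldehyde"}


@dataclass
class Compound:
    """Hypothetical registration record with pre-annotated fragments."""
    name: str
    mol_weight: float
    fragments: Set[str] = field(default_factory=set)


def warning_flags(c: Compound) -> List[str]:
    """Collect warnings for a compound; an empty list means it passes."""
    flags = []
    if c.mol_weight < MW_MIN:
        flags.append("MW below 100")
    elif c.mol_weight > MW_MAX:
        flags.append("MW above 700")
    # Sort so the output order is deterministic.
    for frag in sorted(c.fragments & FLAGGED_FRAGMENTS):
        flags.append("flagged fragment: " + frag)
    return flags


# Invented compounds for demonstration.
clean = Compound("cmpd-1", 342.0, {"amide", "pyridine"})
risky = Compound("cmpd-2", 812.5, {"aldehyde", "ester"})

for cmpd in (clean, risky):
    print(cmpd.name, warning_flags(cmpd))
```

Keeping the cut-offs and the fragment list as module-level constants makes them easy for chemists to review and revise without touching the filtering logic.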
Although this kind of expert knowledge is not widely publicized, some companies have published portions of their findings (29,30).
There is an urgent need to analyze the large amount of chemical data available in order to generate new knowledge and aid decision-making. With advances in computing and related technologies, such as data storage, the web, and programming, it has become easier to integrate all this information under the umbrella of cheminformatics. Successful integration will speed up R&D relative to the traditional drug discovery process.
The goal of cheminformatics is to generate knowledge through data integration. The ultimate goal would be to identify a development compound that inhibits the target protein and has an ADMET profile that makes it safe for use in humans. Currently, a more realistic goal is to identify which compounds are 'lead-like', what kind of target libraries a chemist should make, and so on.
The pharmaceutical, biotechnology and related industries have been quite successful in achieving these goals. Many companies are now taking the further step of integrating bioinformatics, proteomics, target discovery, target validation, and chemistry, which, together with cheminformatics, form the basis of chemogenomics.