Methods Mol Biol. 2018;1755:197-221.
doi: 10.1007/978-1-4939-7724-6_14.

Data Mining and Computational Modeling of High-Throughput Screening Datasets

Sean Ekins et al. Methods Mol Biol. 2018.

Abstract

We are now seeing the benefit of investments made over the last decade in high-throughput screening (HTS), which are resulting in large structure-activity datasets entering public and open databases such as ChEMBL and PubChem. The growth of academic HTS centers and the increasing move to academia for early-stage drug discovery suggest a great need for informatics tools and methods to mine such data and learn from it. Collaborative Drug Discovery, Inc. (CDD) has developed a number of tools for storing, mining, securely and selectively sharing, and learning from such HTS data. We present a new web-based data mining and visualization module directly within the CDD Vault platform for high-throughput drug discovery data that makes use of a novel technology stack following modern reactive design principles. We also describe CDD Models within the CDD Vault platform, which enables researchers to share models, share predictions from models, and create models from distributed, heterogeneous data. Our system is built on top of the Collaborative Drug Discovery Vault Activity and Registration data repository ecosystem, which allows users to manipulate and visualize thousands of molecules in real time. This can be performed in any browser on any platform. In this chapter we present examples of its use with public datasets in CDD Vault. Such approaches can complement other cheminformatics tools, whether open source or commercial, in providing approaches for data mining and modeling of HTS data.

Keywords: ADME; Bayesian models; CDD models; CDD vault; Collaborative database; Data mining; Visualization.

Figures

Figure 1.
A flowchart of the user experience flow of the Visualization application in CDD Vault. DOI: 10.6084/m9.figshare.3206266
Figure 2.
A sample plot from the Visualization module in CDD Vault using AstraZeneca public solubility data from ChEMBL on 1763 compounds, showing the relationship with calculated molecular properties. DOI: 10.6084/m9.figshare.3206266
Figure 3.
A. Screenshot of the new Visualization capabilities in CDD Vault, showing the Broad Chagas disease dose-response dataset that we used in a recent study to build a Bayesian machine learning model [2]. B. Screenshot showing highlighting of structures and filtering of data (right of screen). DOI: 10.6084/m9.figshare.3206266
Figure 4.
A flowchart of the technical structure of the Visualization module in CDD Vault. The backend is formed using Immutable and Crossfilter.js, the data binding layer is constructed using d3.js and jQuery, and the rendering layer makes use of d3.js and Pixi.js. DOI: 10.6084/m9.figshare.3206266
Figure 5.
Receiver Operator Characteristic plots for the CDD Bayesian model with FCFP6 descriptors only, after 3-fold cross validation, for predicting selectivity in kinases using Abbott Laboratories data. A. Training set. B. Test set ROC for 2 different cutoffs, using 39 compounds from the Ambit dataset that are not found in the Abbott training set. DOI: 10.6084/m9.figshare.3206266
Figure 6.
Receiver Operator Characteristic plots for Discovery Studio Bayesian models for kinase selectivity using Abbott Laboratories data, minus compounds overlapping with the Ambit dataset. Descriptors used: ALogP, FCFP_6, Molecular Weight, Number of Aromatic Rings, Number of H-Bond Acceptors, Number of H-Bond Donors, Number of Rings, Number of Rotatable Bonds, and Molecular Fractional Polar Surface Area. Selectivity values less than 0.3 = active. The Ambit dataset was used as a test set after removal of overlapping compounds. A. Training set. ROC score 0.870 (leave-one-out). Best cutoff for this model is −2.624. B. Test set. ROC = 0.81 (confusion matrix: True Positives = 44, False Negatives = 7, False Positives = 6, True Negatives = 11). DOI: 10.6084/m9.figshare.3206266
Figure 7.
A. Kinase selectivity model: good fingerprints. B. Kinase selectivity model: bad fingerprints. DOI: 10.6084/m9.figshare.3206266
Figure 8.
Receiver Operator Characteristic plot for the CDD Bayesian model with FCFP6 descriptors only, after 3-fold cross validation. Promiscuity of compounds binding to proteins, using ~15,000 compounds with binding data to 100 different proteins. DOI: 10.6084/m9.figshare.3206266
Figure 9.
Receiver Operator Characteristic plot for the Discovery Studio model of promiscuity of compounds binding to proteins, using ~15,000 compounds with binding data to 100 different proteins. The following descriptors were used: ALogP, FCFP_6, Molecular Weight, Number of Aromatic Rings, Number of H-Bond Acceptors, Number of H-Bond Donors, Number of Rings, Number of Rotatable Bonds, and Molecular Fractional Polar Surface Area. The cutoff for this model was 0.05. ROC score is 0.784 (leave-one-out). Best cutoff for this model is −0.560. DOI: 10.6084/m9.figshare.3206266
Figure 10.
A. ~15,000 compounds with binding data to 100 different proteins: good fingerprints. B. ~15,000 compounds with binding data to 100 different proteins: bad fingerprints. DOI: 10.6084/m9.figshare.3206266
Figure 11.
Examples of Collaborative Drug Discovery Vault used in large public-private collaborations. DOI: 10.6084/m9.figshare.3206266
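
The Figure 6 caption reports a test-set confusion matrix (True Positives = 44, False Negatives = 7, False Positives = 6, True Negatives = 11). As a minimal sketch of how such counts are usually summarized, the snippet below derives sensitivity, specificity, precision, and accuracy from them. The function name `confusion_metrics` is hypothetical, and the derived values are simple arithmetic on the caption's counts, not figures reported by the authors.

```python
def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Derive standard classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true actives recovered
        "specificity": tn / (tn + fp),   # fraction of true inactives recovered
        "precision": tp / (tp + fp),     # fraction of predicted actives that are active
        "accuracy": (tp + tn) / total,   # fraction of all compounds classified correctly
    }


# Counts taken from the Figure 6B caption (Ambit test set).
metrics = confusion_metrics(tp=44, fn=7, fp=6, tn=11)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the test set contains 68 compounds, of which 51 are active, so accuracy alone would overstate performance on the smaller inactive class; reading sensitivity and specificity side by side gives a more balanced view.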

References

    1. Macarron R; Banks MN; Bojanic D; Burns DJ; Cirovic DA; Garyantes T; Green DV; Hertzberg RP; Janzen WP; Paslay JW; Schopfer U; Sittampalam GS, Impact of High-Throughput Screening in Biomedical Research. Nat Rev Drug Discov 2011, 10, 188–195.
    2. Ekins S; Waller CL; Bradley MP; Clark AM; Williams AJ, Four Disruptive Strategies for Removing Drug Discovery Bottlenecks. Drug Discov Today 2013, 18, 265–271.
    3. Oprea TI; Bologa CG; Boyer S; Curpan RF; Glen RC; Hopkins AL; Lipinski CA; Marshall GR; Martin YC; Ostopovici-Halip L; Rishton G; Ursu O; Vaz RJ; Waller C; Waldmann H; Sklar LA, A Crowdsourcing Evaluation of the NIH Chemical Probes. Nat Chem Biol 2009, 5, 441–447.
    4. Roy A; McDonald PR; Sittampalam S; Chaguturu R, Open Access High Throughput Drug Discovery in the Public Domain: A Mount Everest in the Making. Curr Pharm Biotechnol 2010, 11, 764–778.
    5. Kaiser J, National Institutes of Health. Drug-Screening Program Looking for a Home. Science 2011, 334, 299.
