Methods Mol Biol. 2018:1755:197-221.

doi: 10.1007/978-1-4939-7724-6_14.

Data Mining and Computational Modeling of High-Throughput Screening Datasets

Sean Ekins¹, Alex M Clark²,³, Krishna Dole², Kellan Gregory², Andrew M Mcnutt², Anna Coulon Spektor², Charlie Weatherall², Nadia K Litterman², Barry A Bunin²

Affiliations

¹ Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA. ekinssean@yahoo.com.
² Collaborative Drug Discovery, Inc., Burlingame, CA, USA.
³ Molecular Materials Informatics, Inc., Montreal, QC, Canada.

PMID: 29671272
PMCID: PMC6181121
DOI: 10.1007/978-1-4939-7724-6_14

Sean Ekins et al. Methods Mol Biol. 2018.

Abstract

We are now seeing the benefit of investments made over the last decade in high-throughput screening (HTS), which are resulting in large structure-activity datasets entering public and open databases such as ChEMBL and PubChem. The growth of academic HTS centers and the increasing move to academia for early-stage drug discovery suggest a great need for informatics tools and methods to mine such data and learn from it. Collaborative Drug Discovery, Inc. (CDD) has developed a number of tools for storing, mining, securely and selectively sharing, and learning from such HTS data. We present a new web-based data mining and visualization module directly within the CDD Vault platform for high-throughput drug discovery data that makes use of a novel technology stack following modern reactive design principles. We also describe CDD Models within the CDD Vault platform, which enables researchers to share models, share predictions from models, and create models from distributed, heterogeneous data. Our system is built on top of the Collaborative Drug Discovery Vault Activity and Registration data repository ecosystem, which allows users to manipulate and visualize thousands of molecules in real time. This can be performed in any browser on any platform. In this chapter we present examples of its use with public datasets in CDD Vault. Such approaches can complement other cheminformatics tools, whether open source or commercial, for data mining and modeling of HTS data.

Keywords: ADME; Bayesian models; CDD models; CDD vault; Collaborative database; Data mining; Visualization.

Figures

**Figure 1.**
A flowchart of the user experience flow of the Visualization application in CDD Vault. DOI: 10.6084/m9.figshare.3206266

**Figure 2.**
A sample plot from the Visualization module in CDD Vault using AstraZeneca public solubility data from ChEMBL for 1763 compounds, showing the relationship with calculated molecular properties. DOI: 10.6084/m9.figshare.3206266

**Figure 3.**
A. Screenshot of the new Visualization capabilities in CDD Vault, showing the Broad Chagas disease dose-response dataset that we used in a recent study to build a Bayesian machine learning model [2]. B. Screenshot showing highlighting of structures and filtering of data (right of screen). DOI: 10.6084/m9.figshare.3206266

**Figure 4.**
A flowchart of the technical structure of the Visualization module in CDD Vault. The backend is formed using Immutable and Crossfilter.js, the data binding layer is constructed using d3.js and jQuery, and finally the rendering layer makes use of d3.js and Pixi.js. DOI: 10.6084/m9.figshare.3206266
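The filtering layer named in this caption can be pictured with a small sketch. The Python below is an illustration only and is not the module's JavaScript code: it mimics the general idea of crossfilter-style filtering, with one active range per dimension applied to records that are never mutated. The `Compound` fields, the `RangeFilters` class, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Compound:
    """Immutable record; the field names are illustrative only."""
    name: str
    mol_weight: float
    log_p: float


class RangeFilters:
    """Holds one (low, high) range per field and returns the rows matching all of them."""

    def __init__(self, rows: Iterable[Compound]) -> None:
        self._rows: Tuple[Compound, ...] = tuple(rows)  # stored once, never modified
        self._ranges: Dict[str, Tuple[float, float]] = {}

    def set_range(self, field: str, low: float, high: float) -> None:
        """Activate or replace the inclusive range filter on one field."""
        self._ranges[field] = (low, high)

    def clear(self, field: str) -> None:
        """Remove the filter on one field, if any."""
        self._ranges.pop(field, None)

    def selected(self) -> List[Compound]:
        """Rows that satisfy every active range filter."""
        return [
            row for row in self._rows
            if all(low <= getattr(row, field) <= high
                   for field, (low, high) in self._ranges.items())
        ]


# Made-up rows, used only to exercise the sketch.
rows = [
    Compound("cmpd-1", 250.0, 1.2),
    Compound("cmpd-2", 410.5, 3.8),
    Compound("cmpd-3", 332.1, 2.5),
]
filters = RangeFilters(rows)
filters.set_range("mol_weight", 200.0, 350.0)
filters.set_range("log_p", 2.0, 5.0)
print([c.name for c in filters.selected()])
```

Keeping the rows immutable means each change to a filter only recomputes the selection, which is one common way to keep linked plots consistent with each other.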

**Figure 5.**
Receiver Operator Characteristic plots for the CDD Bayesian model with FCFP6 descriptors only, after 3-fold cross-validation, for predicting selectivity in kinases using Abbott Laboratories data. A. Training set. B. Test set ROC for 2 different cutoffs using 39 compounds from the Ambit dataset not found in the training set from the Abbott dataset. DOI: 10.6084/m9.figshare.3206266

**Figure 6.**
Receiver Operator Characteristic plots for Discovery Studio Bayesian models for kinase selectivity using Abbott Laboratories data, minus compounds overlapping with the Ambit dataset. Descriptors used: ALogP, FCFP_6, Molecular Weight, Number of Aromatic Rings, Number of H-Bond Acceptors, Number of H-Bond Donors, Number of Rings, Number of Rotatable Bonds, and Molecular Fractional Polar Surface Area. Selectivity values less than 0.3 = active. The Ambit dataset was used as a test set after removal of overlapping compounds. A. Training set. ROC score 0.870 (leave-one-out). Best cutoff for this model is −2.624. B. Test set. ROC = 0.81 (Confusion Matrix: True Positives = 44, False Negatives = 7, False Positives = 6, True Negatives = 11). DOI: 10.6084/m9.figshare.3206266
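The test-set confusion matrix quoted in this caption can be turned into the usual summary statistics. The short Python sketch below applies the standard textbook formulas to the four counts from the caption; the function name `confusion_metrics` is a hypothetical helper, and the derived values are not reported in the source.

```python
from math import sqrt
from typing import Dict


def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard classification metrics from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),   # fraction of actives recovered
        "specificity": tn / (tn + fp),   # fraction of inactives recovered
        "precision": tp / (tp + fp),     # fraction of predicted actives that are active
        "accuracy": (tp + tn) / total,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Counts taken from the Figure 6 caption (test set).
metrics = confusion_metrics(tp=44, fn=7, fp=6, tn=11)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives a sensitivity of about 0.863, a specificity of about 0.647, a precision of 0.880, an accuracy of about 0.809, and an MCC of about 0.500.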

**Figure 7.**
A. Kinase selectivity model: good fingerprints. B. Kinase selectivity model: bad fingerprints. DOI: 10.6084/m9.figshare.3206266

**Figure 8.**
Receiver Operator Characteristic plot for the CDD Bayesian model with FCFP6 descriptors only, after 3-fold cross-validation. Promiscuity of compounds binding to proteins, using ~15,000 compounds with binding data to 100 different proteins. DOI: 10.6084/m9.figshare.3206266
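For readers unfamiliar with how an ROC score and a 3-fold split are obtained, a minimal Python sketch follows. It is not the CDD Models implementation: `roc_auc` computes the area under the ROC curve through its rank interpretation (the probability that a randomly chosen active scores higher than a randomly chosen inactive), and `k_fold_indices` is a hypothetical helper that assigns items to folds by position. The labels and scores in the example are made up.

```python
from typing import Iterator, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of an active > score of an inactive); ties count as 0.5."""
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


def k_fold_indices(n_items: int, k: int = 3) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; item i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


# Made-up labels (1 = active) and model scores, used only to exercise the sketch.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
print([test for _, test in k_fold_indices(6, k=3)])
```

The pairwise loop is quadratic in the number of compounds, which is acceptable for a sketch; a sort-based rank computation would be the usual choice for datasets of this size.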

**Figure 9.**
Receiver Operator Characteristic plot for Discovery Studio Model of promiscuity of compounds binding to proteins using ~15,000 compounds with binding data to 100 different proteins. The following descriptors were used: ALogP, FCFP_6, Molecular Weight, Number of Aromatic Rings, Number of H-Bond Acceptors, Number of H-Bond Donors, Number of Rings, Number of Rotatable Bonds, and Molecular Fractional Polar Surface Area. The cutoff for this model was 0.05. ROC score is 0.784 (leave-one-out). Best cutoff for this model is −0.560. DOI: 10.6084/m9.figshare.3206266

**Figure 10.**
A. ~15,000 compounds with binding data to 100 different proteins: good fingerprints. B. ~15,000 compounds with binding data to 100 different proteins: bad fingerprints. DOI: 10.6084/m9.figshare.3206266

**Figure 11.**
Examples of Collaborative Drug Discovery Vault used in large public-private collaborations. DOI: 10.6084/m9.figshare.3206266

References

1. Macarron R; Banks MN; Bojanic D; Burns DJ; Cirovic DA; Garyantes T; Green DV; Hertzberg RP; Janzen WP; Paslay JW; Schopfer U; Sittampalam GS, Impact of High-Throughput Screening in Biomedical Research. Nat Rev Drug Discov 2011, 10, 188–195.
2. Ekins S; Waller CL; Bradley MP; Clark AM; Williams AJ, Four Disruptive Strategies for Removing Drug Discovery Bottlenecks. Drug Discov Today 2013, 18, 265–271.
3. Oprea TI; Bologa CG; Boyer S; Curpan RF; Glen RC; Hopkins AL; Lipinski CA; Marshall GR; Martin YC; Ostopovici-Halip L; Rishton G; Ursu O; Vaz RJ; Waller C; Waldmann H; Sklar LA, A Crowdsourcing Evaluation of the NIH Chemical Probes. Nat Chem Biol 2009, 5, 441–447.
4. Roy A; McDonald PR; Sittampalam S; Chaguturu R, Open Access High Throughput Drug Discovery in the Public Domain: A Mount Everest in the Making. Curr Pharm Biotechnol 2010, 11, 764–778.
5. Kaiser J, National Institutes of Health. Drug-Screening Program Looking for a Home. Science 2011, 334, 299.

Grants and funding

R44 TR000942/TR/NCATS NIH HHS/United States
