J Cheminform. 2020 Apr 19;12(1):26.
doi: 10.1186/s13321-020-00428-5.

Industry-scale application and evaluation of deep learning for drug target prediction

Noé Sturm et al. J Cheminform.

Abstract

Artificial intelligence (AI) is undergoing a revolution thanks to breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent work on publicly available pharmaceutical data showed that AI methods are highly promising for drug target prediction. However, the quality of public data might differ from that of industry data because of different labs reporting measurements, different measurement techniques, fewer samples, and less diverse and specialized assays. As part of a European-funded project (ExCAPE) that brought together expertise from the pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning models outperformed comparable models trained with other machine learning algorithms when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study that evaluates the potential of machine learning, and especially deep learning, directly in industry-scale settings and that investigates the transferability of publicly learned target prediction models to industrial bioactivity prediction pipelines.

Keywords: Big data; ChEMBL; Cheminformatics; Deep learning; Machine learning; Prospective evaluation; PubChem; QSAR; Retrospective evaluation; Structure-based virtual screening.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Compound distributions across the targets for the AstraZeneca and Janssen datasets. In the lower panel, the y-axis shows the number of compounds per target; targets on the x-axis are sorted by their number of compounds. The horizontal dashed line marks the maximum number of compounds per target observed in the datasets. In the upper panel, each point represents the activity ratio of a target, with targets sorted as in the lower panel. The curve in the upper panel is a smoothed average.
Fig. 2
Prospective and retrospective model evaluation with three folds (A, B, C). White and colored circles represent clusters of compounds; circle size indicates the cluster size (number of compounds in the cluster). Colors indicate the fold to which a cluster is assigned; white circles mark folds that are not used for building or evaluating a particular model. In Stage 1, the inner loop, one of the three folds serves as the training set, one serves as the test set, and the third is kept aside as the test set for Stage 2a, the outer loop. The respective inner folds used in Stage 1 are merged into training sets for Stage 2a, the retrospective model testing stage. All folds are merged into the training set for obtaining full-scale models in Stage 2b, the prospective model testing stage. Stage 1 is used for hyperparameter selection in both Stage 2a and Stage 2b. For retrospective model testing (Stage 2a), the two respective performance values (Perf X.Y) are averaged in each outer-loop iteration, and the hyperparameter setting with the best ROC-AUC value is used for training models in Stage 2a, which finally gives the performance values (Perf X) for retrospective model testing. For prospective model testing (Stage 2b), all six performance values (Perf X.Y) of the inner loop are averaged for hyperparameter selection. A final model trained on all data is then evaluated on the AstraZeneca and Janssen industrial datasets.
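The fold bookkeeping described in the Fig. 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the hyperparameter labels, and the `toy_score` callback (a stand-in for "train a model and return its ROC-AUC") are all hypothetical, and real folds would hold clustered compounds rather than letters.

```python
from itertools import permutations
from statistics import mean

FOLDS = ("A", "B", "C")

def inner_loop(hyperparams, score):
    """Stage 1: train on one fold, test on a second, keep the third aside.

    Returns {(held_out, train, test): {hp: score}} for all six arrangements,
    i.e. the Perf X.Y values, one set per hyperparameter setting.
    """
    results = {}
    for held_out, train, test in permutations(FOLDS, 3):
        results[(held_out, train, test)] = {
            hp: score(hp, (train,), test) for hp in hyperparams
        }
    return results

def select_retrospective(results, hyperparams):
    """Stage 2a selection: per outer fold, average its two inner scores."""
    best = {}
    for outer in FOLDS:
        runs = [r for (held, _, _), r in results.items() if held == outer]
        best[outer] = max(hyperparams, key=lambda hp: mean(r[hp] for r in runs))
    return best

def select_prospective(results, hyperparams):
    """Stage 2b selection: average all six inner scores."""
    return max(hyperparams, key=lambda hp: mean(r[hp] for r in results.values()))

def retrospective_scores(best, score):
    """Stage 2a: train on the two merged inner folds, test on the outer fold."""
    perf = {}
    for outer in FOLDS:
        train = tuple(f for f in FOLDS if f != outer)
        perf[outer] = score(best[outer], train, outer)
    return perf

# Hypothetical stand-in for "fit a model and return ROC-AUC on the test fold".
def toy_score(hp, train_folds, test_fold):
    base = {"small": 0.70, "medium": 0.80, "large": 0.75}[hp]
    shift = {"A": 0.00, "B": 0.02, "C": -0.02}[test_fold]
    return base + 0.01 * len(train_folds) + shift

HYPERPARAMS = ("small", "medium", "large")
inner = inner_loop(HYPERPARAMS, toy_score)
best_2a = select_retrospective(inner, HYPERPARAMS)
best_2b = select_prospective(inner, HYPERPARAMS)
perf_2a = retrospective_scores(best_2a, toy_score)
print(best_2a, best_2b, perf_2a)
```

In Stage 2b the selected setting would then be used to train one model on folds A, B and C together before evaluation on the external industrial data; that final step is omitted here because it needs a real model and dataset.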
Fig. 3
ROC-AUC, Kappa and F1-score performances of DNN, XGB and MF models on the ExCAPE-ML dataset. Violin plots illustrate the distribution of individual target performances; boxplots represent the interquartile range, with the median value in transparent and the average as the horizontal black segment.
Fig. 4
ROC-AUC, Kappa and F1-score performances of DNN, XGB and MF models on the AstraZeneca and Janssen datasets. Violin plots illustrate the distribution of individual target performances; boxplots represent the interquartile range, with the median value in transparent and the average as the horizontal black segment.
Fig. 5
Target family breakdown for ExCAPE-ML, AstraZeneca and Janssen predictions. The numbers on the horizontal axis represent the number of targets corresponding to the target family and dataset. The vertical axis represents the AUC value.
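Figs. 3 to 5 report ROC-AUC, Kappa and F1-score per target. As a rough illustration of how these three measures are computed for a single target, the sketch below uses plain Python. The 0.5 decision threshold for Kappa and F1, the function names, and the example labels and scores are assumptions made for illustration; the page does not state how predictions were thresholded.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the active class."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cohen_kappa(y_true, y_pred):
    """Observed agreement corrected for the agreement expected by chance."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0

def evaluate_target(y_true, y_score, threshold=0.5):
    """Compute all three metrics; the threshold is an illustrative assumption."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    return {
        "roc_auc": roc_auc(y_true, y_score),
        "kappa": cohen_kappa(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Made-up labels (1 = active) and predicted scores for one hypothetical target.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
metrics = evaluate_target(labels, scores)
print(metrics)
```

ROC-AUC depends only on the ranking of the scores, whereas Kappa and F1 depend on the chosen threshold, which is one reason the three measures can disagree for the same target.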
