Industry-scale application and evaluation of deep learning for drug target prediction

Noé Sturm¹, Andreas Mayr², Thanh Le Van³, Vladimir Chupakhin⁴, Hugo Ceulemans³, Joerg Wegner³, Jose-Felipe Golib-Dzib⁵, Nina Jeliazkova⁶, Yves Vandriessche⁷, Stanislav Böhm⁸, Vojtech Cima⁸, Jan Martinovic⁸, Nigel Greene⁹, Tom Vander Aa¹⁰, Thomas J Ashby¹⁰, Sepp Hochreiter², Ola Engkvist¹¹, Günter Klambauer¹², Hongming Chen¹³

Affiliations

¹ Clinical Pharmacology and Safety Science, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden. noesturm@gmail.com.
² LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria.
³ High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Pharmaceutica, Turnhoutseweg 30, 2349, Beerse, Belgium.
⁴ High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen R&D, 1400 McKean Rd, Spring House, Pennsylvania, 19002, USA.
⁵ High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Cilag SA, Calle Río Jarama, 75A, 45007, Toledo, Spain.
⁶ Ideaconsult Ltd., 4. Angel Kanchev Str., 1000, Sofia, Bulgaria.
⁷ Intel Corporation, Data Center Group, Veldkant 31, 2550, Kontich, Belgium.
⁸ IT4Innovations, VSB - Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic.
⁹ Clinical Pharmacology and Safety Science, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden.
¹⁰ Exascience Lab, Imec, Kapeldreef 75, 3001, Louvain, Belgium.
¹¹ Hit Discovery, Discovery Sciences, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden.
¹² LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria. klambauer@ml.jku.at.
¹³ Hit Discovery, Discovery Sciences, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden. Hongming.Chen71@hotmail.com.

PMID: 33430964
PMCID: PMC7169028
DOI: 10.1186/s13321-020-00428-5

Noé Sturm et al. J Cheminform. 2020 Apr 19;12(1):26. doi: 10.1186/s13321-020-00428-5.

Abstract

Artificial intelligence (AI) is undergoing a revolution thanks to breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent work on publicly available pharmaceutical data showed that AI methods are highly promising for drug target prediction. However, the quality of public data may differ from that of industry data because of different labs reporting measurements, different measurement techniques, fewer samples, and less diverse and specialized assays. As part of a European-funded project (ExCAPE) that brought together expertise from the pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning models outperformed comparable models trained with other machine learning algorithms when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study that evaluates the potential of machine learning, and especially deep learning, directly in industry-scale settings and also investigates the transferability of publicly learned target prediction models to industrial bioactivity prediction pipelines.

Keywords: Big data; ChEMBL; Cheminformatics; Deep learning; Machine learning; Prospective evaluation; PubChem; QSAR; Retrospective evaluation; Structure-based virtual screening.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Compound distributions across the targets for the AstraZeneca and Janssen datasets. In the lower panel, the y-axis shows the number of compounds for the targets on the x-axis, where the targets are sorted by number of compounds. The horizontal dashed line represents the maximum number of compounds per target observed in the datasets. In the upper panel, each point represents the activity ratio of a target; targets are sorted the same way as in the lower panel. The curve in the upper panel is a smoothed average.

**Fig. 2**
Prospective and retrospective model evaluation with three folds (A, B, C). White and colored circles represent clusters of compounds, and circle size indicates cluster size (number of compounds in the cluster). Colors indicate the fold to which a cluster is assigned; white circles indicate folds that are not used for building or evaluating a particular model. In Stage 1, the inner loop, one of the three folds serves as the training set, one serves as a test set, and the third is kept aside as a test set for Stage 2a, the outer loop. The respective inner folds used in Stage 1 are merged into training sets for Stage 2a, the retrospective model testing stage. All folds are merged into the training set for obtaining full-scale models in Stage 2b, the prospective model testing stage. Stage 1 is used for hyperparameter selection for both Stage 2a and Stage 2b. For retrospective model testing (Stage 2a), the two respective performance values (Perf X.Y) are averaged in each outer-loop iteration, and the hyperparameter setting with the best ROC-AUC value is used for training models in Stage 2a, which finally gives performance values (Perf X) for retrospective model testing. For prospective model testing (Stage 2b), all six performance values (Perf X.Y) of the inner loop are averaged for hyperparameter selection. A final model trained on all data is then evaluated on the AstraZeneca and Janssen industrial datasets.
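
The fold bookkeeping described in the Fig. 2 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `inner_runs` and `select_setting`, the layout of the `scores` dictionary, and the setting labels are all hypothetical, and model training itself is left out.

```python
from itertools import permutations
from statistics import mean

FOLDS = ("A", "B", "C")  # the three cluster-based folds named in the caption


def inner_runs(folds=FOLDS):
    """Stage 1: list (train, test, held_out) triples.

    For each held-out fold, the two remaining folds swap the roles of
    training and test set, giving six inner runs for three folds.
    """
    runs = []
    for held_out in folds:
        remaining = [f for f in folds if f != held_out]
        for train, test in permutations(remaining, 2):
            runs.append((train, test, held_out))
    return runs


def select_setting(scores, held_out=None):
    """Return the hyperparameter setting with the best mean inner-loop ROC-AUC.

    `scores` maps (setting, train, test, held_out) -> ROC-AUC (hypothetical layout).
    With `held_out` given, only the two runs belonging to that outer fold
    are averaged (Stage 2a). With `held_out=None`, all six runs are
    averaged (Stage 2b).
    """
    per_setting = {}
    for (setting, _train, _test, outer), auc in scores.items():
        if held_out is None or outer == held_out:
            per_setting.setdefault(setting, []).append(auc)
    return max(per_setting, key=lambda s: mean(per_setting[s]))
```

Calling `select_setting(scores, held_out="A")` would pick the setting used to train on folds B and C and test on fold A, while `select_setting(scores)` would pick the setting for the full-scale model trained on all three folds.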

**Fig. 3**
ROC-AUC, Kappa and F1-score performances of DNN, XGB and MF models on the ExCAPE-ML dataset. Violin plots illustrate the distribution of individual target performances, boxplots represent the interquartile range, with median value in transparent and average as the horizontal black segment
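
For readers unfamiliar with the three metrics named in the Fig. 3 caption, the following is a minimal pure-Python sketch of how each can be computed for a single target with binary activity labels. It uses the standard textbook definitions and is not taken from the study's code; the function names are illustrative, and the threshold needed to turn scores into the binary predictions used by Kappa and F1 is left to the caller, since the caption does not state one.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels and binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _tn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0
```

ROC-AUC is computed from the ranking of scores alone, whereas Kappa and F1 depend on the chosen decision threshold, which is one reason the three metrics can disagree for the same model.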

**Fig. 4**
ROC-AUC, Kappa and F1-score performances of DNN, XGB and MF models on AstraZeneca and Janssen datasets. Violin plots illustrate the distribution of individual target performances, boxplots represent the interquartile range, with median value in transparent and average as the horizontal black segment

**Fig. 5**
Target family breakdown for ExCAPE-ML, AstraZeneca and Janssen predictions. The numbers on the horizontal axis represent the number of targets corresponding to the target family and dataset. The vertical axis represents the AUC value
