Nat Mach Intell. 2023 Dec;5(12):1390-1401.
doi: 10.1038/s42256-023-00751-0. Epub 2023 Nov 6.

Calibrated geometric deep learning improves kinase-drug binding predictions


Yunan Luo et al. Nat Mach Intell. 2023 Dec.

Abstract

Protein kinases regulate various cellular functions and hold significant pharmacological promise in cancer and other diseases. Although kinase inhibitors are one of the largest groups of approved drugs, much of the human kinome remains unexplored but potentially druggable. Computational approaches, such as machine learning, offer efficient solutions for exploring kinase-compound interactions and uncovering novel binding activities. Despite the increasing availability of three-dimensional (3D) protein and compound structures, existing methods predominantly focus on exploiting local features from one-dimensional protein sequences and two-dimensional molecular graphs to predict binding affinities, overlooking the 3D nature of the binding process. Here we present KDBNet, a deep learning algorithm that incorporates 3D protein and molecule structure data to predict binding affinities. KDBNet uses graph neural networks to learn structure representations of protein binding pockets and drug molecules, capturing the geometric and spatial characteristics of binding activity. In addition, we introduce an algorithm to quantify and calibrate the uncertainties of KDBNet's predictions, enhancing its utility in model-guided discovery in chemical or protein space. Experiments demonstrated that KDBNet outperforms existing deep learning models in predicting kinase-drug binding affinities. The uncertainties estimated by KDBNet are informative and well-calibrated with respect to prediction errors. When integrated with a Bayesian optimization framework, KDBNet enables data-efficient active learning and accelerates the exploration and exploitation of diverse high-binding kinase-drug pairs.
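The abstract's central technique, learning structure representations with graph neural networks and pooling them into a graph-level embedding, can be illustrated with a minimal message-passing layer. The sketch below is a generic NumPy illustration under simplified assumptions (mean aggregation, a residual linear map, sum pooling); it is not KDBNet's actual architecture, which uses geometric and 3D-equivariant GNNs.

```python
import numpy as np

def gnn_layer(node_feats, adj, weight):
    """One message-passing layer: each node aggregates its neighbors'
    features (mean over neighbors), adds them to its own features, and
    applies a linear map followed by a ReLU nonlinearity."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # avoid divide-by-zero
    messages = adj @ node_feats / deg                 # mean neighbor features
    return np.maximum(0.0, (node_feats + messages) @ weight)

def readout(node_feats):
    """Graph-level embedding by sum pooling over all node features."""
    return node_feats.sum(axis=0)
```

In a binding-affinity model, one such embedding would be computed for the protein-pocket graph and one for the molecule graph, then combined (e.g. concatenated) and fed to a regression head.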


Conflict of interest statement

Competing interests The authors declare no competing interests.

Figures

Extended Data Fig. 1 ∣
Extended Data Fig. 1 ∣. Prediction performance evaluation on KIBA dataset.
(a) Four train-test split settings of evaluation, where the model is evaluated on data of unseen drugs (‘new-drug split’), unseen proteins (‘new-protein split’) or both (‘both-new split’), and unseen proteins with low (<50%) sequence identity (‘seq-id split’). (b) Comparisons of prediction performance of KDBNet with KronRLS, DeepDTA, GraphDTA, DGraphDTA, EnzPred and ConPLex on the KIBA dataset using four train-test split settings. The performance of GP is not shown because it was not evaluated in the original study and running GP at the scale of the KIBA dataset is computationally costly owing to the high memory footprint of kernel computation. Performances were evaluated using three metrics: Pearson correlation, Spearman correlation and mean squared error (MSE) between predicted and true KIBA scores. All bar plots represent the mean ± SD of evaluation results on five random train/test splits. Abbreviations: seq. id.: sequence identity.
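The ‘both-new split’ described above holds out drugs and proteins simultaneously, so no test drug or protein appears in training. A minimal sketch of how such a split could be constructed from a list of (drug, protein, affinity) triples is shown below; the function name and the choice to drop pairs that mix seen and unseen entities are illustrative assumptions, not the paper's exact procedure.

```python
import random

def both_new_split(pairs, test_frac=0.2, seed=0):
    """Hold out a random subset of drugs and of proteins. Test pairs
    involve only held-out drugs AND held-out proteins ('both-new');
    train pairs involve only non-held-out ones. Pairs mixing seen and
    unseen entities are dropped to keep the split strict."""
    rng = random.Random(seed)
    drugs = sorted({d for d, p, _ in pairs})
    prots = sorted({p for d, p, _ in pairs})
    test_drugs = set(rng.sample(drugs, max(1, int(test_frac * len(drugs)))))
    test_prots = set(rng.sample(prots, max(1, int(test_frac * len(prots)))))
    train = [x for x in pairs if x[0] not in test_drugs and x[1] not in test_prots]
    test = [x for x in pairs if x[0] in test_drugs and x[1] in test_prots]
    return train, test
```

The ‘new-drug’ and ‘new-protein’ splits follow the same pattern but hold out only one of the two entity sets.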
Fig. 1 |
Fig. 1 |. Overview of KDBNet.
KDBNet is a neural network that integrates protein 3D structure and compound 3D structure to predict compound–protein binding affinity. KDBNet derives a set of features, including sequence (seq), evolutionary representations and 3D-invariant geometric features, on the basis of the input 3D structure and uses a GNN to learn structure-aware representations of a protein. For the input compound, KDBNet uses a 3D-equivariant GNN to directly learn structure representations from the compound’s coordinates in 3D space. The representations of the input protein and compound are then used to predict the binding affinity as well as the uncertainty of the prediction.
Fig. 2 |
Fig. 2 |. KDBNet achieves accurate prediction of kinase–drug binding affinity.
a, Four train–test split evaluation settings in which the model is evaluated on data of unseen drugs (‘new-drug split’), unseen proteins (‘new-protein split’) or both (‘both-new split’) and unseen proteins with low (<50%) sequence identity (‘seq-id split’). b, Comparison of KDBNet prediction performance with KronRLS, DeepDTA, GraphDTA, DGraphDTA and GP on the four train–test split settings. c, Comparisons between KDBNet variants that use or do not use 3D structure data on the both-new split. When 3D drug structure is not used, the 2D molecule graph parsed from a SMILES string is used as the representation of the input drug, and no 3D geometric features are used in the molecule GNN. When 3D protein structure is not used, the sequence is used as the representation of the input protein, and the protein GNN is replaced by a convolutional neural network. The full model uses both 3D drug and protein structures. d, Comparisons between KDBNet and three baseline methods that receive 3D binding complex structure as input (GNN3D, CNN3D and SIGN) on the PDBbind dataset. KDBNet differs from them in that it only uses separate 3D drug and protein structures as input: the baseline methods thus have an advantage as they are aware of the protein–compound docking structure through the input complex. Results of methods that receive non-3D input (GraphDTA and DeepDTA) are also shown for comparison. Asterisks indicate the statistical significance (one-sided Mann–Whitney U rank test, P = 0.00397 for both GraphDTA and DeepDTA) that KDBNet’s performance is higher than the baseline’s performance over n = 5 random train/test splits. Bar plots in b–d represent the mean ± s.d. of the evaluation results on five random train/test splits. Pearson correlation and MSE were computed using the predicted and true pKd values.
Fig. 3 |
Fig. 3 |. KDBNet provides accurate and calibrated uncertainty estimation.
a, Prediction errors of KDBNet, GP and GP-MLP, measured as MAE, at different cutoffs of uncertainty percentiles. The x axis represents the sorted uncertainty such that the 100% percentile is the lowest uncertainty (highest confidence). b, Spearman correlation between the estimated uncertainty and the prediction error measured in MAE on the both-new test set. c, Calibration curve. For a confidence interval of confidence level ε (0 ≤ ε ≤ 1), the curve shows the expected fraction and observed fraction of test points that fall within that interval. The diagonal line corresponds to the calibration curve of a perfectly calibrated model. The miscalibration area, defined as the area between a curve and the diagonal line, is used to quantify the uncertainty calibration, and lower values indicate better calibration. d, Calibration curves of KDBNet, KDBNet without recalibration, GP and GP-MLP on test sets. e, Miscalibration area of KDBNet, GP and GP-MLP on the new-protein test set. Solid lines in curve plots represent the mean value of five independent trials, and error bands indicate the s.d. MAE values were measured in pKd values. Bar plots in b and e represent the mean ± s.d. of the evaluation results on five random train/test splits. Error bands in a and c depict mean ± s.d. calculated over five random train/test splits. Recalib., recalibration.
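The calibration curve and miscalibration area described in panel c can be computed directly from a model's predicted means and standard deviations, assuming Gaussian predictive distributions. The sketch below is a minimal illustration of that computation; the function names are ours, and the paper's exact implementation may differ.

```python
import numpy as np
from scipy.stats import norm

def calibration_curve(y_true, y_pred, y_std, levels=np.linspace(0.01, 0.99, 50)):
    """For each confidence level, compute the observed fraction of test
    points whose true value falls inside the model's central interval of
    that level, assuming Gaussian predictive distributions."""
    observed = []
    for level in levels:
        # half-width of the central interval covering `level` probability mass
        half_width = norm.ppf(0.5 + level / 2.0) * y_std
        inside = np.abs(y_true - y_pred) <= half_width
        observed.append(inside.mean())
    return levels, np.array(observed)

def miscalibration_area(levels, observed):
    """Trapezoidal area between the calibration curve and the diagonal;
    lower values indicate better-calibrated uncertainty."""
    gap = np.abs(observed - levels)
    dx = np.diff(levels)
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * dx))
```

For a perfectly calibrated model the observed fractions track the expected levels, so the curve hugs the diagonal and the area approaches zero.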
Fig. 4 |
Fig. 4 |. Leveraging uncertainty for active learning, exploration and exploitation.
a, Schematic visualization of the active learning process, which consists of several rounds of model training, data acquisition and model evaluation. b, Active learning performance in Pearson correlation on the KIBA both-new test set at different rounds. The explorative sampling is compared to the greedy and random sampling strategies. c, Efficiency gain of the explorative and greedy samplings over the random sampling, defined as the relative improvement in Pearson correlation. d, Schematic illustration of data acquisition on the basis of KDBNet’s prediction and uncertainty. One can exploit regions with high-confidence, high-desirability samples or explore potentially high-desirability regions with less model confidence. e, Exploration using KDBNet and UCB acquisition function with a BO framework. Curves represent the performance trajectory, measured by the percentage of top-500 binding affinities found as a function of the number of kinase–compound pairs explored in the Davis dataset. f, Exploitation using KDBNet and LCB acquisition function with BO. True Kd values of the top 10 kinase–drug pairs prioritized by each model are shown. A lower Kd value means a stronger binding affinity. Curve plots in b, c and e depict the mean values over five independent trials in solid lines with s.d. in error bands. Bar plots in f represent mean ± s.d. of the results for five independent trials of top-10 acquisition (n=50).
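The UCB and LCB acquisition functions described in panels e and f reduce to simple score functions over the model's predicted mean and uncertainty. The sketch below illustrates that ranking logic under the convention that higher predicted affinity (e.g. pKd) is better; function names and the `beta` trade-off parameter are illustrative, not taken from the paper's code.

```python
import numpy as np

def ucb_scores(mean, std, beta=1.0):
    """Upper confidence bound: rewards high predicted affinity plus high
    uncertainty, steering acquisition toward exploration."""
    return mean + beta * std

def lcb_scores(mean, std, beta=1.0):
    """Lower confidence bound: rewards pairs that score well even under a
    pessimistic estimate, steering acquisition toward exploitation."""
    return mean - beta * std

def acquire(mean, std, k=10, strategy="ucb", beta=1.0):
    """Return indices of the top-k candidates under the chosen strategy."""
    scores = ucb_scores(mean, std, beta) if strategy == "ucb" else lcb_scores(mean, std, beta)
    return np.argsort(-scores)[:k]
```

In the active-learning loop of panel a, the acquired candidates would be labeled, added to the training set, and the model retrained before the next round.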
