Brief Bioinform. 2023 Nov 22;25(1):bbad491.
doi: 10.1093/bib/bbad491.

Quantification of biases in predictions of protein-protein binding affinity changes upon mutations

Matsvei Tsishyn et al. Brief Bioinform.

Abstract

Understanding the impact of mutations on protein-protein binding affinity is a key objective for a wide range of biotechnological applications and for shedding light on disease-causing mutations, which are often located at protein-protein interfaces. Over the past decade, many computational methods using physics-based and/or machine learning approaches have been developed to predict how protein binding affinity changes upon mutations. They all claim to achieve astonishing accuracy on both training and test sets, with performances on standard benchmarks such as SKEMPI 2.0 that seem overly optimistic. Here we benchmarked eight well-known and well-used predictors and identified their biases and dataset dependencies, using not only SKEMPI 2.0 as a test set but also deep mutagenesis data on the severe acute respiratory syndrome coronavirus 2 spike protein in complex with the human angiotensin-converting enzyme 2. We showed that, even though most of the tested methods reach a significant degree of robustness and accuracy, they suffer from limited generalizability properties and struggle to predict unseen mutations. Interestingly, the generalizability problems are more severe for pure machine learning approaches, while physics-based methods are less affected by this issue. Moreover, undesirable prediction biases toward specific mutation properties, the most marked being toward destabilizing mutations, are also observed and should be carefully considered by method developers. We conclude from our analyses that there is room for improvement in the prediction models and suggest ways to check, assess and improve their generalizability and robustness.

Keywords: machine learning; prediction biases; protein complex structure; protein–protein binding affinity; protein–protein interactions; symmetry principle.

Figures

Figure 1
Characteristics of the S[formula] dataset. (A) Number of occurrences of mutation types; (B) Distribution of the experimental $\Delta\Delta G$ values (in kcal/mol).
Figure 2
Pearson correlations between experimental and predicted $\Delta\Delta G$ values on direct (in blue) and reverse (in orange) mutations of S[formula] (left) and C[formula] (right).
Figure 3
Predicted $\Delta\Delta G$ values as a function of experimental $\Delta\Delta G$ values (in kcal/mol) for the datasets S[formula]-D (blue dots) and S[formula]-R (orange dots). Predictions are obtained with mCSM-PPI2, MutaBind2, BeAtMuSiC, SSIPe, SAAMBE-3D, NetTree, BindProfX and FoldX.
Figure 4
Relation between the covering ratio and the Pearson correlation between predicted and experimental $\Delta\Delta G$ values on the S[formula]-D set for six benchmarked predictors. The linear regression line (dashed) and coefficient of determination ($R^2$) are indicated.
Figure 5
Distribution of the shift [formula] (in kcal/mol) for the eight benchmarked predictors calculated for mutations from C[formula]. The vertical blue dashed lines indicate [formula] and the vertical red dashed lines, the value of [formula].
Figure 6
Normalized RMSE of the eight predictors on subsets of S[formula]-D. Subsets were defined based on (a) mutation type: mutations toward Ala (A) versus other mutations (nA); (b) mutation location: mutations at the interface (I) versus other mutations (nI); (c) complex type: mutations on dimeric complexes (D) versus mutations on multi-n-meric complexes ([formula]) (nD).

