PLoS One. 2014 Sep 22;9(9):e107353.
doi: 10.1371/journal.pone.0107353. eCollection 2014.

Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation

Niklas Berliner et al. PLoS One.

Abstract

Advances in sequencing have led to a rapid accumulation of mutations, some of which are associated with diseases. However, to draw mechanistic conclusions, a biochemical understanding of these mutations is necessary. For coding mutations, accurate prediction of significant changes in either the stability of proteins or their affinity to their binding partners is required. Traditional methods have used semi-empirical force fields, while newer methods employ machine learning of sequence and structural features. Here, we show how combining both of these approaches leads to a marked boost in accuracy. We introduce ELASPIC, a novel ensemble machine learning approach that is able to predict stability effects upon mutation in both domain cores and domain-domain interfaces. We combine semi-empirical energy terms, sequence conservation, and a wide variety of molecular details with a Stochastic Gradient Boosting of Decision Trees (SGB-DT) algorithm. The accuracy of our predictions surpasses existing methods by a considerable margin, achieving correlation coefficients of 0.77 for stability and 0.75 for affinity predictions. Notably, we integrated homology modeling to enable proteome-wide prediction and show that accurate prediction on modeled structures is possible. Lastly, ELASPIC showed significant differences between various types of disease-associated mutations, as well as between disease and common neutral mutations. Unlike pure sequence-based prediction methods that try to predict phenotypic effects of mutations, our predictions unravel the molecular details governing protein instability and help us better understand the molecular causes of diseases.
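To make the idea of stochastic gradient boosting of decision trees concrete, the following is a minimal sketch, not the ELASPIC implementation. It assumes squared-error loss, depth-one trees (stumps), and a synthetic dataset; the three feature columns (an energy term, a conservation score, and a burial value), the target function, and all parameter values are hypothetical stand-ins chosen only for illustration.

```python
import random


def fit_stump(X, residuals, rows):
    """Fit a depth-one regression tree to `residuals` using only the sampled `rows`."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [residuals[i] for i in rows if X[i][j] <= thr]
            right = [residuals[i] for i in rows if X[i][j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    if best is None:
        # No split possible (all sampled rows identical): fall back to a constant.
        mean = sum(residuals[i] for i in rows) / len(rows)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    j, thr, left_value, right_value = stump
    return left_value if x[j] <= thr else right_value


def fit_sgb(X, y, n_trees=100, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting: each stump sees a random subsample of rows."""
    rng = random.Random(seed)
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    n_sub = max(2, int(subsample * len(y)))
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the plain residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        rows = rng.sample(range(len(y)), n_sub)
        stump = fit_stump(X, residuals, rows)
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, x)
                for p, x in zip(pred, X)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Synthetic data (hypothetical): columns stand in for an energy term,
# a conservation score, and a burial value; y stands in for a ddG value.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(80)]
y = [1.5 * x[0] - 1.0 * x[1] + data_rng.gauss(0, 0.1) for x in X]

model = fit_sgb(X, y)
mean_y = sum(y) / len(y)
mse_base = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_model = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(X, y)) / len(y)
print(f"baseline MSE: {mse_base:.3f}, boosted MSE: {mse_model:.3f}")
```

The `subsample` fraction is what makes the procedure stochastic: every tree is fitted on a different random subset of the training rows, which tends to reduce overfitting compared with fitting each tree on the full set. The errors printed here are training errors on toy data and say nothing about the accuracy reported in the abstract.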

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. ELASPIC methodology.
Schematic view of the strategy used to derive predictive features and train and validate ELASPIC for the prediction of stability effects in domain core and domain-domain interfaces upon mutation.
Figure 2. Summary of the results.
Correlation between predicted and experimental ΔΔG values for our curated ProTherm core dataset (A) and SKEMPI interface dataset (B). (C) Comparative histograms of the Pearson correlation among several state-of-the-art methods using three versions of ProTherm datasets for the core predictions, and SKEMPI dataset for the interface prediction.
Figure 3. Feature importance for core and interface predictions.
Histogram representing the relative importance of the different features for core predictions (A) and interface predictions (B). To avoid cluttering, only features with a relative importance of 10% or larger were considered and coloured according to the three categories. Abbreviations: t: torsional, diS: disulfide, E: electrostatics, ion: ionization, dS: entropy, Hdipole: helix dipole, cb: covalent bond, sb: salt bridge, hb: hydrogen bond, cisb: cysteine bond, wb: water bridge, vdW: van der Waals, mc: main chain, sc: side chain, if: interface, dm: domain, sasa: solvent accessibility, solv: solvation, ap: apolar, po: polar (see Table S1 for feature description).
Figure 4. Summary of stability prediction of nsSNP mutations.
Predicted absolute ΔΔGDT box plots (right) are shown for (A) core and (B) interface mutations and the three types of mutations (Hapmap, OMIM and COSMIC driver/passenger).
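The accuracy figures quoted in the abstract and in Figure 2 are Pearson correlation coefficients between predicted and experimental ΔΔG values. As an illustrative sketch of that metric, the snippet below computes it from scratch; the five value pairs are invented placeholders, not data from ProTherm or SKEMPI.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical ddG values (kcal/mol), for illustration only.
experimental = [0.5, 1.2, -0.3, 2.1, 0.0]
predicted = [0.7, 0.9, -0.1, 1.8, 0.4]

r = pearson_r(experimental, predicted)
print(f"Pearson r = {r:.2f}")
```

A coefficient near 1 means the predictions rank and scale with the measurements almost linearly; it does not by itself show that the predicted values are free of a constant offset.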
