Review
Biomolecules. 2022 Sep 6;12(9):1246.
doi: 10.3390/biom12091246.

Protein Function Analysis through Machine Learning

Chris Avery et al. Biomolecules.

Abstract

Machine learning (ML) has been an important tool in computational biology for decades, used to elucidate protein function. With the recent burgeoning of novel ML methods and applications, new ML approaches have been incorporated into many areas of computational biology that deal with protein function. We examine how ML has been integrated into a wide range of computational models to improve prediction accuracy and gain a better understanding of protein function. The applications discussed are protein structure prediction; protein engineering using sequence modifications to achieve stability and druggability characteristics; molecular docking in terms of protein–ligand binding, including allosteric effects; protein–protein interactions; and protein-centric drug discovery. To quantify the mechanisms underlying protein function, a holistic approach that takes structure, flexibility, stability, and dynamics into account is required, because these aspects are interdependent and therefore inseparable. Another key component of protein function is conformational dynamics, which often manifest as protein kinetics. This review includes computational methods that use ML to generate representative conformational ensembles and to quantify the differences between conformational ensembles that are important for function. Future opportunities are highlighted for each of these topics.

Keywords: allostery; conformational sampling; force fields; machine learning; molecular docking; protein dynamics; protein function; protein structure prediction; protein–protein interactions.

Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figures

Figure 1
Machine learning meets computational biology. This review is about how machine learning (gray center circle) intersects with multiple aspects of computational biology (colored circles).
Figure 2
Four pillars of computational biology: The role of protein structure is critical in all panels. Conformational ensembles, red panel: ML helps with (i) enhanced sampling; (ii) identifying collective variables; (iii) automated potential biasing; and (iv) Markovian state space partitioning. Protein stability, green panel: ML helps with (i) modeling the role of environment; (ii) protein engineering through mutagenesis; (iii) characterization of protein–protein interactions; and (iv) the role of rigidity for protein function. Protein dynamics, yellow panel: ML helps with (i) protein flexibility/conformational dynamics; (ii) dynamic allostery; and (iii) potential energy/force fields. Drug discovery, blue panel: ML helps with (i) molecular docking; and (ii) binding affinity prediction.
Figure 3
Top five competitors per year (x-axis) in CASP12–14. Performance in each competition was based primarily on the sum of positive Z-scores (y-axis) with respect to GDT_TS for each of the proposed structure prediction models. The accuracy metric, GDT_TS, is a multiscale indicator of the proximity of Cα atoms in a model to those in the corresponding experimental structure.
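As a rough illustration of the two quantities named in the Figure 3 caption, the sketch below computes a simplified GDT_TS and a sum of positive Z-scores. It assumes the model and experimental Cα coordinates are already superposed and matched residue by residue (the full GDT procedure searches over many superpositions), and it uses the common 1, 2, 4, and 8 Å cutoffs. The function names and the Z-score handling are illustrative assumptions, not the official CASP assessment code.

```python
import math
from statistics import mean, pstdev


def gdt_ts(model_ca, ref_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS on a 0-100 scale.

    Assumes both coordinate lists are already superposed and aligned
    residue by residue; no superposition search is performed here.
    """
    if len(model_ca) != len(ref_ca) or not ref_ca:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    # Distance between each model C-alpha atom and its experimental counterpart.
    dists = [math.dist(m, r) for m, r in zip(model_ca, ref_ca)]
    # Fraction of residues within each distance cutoff (in angstroms).
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * mean(fractions)


def sum_positive_z(scores_by_group):
    """Sum of positive per-target Z-scores for each group.

    scores_by_group maps a (hypothetical) group name to a list of
    per-target scores, with targets in the same order for every group.
    """
    groups = list(scores_by_group)
    n_targets = len(scores_by_group[groups[0]])
    totals = {g: 0.0 for g in groups}
    for t in range(n_targets):
        column = [scores_by_group[g][t] for g in groups]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # every group scored the same; no one gains on this target
        for g, s in zip(groups, column):
            # Negative Z-scores are clipped to zero before summing.
            totals[g] += max(0.0, (s - mu) / sigma)
    return totals


if __name__ == "__main__":
    ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    model = [(0.0, 0.5, 0.0), (1.0, 1.5, 0.0), (2.0, 3.0, 0.0), (3.0, 10.0, 0.0)]
    print(gdt_ts(model, ref))
    print(sum_positive_z({"A": [90, 80], "B": [50, 80], "C": [10, 80]}))
```

Clipping negative Z-scores means a group is not penalized for a single very poor model, which is one reason a summed positive Z-score can rank groups differently from a plain average of GDT_TS.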
Figure 4
Time scales of protein function. Protein function occurs on time scales that span many orders of magnitude. MD simulation time steps are limited to the femtosecond range, necessitating enhanced sampling methods for the analysis of long time scale processes. Enhanced sampling typically starts with an unsupervised simulation of proteins to initially explore conformation space. Clustering and dimensionality reduction inform methods such as metadynamics on how to bias further simulations for the exploration of unsampled or poorly sampled regions in the free-energy landscape. Machine learning has been deployed to achieve better and faster sampling of this landscape.
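The gap between femtosecond time steps and long-time-scale processes can be made concrete with a little arithmetic. The sketch below assumes a 2 fs time step and a throughput of 100 ns of simulated time per day; both numbers are illustrative assumptions rather than values taken from the review, and the function names are hypothetical.

```python
# Femtoseconds per unit, kept as integers so the step count is exact.
FS_PER_UNIT = {
    "fs": 1,
    "ps": 10**3,
    "ns": 10**6,
    "us": 10**9,
    "ms": 10**12,
    "s": 10**15,
}


def steps_needed(duration, unit, timestep_fs=2):
    """Number of MD integration steps needed to cover the given duration.

    timestep_fs is an assumed integer time step in femtoseconds.
    """
    total_fs = duration * FS_PER_UNIT[unit]
    # Ceiling division: a partial final step still has to be computed.
    return -(-total_fs // timestep_fs)


def wallclock_days(duration, unit, ns_per_day=100.0):
    """Days of wall-clock time at an assumed throughput in ns/day."""
    total_ns = duration * FS_PER_UNIT[unit] / FS_PER_UNIT["ns"]
    return total_ns / ns_per_day


if __name__ == "__main__":
    for unit in ("ns", "us", "ms"):
        print(unit, steps_needed(1, unit), wallclock_days(1, unit))
```

Under these assumptions, a single millisecond of simulated time takes 5 × 10^11 integration steps, which is why unbiased simulation alone rarely reaches the slower functional motions.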
Figure 5
Protein stability paradigm for globular proteins. Mechanical and thermodynamic stability are intimately related. The native state of globular proteins is driven by favorable enthalpic interactions through the hydrogen bond network and packing interactions. The unfolded state is driven by conformational entropy, associated with an increase in conformational flexibility. The transition state represents a mixture of opposing thermodynamic and mechanical elements, which determines protein folding pathways.
Figure 6
Examples of protein functional dynamics. In (a–c), three critical conformational states [297,298,299] of calmodulin are shown during the process of ligand binding. The unbound structure (a) binds to calcium ions (b), then the resulting structure is able to bind with a substrate (c) [300]. On the bottom row, (d,e) show the native state motions of proteins including surface loop flexibility [301,302] and concerted domain fluctuations [303]. (f) A visualization of dynamic allostery pathways resulting from changes to the vibrational modes of a protein upon binding a ligand [304].
Figure 7
Computational drug discovery relies on the interplay between molecular docking and binding affinity prediction, both of which have been enhanced by ML. Docking methods can use ML to account for subtleties in molecular interaction such as flexibility or predict inter-molecular contacts, while ML-powered binding affinity functions score poses. In both cases, experimental data are used to train models, and methods will continue to improve as more data are collected.

References

    1. Jarvis R.A., Patrick E.A. Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Trans. Comput. 1973;C-22:1025–1034. doi: 10.1109/T-C.1973.223640.
    2. Sturm B.L., Ben-Tal O., Monaghan Ú., Collins N., Herremans D., Chew E., Hadjeres G., Deruty E., Pachet F. Machine learning research that matters for music creation: A case study. J. New Music Res. 2019;48:36–55. doi: 10.1080/09298215.2018.1515233.
    3. Rodolfa K.T., Lamba H., Ghani R. Empirical observation of negligible fairness–accuracy trade-offs in machine learning for public policy. Nat. Mach. Intell. 2021;3:896–904. doi: 10.1038/s42256-021-00396-x.
    4. Brook T. Music, Art, Machine Learning, and Standardization. Leonardo. 2021:1–11. doi: 10.1162/leon_a_02135.
    5. Xu C., Jackson S.A. Machine learning and complex biological data. Genome Biol. 2019;20:76. doi: 10.1186/s13059-019-1689-0.
