Review
Comput Struct Biotechnol J. 2022 Dec 29:21:630-643.
doi: 10.1016/j.csbj.2022.12.039. eCollection 2023.

Beyond sequence: Structure-based machine learning


Janani Durairaj et al. Comput Struct Biotechnol J.

Abstract

Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods that make use of structures are scattered across the literature and cover a number of different applications and scopes: some try to address questions and tasks within a single protein family, while others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.

Keywords: Deep learning; Machine learning; Protein structures.


Conflict of interest statement

We have no conflicts of interest to disclose.

Figures

Graphical abstract
Fig. 1
Common steps in structure-based machine learning. A) Starting from a set of protein sequences, structural models can either be retrieved from the PDB or constructed using computational approaches. B) A number of different feature extraction, feature engineering, or pre-trained embedding approaches can then be used C) to extract a matrix representation of the input, with the rows as data points and columns representing features or embedding values. D) This matrix forms the input for ML models resulting in predictions of classes, regression values, or unsupervised clustering and dimensionality reduction. E) Prediction results, combined with the trained model, can be used to inspect and interpret regions of the protein structure relevant for the task at hand.
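Steps B to D of Fig. 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the review: the three hand-crafted features (chain length, radius of gyration, mean contacts per residue), the 8 Å contact cutoff, the toy coordinates, and the nearest-centroid classifier standing in for the ML model. Real pipelines would read coordinates from PDB or model files and would typically use an established ML library.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)

def featurize(coords, cutoff=8.0):
    """One row of the feature matrix (step C): hypothetical per-protein features."""
    n = len(coords)
    contacts = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) <= cutoff
    )
    return [float(n), radius_of_gyration(coords), 2.0 * contacts / n]

def fit_centroids(X, y):
    """Toy 'model' (step D): the mean feature vector of each class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

def predict(centroids, row):
    """Assign the class whose centroid is closest in feature space."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))

# Toy stand-ins for residue coordinates (not real protein structures).
def extended_chain(n, spacing=3.8):
    return [(spacing * i, 0.0, 0.0) for i in range(n)]

def cube(spacing):
    return [(x, y, z) for x in (0.0, spacing)
            for y in (0.0, spacing) for z in (0.0, spacing)]

structures = [extended_chain(8), extended_chain(10), cube(3.8), cube(4.0)]
labels = ["extended", "extended", "compact", "compact"]

# Rows are data points (structures), columns are features.
X = [featurize(s) for s in structures]
model = fit_centroids(X, labels)
prediction = predict(model, featurize(extended_chain(9)))
```

Here `X` plays the role of the matrix in panel C, and `prediction` is a class prediction as in panel D. Swapping `featurize` for a pre-trained embedding would leave the rest of the flow unchanged.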
Fig. 2
Different approaches for computational representation of a protein structure that go beyond features of individual residues. For A-D, features or representations calculated across individual blocks (respectively: spheres, grids, polyhedra, surface patches) are used as input to ML, while for E-F, the entire matrix or graph is often used in methods specifically designed for these kinds of inputs. A) Overlapping spheres. B) 3D voxel grids. C) Geometric tessellations. D) Molecular surface representations. E) Distance/contact maps. F) Graph representations.
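As a rough illustration of panels E and F, the sketch below derives a distance map, a binary contact map, and a residue graph from a list of per-residue coordinates. The 8 Å cutoff, the use of one point per residue (for example the C-alpha atom), and the toy coordinates are assumptions chosen for the example, not values taken from the review.

```python
import math

def distance_map(coords):
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(coords, cutoff=8.0):
    """Binary matrix: 1 where two distinct residues lie within `cutoff`."""
    dmap = distance_map(coords)
    n = len(coords)
    return [
        [1 if i != j and dmap[i][j] <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]

def contact_graph(coords, cutoff=8.0):
    """Residue graph as an adjacency list: nodes are residues, edges are contacts."""
    cmap = contact_map(coords, cutoff)
    return {i: [j for j, c in enumerate(row) if c] for i, row in enumerate(cmap)}

# Hypothetical four-residue chain laid out along the x axis, 3.8 apart.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
graph = contact_graph(toy)
```

The contact map is a thresholded distance map, and the graph is the same information stored as edges. Methods that consume graphs usually attach node and edge features as well, which this sketch leaves out.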

