Review
Genomics Proteomics Bioinformatics. 2023 Oct;21(5):913-925.
doi: 10.1016/j.gpb.2022.11.014. Epub 2023 Mar 30.

Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms

Bin Huang et al. Genomics Proteomics Bioinformatics. 2023 Oct.

Abstract

Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start by assuming a probability distribution of protein structures given a target sequence and then find the most likely structure; and computer scientists formulate protein structure prediction as an optimization problem, either finding the structural conformation with the lowest energy or minimizing the difference between the predicted structure and the native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts in protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories that interpret the neural networks and knowledge of protein folding are still highly desired.

Keywords: Deep learning; Language model; Protein folding; Protein structure prediction; Transformer.
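
The formulations named in the abstract can be written compactly. The following is only an illustrative sketch: the symbols x (target sequence), S (candidate structure), E (an energy function), f_θ (a parametric predictor), and d (a structural distance) are notation assumed here, not taken from the review.

```latex
% Statistical (data modeling) view: the most likely structure given the sequence
\hat{S} = \arg\max_{S} \; P(S \mid x)

% Optimization view, variant 1: the lowest-energy conformation
\hat{S} = \arg\min_{S} \; E(S; x)

% Optimization view, variant 2: fit a predictor to known native structures
\theta^{*} = \arg\min_{\theta} \; \sum_{i} d\bigl(f_{\theta}(x_i),\, S_i^{\mathrm{native}}\bigr)

% If one assumes a Boltzmann-type distribution, the first two coincide:
P(S \mid x) = \frac{\exp\bigl(-E(S; x) / k_{B}T\bigr)}{Z(x)}
\;\;\Longrightarrow\;\;
\arg\max_{S} P(S \mid x) = \arg\min_{S} E(S; x)
```

Under the Boltzmann assumption the most probable structure is also the lowest-energy one, which is one way to see that the statistical and the energy-minimization formulations target the same answer while differing in what is modeled explicitly.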

Conflict of interest statement

Fusong Ju and Jianwei Zhu are current employees of Microsoft Corp. Qi Zhang is a current employee of Huawei Technologies Co., Ltd. All the other authors have declared no competing interests.

Figures

Figure 1
Protein sequence, protein structure, and protein structure prediction. A. An example of a protein sequence and its tertiary structure. Here, we show a C-terminal fragment of the ribosomal protein L7/L12 from Escherichia coli (PDB: 1CTF), which consists of 74 residues linked via peptide bonds. The tertiary structure specifies the unique 3D position of each atom relative to the whole protein. The cartoon backbone representation is widely used to visualize protein tertiary structure. B. Homology modeling method for protein structure prediction. C. Threading method for protein structure prediction. D. Ab initio prediction approach. PDB, Protein Data Bank; 3D, 3-dimensional.
Figure 2
Chronological diagram of the representative approaches to protein structure prediction. Here, homology modeling approaches are shown in red, template-based approaches are shown in green, ab initio approaches are shown in blue, and other techniques are shown in black.
Figure 3
Performance of representative approaches to protein structure prediction. A. Performance of the prediction approaches in previous CASPs. Trendlines indicate the agreement of the target protein backbone for the best-predicted structures with that of the native structures in the last 14 CASP rounds; open circles indicate the individual data points for CASP14. Target difficulty is based on sequence and structural similarity to existing experimental protein structures, which was adapted from with permission. B. Prediction performance of AlphaFold2 for 20,296 human proteins covering 10,537,122 residues. For each protein, AlphaFold2 outputs a pLDDT score as an estimate of the prediction quality. For nearly 36% of proteins, AlphaFold2 predicts their structures with high confidence (pLDDT ≥ 90). The data were taken from . C. The performance of the prediction approaches using MSAs or a single sequence as input. On 29 selected CASP-free modeling targets, AlphaFold2 and RoseTTAFold show excellent accuracy when using MSAs of the query proteins as input. However, their accuracy drops sharply when a single sequence of the query protein is the only input. In contrast, OmegaFold and ProFOLD Single, approaches specially designed for single-sequence prediction, achieve accuracies close to those of the approaches using MSAs. It should be noted that the accuracy of ProFOLD Single was measured on CASP14 target proteins to avoid overlap between training and test data, which was adapted from with modifications. CASP, Critical Assessment of Structure Prediction; GDT_TS, Global Distance Test-Total Score; MSA, multiple sequence alignment; pLDDT, predicted Local Distance Difference Test.
Figure 4
Strong structural signals in protein Se0862 (PDB: 6UF2). Three types of regions might carry strong structural signals: a single helical turn (blue), a β-turn (red), and a pair of secondary structural elements in contact with each other (purple).
Figure 5
An example of an inter-residue contact in GFP (PDB: 4EUL) and co-mutations observed in its homologs. Two residues in contact, 55V–106Y (shown in red), co-mutate to 55I–106F (in green) to maintain the contact between them; in turn, the co-mutations observed in homologous proteins can be exploited to infer inter-residue contacts. To demonstrate this, we use ProDESIGN-LE2, a protein sequence design method, to design four sequences (P1–P4) for the structure of GFP. As the design process of ProDESIGN-LE2 resembles the evolution of the target protein, the resulting designed sequences can be used as an approximation of the homologs of the target protein. ProDESIGN-LE2 is an improved version of ProDESIGN-LE. GFP, green fluorescent protein.
Figure 6
Predicted structures for CASP14 targets T1049-D1, T1031-D1, and T1067-D1 by AlphaFold2, BAKER, Zhang-Server, and RaptorX. For each representative target (in rows) in a target group defined in CASP14 and each prediction method (in columns), the alignment between the predicted structure (red) and the native structure (blue) is shown. Targets are mainly classified into TBM and FM categories using their prediction quality and template detectability. TBM, template-based modeling; FM, free modeling; TM-score, template modeling score.
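
The co-mutation idea in the Figure 5 caption (two contacting positions that change together across homologs) can be illustrated with a small sketch. It scores every pair of alignment columns by mutual information and ranks the pairs. Everything here is an assumption made for illustration: the function names are hypothetical, the five-residue alignment is made-up toy data rather than GFP homologs, and mutual information is a simple stand-in, not the method used in the review. Practical contact predictors rely on more elaborate statistics or neural networks.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def rank_column_pairs(msa):
    """Score all column pairs and sort them, highest co-variation first."""
    length = len(msa[0])
    scores = [
        ((i, j), mutual_information(column(msa, i), column(msa, j)))
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy alignment (hypothetical data): positions 1 and 4 switch together
# (V with Y, I with F), position 2 varies independently, and
# positions 0 and 3 are fully conserved.
TOY_MSA = [
    "AVKGY",
    "AVRGY",
    "AIKGF",
    "AIRGF",
]

if __name__ == "__main__":
    for (i, j), score in rank_column_pairs(TOY_MSA)[:3]:
        print(f"columns {i} and {j}: MI = {score:.2f} bits")
```

In this toy alignment the pair of positions 1 and 4 receives the highest score, while conserved or independently varying columns score zero. This mirrors the caption's reasoning that correlated substitutions hint at a contact, and it also shows why fully conserved positions carry no co-mutation signal.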
