Review
J Biol Chem. 2021 Jul;297(1):100870.
doi: 10.1016/j.jbc.2021.100870. Epub 2021 Jun 11.

Toward the solution of the protein structure prediction problem

Robin Pearce et al. J Biol Chem. 2021 Jul.

Abstract

Since Anfinsen demonstrated in 1973 that the information encoded in a protein's amino acid sequence determines its structure, solving the protein structure prediction problem has been the Holy Grail of structural biology. The goal of protein structure prediction is to use computational modeling to determine the spatial location of every atom in a protein molecule starting from only its amino acid sequence. Depending on whether homologous structures can be found in the Protein Data Bank (PDB), structure prediction methods have historically been categorized as template-based modeling (TBM) or template-free modeling (FM) approaches. Until recently, TBM was the most reliable approach to predicting protein structures, and modeling accuracy declined sharply in the absence of reliable templates. Nevertheless, the results of the most recent community-wide Critical Assessment of protein Structure Prediction experiment (CASP14) have demonstrated that the protein structure prediction problem can be largely solved through the use of end-to-end deep machine learning techniques, with correct folds built for nearly all single-domain proteins without using PDB templates. Critically, model quality exhibited little correlation with either the quality of available template structures or the number of sequence homologs detected for a given target protein. Thus, deep-learning techniques have essentially broken through the 50-year-old modeling border between TBM and FM approaches and have made the success of high-resolution structure prediction significantly less dependent on template availability in the PDB library.

Keywords: contact map; deep learning; distance prediction; end-to-end structure prediction; free modeling; multiple sequence alignment; protein structure prediction; template-based modeling.

Conflict of interest statement

The authors declare that they have no conflicts of interest with the contents of this article.

Figures

Figure 1
Important milestones in protein structure prediction that are covered in this review.
Figure 2
Typical steps in a homology-based modeling pipeline. Starting from a query sequence, templates are identified using sequence-based alignment algorithms. Then the structural framework of the best template alignment is copied, and the unaligned regions are constructed to produce the final model.
Figure 3
Typical steps in template/fragment assembly and gradient descent-based protein structure prediction pipelines. Starting from a query sequence, a multiple sequence alignment (MSA) is constructed by identifying homologous sequences from a sequence database. Then, using profiles or predicted structural features derived from the MSA, either global template structures (for TBM) or local fragments (for FM) are identified from databases of solved protein structures. Additionally, coevolutionary analysis of the MSA is fed into deep neural networks to predict pairwise restraints such as distance maps, interresidue orientations, and hydrogen bond networks. The structure assembly stage may assemble the local fragments or global template structures, or directly minimize the structure using rapid gradient descent methods. From here, the final model may be selected by clustering the conformations generated during the structure assembly stage or by identifying the lowest energy structure, which is further refined using atomic-level refinement simulations to produce a final model.
Figure 4
Interresidue spatial restraints that are often used to assist protein 3D structure assembly simulations. The protein backbone atoms include the N, Cα, and C atoms, while the side chains include the Cβ atoms (except in glycine) as well as the R groups, which distinguish the different amino acid residues. A, Cα/Cβ contacts and distances; B, interresidue torsion angles; C, hydrogen bond networks. Here, the backbone hydrogen bonds are represented using a Cα-based model, where three consecutive Cα atoms form a local coordinate system, from which various vectors and their orientations represent regular hydrogen bonding patterns observed in native proteins. D, typical pipeline for spatial restraint prediction. Starting from the amino acid sequence of a target protein, homologous protein sequences are collected from sequence databases and compiled to form a multiple sequence alignment (MSA). From the MSA, coevolutionary relationships are deduced and fed into a deep neural network, which may output the predicted contact/distance maps, interresidue orientations, and hydrogen bond networks.
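To make the contact and distance restraints of panel A concrete, the following is a minimal sketch of how a distance map and a binary contact map could be derived from per-residue coordinates (for example, Cβ atoms, or Cα for glycine). The function names and the 8 Å cutoff are assumptions for illustration; the cutoff is a commonly used convention and is not stated in the caption above.

```python
import math

def distance_map(coords):
    """Return the symmetric matrix of pairwise Euclidean distances.

    coords: list of (x, y, z) tuples, one per residue (hypothetical input,
    e.g. C-beta coordinates in angstroms).
    """
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(coords, cutoff=8.0):
    """Mark residue pairs closer than `cutoff` as contacts.

    The 8.0 angstrom default is an assumed convention, not taken from the text.
    """
    return [[d < cutoff for d in row] for row in distance_map(coords)]
```

A predicted distance map carries more information than the thresholded contact map, since the latter discards everything except whether a pair falls under the cutoff.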
Figure 5
Summary of contact map prediction results in CASP11 to 14. A, contact prediction results for different groups on all FM and FM/TBM targets. Groups are sorted in descending order of the average precision of their top L/5 long-range contacts, where L is the protein length and long-range contacts occur between positions that are separated by at least 24 residues. B, relationship between contact prediction precision and the MSA Neff value obtained by the DeepMSA program (184), where lines are the best fit on the individual targets by linear regression.
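The top L/5 long-range precision described in this caption can be sketched as follows. This is an illustrative implementation under stated assumptions, not the CASP assessors' code: the function name, the square-matrix input format, and the handling of very short proteins are all hypothetical choices.

```python
def top_l5_long_range_precision(pred_probs, native_contacts, min_sep=24):
    """Precision of the top L/5 predicted long-range contacts.

    pred_probs: L x L matrix of predicted contact probabilities (assumed format).
    native_contacts: L x L matrix of booleans from the experimental structure.
    min_sep: minimum sequence separation for a long-range pair (24, per the caption).
    """
    length = len(pred_probs)
    # Collect each long-range pair once (i < j, j - i >= min_sep).
    pairs = [
        (pred_probs[i][j], i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    if not pairs:
        return 0.0  # assumed behavior when the protein is too short
    # Rank by predicted probability, highest first, and keep the top L/5.
    pairs.sort(key=lambda t: t[0], reverse=True)
    top = pairs[: max(1, length // 5)]
    hits = sum(1 for _, i, j in top if native_contacts[i][j])
    return hits / len(top)
```

Short- and medium-range pairs are excluded before ranking, so a confident prediction for residues close in sequence does not affect the score.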
Figure 6
Summary of structure prediction results in the recent CASP experiments. A, relationship between the best TM score of the first submitted model and the Neff value of the MSA generated by the DeepMSA program (184). B, mean TM score of the best first TBM and FM models submitted in the corresponding CASP competitions. C, results for the best first TBM models (including TBM, TBM-easy, TBM-hard, and FM/TBM) submitted by any group in CASP7/11 to 14, where the models are categorized into one of three categories based on their TM scores: [0, 0.5), [0.5, 0.914], (0.914, 1.0]. D, results for the best first FM models submitted by any group in CASP7/11 to 14, where the models are categorized into one of three categories based on their TM scores: [0, 0.5), [0.5, 0.914], (0.914, 1.0].
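The three TM-score categories used in panels C and D can be expressed as a small binning routine. This is a sketch for illustration only; the function names are hypothetical, and the only values taken from the caption are the interval boundaries 0.5 and 0.914 and which ends are inclusive.

```python
from collections import Counter

def tm_score_bin(tm):
    """Assign a TM score to one of the three intervals from the caption."""
    if not 0.0 <= tm <= 1.0:
        raise ValueError("TM score must lie in [0, 1]")
    if tm < 0.5:
        return "[0, 0.5)"
    if tm <= 0.914:  # both 0.5 and 0.914 belong to the middle interval
        return "[0.5, 0.914]"
    return "(0.914, 1.0]"

def bin_counts(tm_scores):
    """Count how many models fall into each interval."""
    return Counter(tm_score_bin(tm) for tm in tm_scores)
```

For example, `bin_counts([0.42, 0.5, 0.914, 0.95])` places one model in the lowest interval, two in the middle interval, and one in the highest.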
Figure 7
Representative examples of AlphaFold2 on multidomain protein structures in CASP14. The experimental structures are shown in red cartoons, while the predicted models are shown in different colors for different domains. A, modeling results for T1038, where AlphaFold2 achieved excellent performance on both the domain-level and full-length models. B, modeling results for T1052, where the domain-level models achieved an extremely high accuracy, but the full-length assembled structure had incorrect domain orientations.

References

    1. Anfinsen C.B. Principles that govern folding of protein chains. Science. 1973;181:223–230.
    2. Sanger F., Nicklen S., Coulson A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U. S. A. 1977;74:5463–5467.
    3. Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A., Gocayne J.D., Amanatides P., Ballew R.M., Huson D.H., Wortman J.R. The sequence of the human genome. Science. 2001;291:1304–1351.
    4. Metzker M.L. Sequencing technologies - the next generation. Nat. Rev. Genet. 2010;11:31–46.
    5. Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2019;47:D94–D99.
