Review
Curr Opin Struct Biol. 2021 Jun:68:194-207.
doi: 10.1016/j.sbi.2021.01.007. Epub 2021 Feb 24.

Deep learning techniques have significantly impacted protein structure prediction and protein design

Robin Pearce et al. Curr Opin Struct Biol. 2021 Jun.

Abstract

Protein structure prediction and design can be regarded as two inverse processes governed by the same folding principle. Although progress remained stagnant over the past two decades, the recent application of deep neural networks to spatial constraint prediction and end-to-end model training has significantly improved the accuracy of protein structure prediction, largely solving the problem at the fold level for single-domain proteins. The field of protein design has also witnessed dramatic improvement, where notable examples have shown that information stored in neural-network models can be used to advance functional protein design. Thus, incorporating deep learning techniques into the different steps of protein folding and design approaches represents an exciting future direction and should continue to have a transformative impact on both fields.

Conflict of interest statement

Declaration of Interest

The authors declare no conflict of interest.

Figures

Fig 1.
Typical steps involved in template-free and template-based protein structure prediction approaches. Starting from a query sequence, an MSA is generated by identifying homologous sequences from a sequence database. The MSA is then converted into a sequence profile and used to predict structural features such as the secondary structure, backbone torsion angles and solvent accessibility. For fragment assembly-based FM methods, these structural features together with the sequence profile are used to search a fragment library to identify high scoring local fragments. For TBM methods, they are used by threading protocols to identify global template structures. Meanwhile, co-evolutionary information is extracted from the MSA and fed into a deep residual neural network to predict spatial restraints such as inter-residue long-range contacts, distances, hydrogen bonds and torsion angles. For full-length model construction, structure assembly simulations are performed under the guidance of a composite force field which usually combines the generic knowledge- and/or physics-based energy function with deep neural network feature prediction (plus template-based restraints in the case of TBM). Finally, representative models are typically selected from the lowest energy conformations or based on structural clustering, followed by atomic-level refinement to generate the final model.
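The clustering-based model selection mentioned at the end of this caption can be illustrated with a minimal sketch. It assumes a precomputed pairwise distance matrix between decoys (for example RMSD values) and a hypothetical cutoff; the function name `largest_cluster_center` is illustrative and is not taken from any published pipeline.

```python
def largest_cluster_center(dist, cutoff):
    """Return (center_index, member_indices) of the most populated cluster.

    dist   : square list-of-lists of pairwise decoy distances (assumed input)
    cutoff : distance below which two decoys count as neighbors (assumed value)
    """
    n = len(dist)
    # Count, for every decoy, how many decoys (including itself) lie within the cutoff.
    counts = [sum(1 for j in range(n) if dist[i][j] <= cutoff) for i in range(n)]
    # The decoy with the most neighbors is taken as the cluster center.
    center = max(range(n), key=lambda i: counts[i])
    members = [j for j in range(n) if dist[center][j] <= cutoff]
    return center, members


# Toy example: decoys 0-2 are mutually close, decoy 3 is an outlier.
toy_dist = [
    [0.0, 1.0, 3.0, 9.0],
    [1.0, 0.0, 1.0, 9.0],
    [3.0, 1.0, 0.0, 9.0],
    [9.0, 9.0, 9.0, 0.0],
]
center, members = largest_cluster_center(toy_dist, cutoff=2.0)
```

Selecting the lowest-energy conformation instead would amount to taking the minimum over a list of decoy energies; real pipelines typically cluster many thousands of decoys and iterate after removing the first cluster.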
Fig 2.
Domain-level protein structure prediction results for AlphaFold2 in the CASP14 experiment. (A) The first-rank models by AlphaFold2 (green) superposed on the experimental structures (red) for the 23 FM domains, together with the domain ID and TM-score values. The pictures are listed in descending order of the TM-scores of the AlphaFold2 models. (B) TM-score versus Neff, the number of effective sequences in the multiple sequence alignments collected by DeepMSA, for all 89 FM (stars) and TBM and TBM/FM (circles) domains. Dashed and dashed-dotted lines mark the two TM-score cutoffs at 0.5 and 0.914, respectively.
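For readers unfamiliar with the two quantities plotted in panel B, the sketch below shows how they are commonly computed. The TM-score uses the standard form TM = (1/L) Σ 1/(1 + (d_i/d0)^2) with d0 = 1.24 (L - 15)^(1/3) - 1.8, evaluated here for one fixed superposition rather than optimized over superpositions as the real program does. A TM-score above 0.5 is generally taken to indicate the same fold. The Neff function follows one common convention (80% identity weighting, normalized by the square root of the alignment length); the exact definition used by DeepMSA is an assumption here, as are the function names and the lower bound of 0.5 on d0.

```python
import math


def tm_d0(length):
    """Length-dependent distance scale d0 (clamped at 0.5 A, an assumption)."""
    if length <= 15:
        return 0.5
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, target_length):
    """TM-score for a fixed superposition.

    distances     : per-residue distances (in A) between aligned residue pairs
    target_length : length of the reference (experimental) structure
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def neff(msa, identity_cutoff=0.8):
    """Number of effective sequences for an MSA of equal-length aligned strings."""
    length = len(msa[0])
    total = 0.0
    for s in msa:
        # Count sequences (including s itself) at or above the identity cutoff.
        similar = sum(
            1 for t in msa
            if sum(a == b for a, b in zip(s, t)) >= identity_cutoff * length
        )
        total += 1.0 / similar
    return total / math.sqrt(length)


perfect = tm_score([0.0] * 50, 50)      # every residue aligned at zero distance
half = tm_score([0.0] * 25, 50)         # only half of the residues aligned
toy_neff = neff(["ACDE", "ACDE", "WWWW"])
```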
Fig 3.
Typical steps involved in a fragment assembly-based approach to design new protein structures. Starting from the desired secondary structure together with user-defined packing restraints, such as residue-residue contact/distance restraints, the query is searched through a non-redundant PDB structure library using gapless threading to generate position-specific fragment structures. High scoring fragments, which may range from 1-20 residues long, are identified based on the complementarity between the desired secondary structure and a fragment’s secondary structure and backbone torsion angles. Then during the folding simulations, the top scoring local fragments are assembled under the guidance of a sequence-independent energy function, which accounts for fundamental rules that govern protein folding such as secondary structure packing, backbone hydrogen bonding, favorable backbone torsion angles, steric clashes, radius of gyration, as well as the artificial contact/distance restraints supplied by the user. As the method is sequence independent, generic side-chain centers of mass, typically those for valine, are used to evaluate energy terms such as steric clashes. Following the folding simulations, the final design may be selected based on clustering of the simulation decoys, by selecting the lowest energy structure, or through whatever filter the user deems appropriate.
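The gapless threading step described in this caption can be sketched as a sliding-window comparison. The example below is a simplification under stated assumptions: it scores only secondary structure agreement (the caption also mentions backbone torsion angles), the library is a hypothetical dictionary of secondary structure strings, and the function name is illustrative.

```python
def gapless_fragments(target_ss, start, frag_len, library, top_n=3):
    """Rank library fragments for one position by secondary structure match.

    target_ss : desired secondary structure string, e.g. 'CCHHHHCC'
    start     : first residue of the window to fill
    frag_len  : fragment length (the caption mentions 1-20 residues)
    library   : dict mapping a template name to its secondary structure string
    """
    window = target_ss[start:start + frag_len]
    hits = []
    for name, ss in library.items():
        # Slide the window along the template without allowing gaps.
        for offset in range(len(ss) - frag_len + 1):
            segment = ss[offset:offset + frag_len]
            score = sum(a == b for a, b in zip(window, segment))
            hits.append((score, name, offset))
    # Highest score first; ties broken by template name and offset.
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits[:top_n]


# Hypothetical two-template library.
toy_library = {"a": "CCHHHHCC", "b": "EEEECCCC"}
best_hits = gapless_fragments("HHHH", start=0, frag_len=4, library=toy_library)
```

In a full protocol the coordinates of each top-scoring segment would be stored as a position-specific fragment for the assembly simulation.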
Fig 4.
A protocol for evolution-based protein-protein interaction design used by EvoDesign. The procedure starts from an input complex, for which monomer/interface structural homologs are identified from the PDB library through TM-align and iAlign searches, respectively. Structural profiles are then constructed from the alignments of the monomer/interface analogs and used in conjunction with a physics-based potential, EvoEF2, to guide the REMC simulations to design novel protein sequences. The final designs are selected from the center of the largest cluster of designed sequence decoys.
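The REMC (replica-exchange Monte Carlo) search named in this caption can be illustrated with a toy sketch. It uses the standard Metropolis criterion within each replica and the standard swap probability min(1, exp((1/T_i - 1/T_j)(E_i - E_j))) between neighboring temperatures. The energy function, temperatures, alphabet, and all function names below are assumptions for illustration; this is not the EvoDesign implementation, whose energy combines structural profiles with EvoEF2.

```python
import math
import random


def swap_probability(e_i, e_j, t_i, t_j):
    """Probability of exchanging replicas i and j (temperatures in energy units)."""
    delta = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def remc(energy, init, alphabet, temps, n_sweeps, seed=0):
    """Toy REMC over sequences; returns (best_energy, best_sequence)."""
    rng = random.Random(seed)
    states = [list(init) for _ in temps]
    energies = [energy(s) for s in states]
    best = (energies[0], "".join(states[0]))
    for _ in range(n_sweeps):
        # One single-site mutation attempt per replica (Metropolis criterion).
        for k, t in enumerate(temps):
            pos = rng.randrange(len(init))
            old = states[k][pos]
            states[k][pos] = rng.choice(alphabet)
            e_new = energy(states[k])
            d = e_new - energies[k]
            if d <= 0 or rng.random() < math.exp(-d / t):
                energies[k] = e_new
                if e_new < best[0]:
                    best = (e_new, "".join(states[k]))
            else:
                states[k][pos] = old  # reject: restore the previous residue
        # Attempt swaps between neighboring temperatures.
        for k in range(len(temps) - 1):
            p = swap_probability(energies[k], energies[k + 1], temps[k], temps[k + 1])
            if rng.random() < p:
                states[k], states[k + 1] = states[k + 1], states[k]
                energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return best


def toy_energy(seq):
    """Hypothetical energy: number of mismatches to an arbitrary target."""
    return sum(a != b for a, b in zip(seq, "ACDE"))


best_e, best_seq = remc(toy_energy, "WWWW", "ACDEW", temps=[0.5, 1.0, 2.0], n_sweeps=200)
```

The final designs would then be chosen by clustering the sampled sequences, as the caption describes, rather than by simply taking the single best-scoring sequence as this toy does.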
Fig 5.
Protein folds designed de novo starting from 9 unique secondary structures. The designed folds and corresponding wild-type native proteins (with denoted PDB IDs) whose secondary structures were used as input are shown side-by-side for (A) 3 β proteins, (B) 3 α/β and α+β proteins, and (C) 3 α proteins. Even in the absence of pre-defined packing rules, such as inter-residue distance restraints, the designed new folds have well-packed topologies with lower or comparable Rosetta and EvoEF2 energies.
