Proc Natl Acad Sci U S A. 2021 Mar 16;118(11):e2017228118.
doi: 10.1073/pnas.2017228118.

Protein sequence design by conformational landscape optimization

Christoffer Norn et al. Proc Natl Acad Sci U S A.

Abstract

The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen's thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the desired structure is the lowest energy state. As this calculation involves not only all possible amino acid sequences but also all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest-energy amino acid sequence for the desired structure, often checking by protein structure prediction in a second step that the desired structure is indeed the lowest-energy conformation for the designed sequence, and typically discarding a large fraction of designed sequences for which this is not the case. Here, we show that by backpropagating gradients through the transform-restrained Rosetta (trRosetta) structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures in a single calculation. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by conformational landscape optimization with the standard energy-based sequence design methodology in Rosetta and show that the former can result in energy landscapes with fewer alternative energy minima. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.
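
The contrast drawn in the abstract can be stated compactly. The notation below is an illustrative assumption and is not taken from the paper: s is an amino acid sequence, x a conformation, x* the desired structure, E(x; s) an energy function, and k_B T the thermal energy. Energy-based design minimizes the energy of the target alone, whereas landscape optimization maximizes the Boltzmann probability of the target relative to every other conformation.

```latex
\begin{align}
\text{energy-based design:} \quad
  & s^{\ast} = \arg\min_{s} \, E(x^{\ast}; s) \\
\text{landscape optimization:} \quad
  & s^{\ast} = \arg\max_{s} \, P(x^{\ast} \mid s),
  \qquad
  P(x^{\ast} \mid s) =
  \frac{e^{-E(x^{\ast}; s)/k_B T}}{\sum_{x} e^{-E(x; s)/k_B T}}
\end{align}
```

The sum over x in the denominator is what makes the second problem hard to evaluate explicitly. As the abstract describes it, the trRosetta network stands in for this term by predicting a distribution over structures for a given sequence.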

Keywords: energy landscape; machine learning; protein design; sequence optimization; stability prediction.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Protein sequence design. (A) The goal of fixed backbone protein design is to find a sequence that best specifies the desired structure (P). Traditional energy-based methods have approached the problem heuristically, focusing solely on minimizing the energy of the target conformation in the hope that any stable alternative conformation is unlikely to arise by chance. However, this narrow focus on a single desired structure can produce solutions with low-energy alternative states, as suggested in the energy landscape of sequence α. An ideal method would instead find the sequence that maximizes the probability of the desired structure over all other states. Such a method would select sequence β. (B) Overview of trRosetta fixed backbone sequence design method. Starting with a random matrix of sequence length by number of amino acids (logits), the maximum value at each position is taken to generate a sequence, which is fed into the trRosetta model. The output is the predicted distribution of distances, angles, and dihedrals for every pair of residues (here, we only show distances). The loss is defined as the difference between the target and prediction, and the gradient is computed to minimize the loss. After normalization, the gradient is applied to the logits, and the process is repeated until convergence.
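
The loop in panel B can be sketched in a few lines of NumPy. This is a minimal sketch under loud assumptions, not the authors' implementation: `toy_predict` is a hypothetical stand-in for the trRosetta network (a random pairwise bilinear model followed by a softmax over distance bins), the loss is taken to be a cross-entropy between target and predicted distributions, the argmax is treated as the identity when passing the gradient back to the logits, normalization is assumed to be division by the gradient's L2 norm, and a fixed number of steps replaces a convergence test.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_argmax(logits):
    """Take the maximum at each position to obtain a one-hot sequence."""
    seq = np.zeros_like(logits)
    seq[np.arange(logits.shape[0]), logits.argmax(axis=1)] = 1.0
    return seq


def toy_predict(seq, weights):
    """Hypothetical stand-in for trRosetta: per-pair distribution over bins."""
    z = np.einsum("ia,jc,acb->ijb", seq, seq, weights)
    z -= z.max(axis=-1, keepdims=True)  # numerical stability for the softmax
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)


def loss_and_grad(seq, weights, target):
    """Cross-entropy between target and prediction, and its gradient w.r.t. seq."""
    pred = toy_predict(seq, weights)
    loss = -np.mean(np.sum(target * np.log(pred + 1e-9), axis=-1))
    # Gradient of softmax cross-entropy with respect to the pre-softmax scores.
    g = (pred - target) / (seq.shape[0] ** 2)
    # Chain rule through the bilinear model (seq appears as both i and j).
    grad = (np.einsum("kjb,jc,acb->ka", g, seq, weights)
            + np.einsum("ikb,ic,cab->ka", g, seq, weights))
    return loss, grad


def design(target, weights, steps=200, lr=1.0, seed=0):
    """Iterate: logits -> argmax sequence -> prediction -> loss -> gradient step."""
    rng = np.random.default_rng(seed)
    length, n_aa = target.shape[0], weights.shape[0]
    logits = rng.normal(scale=0.01, size=(length, n_aa))  # random starting logits
    best_seq, history = None, []
    for _ in range(steps):
        seq = one_hot_argmax(logits)
        loss, grad = loss_and_grad(seq, weights, target)
        if not history or loss < min(history):
            best_seq = seq
        history.append(loss)
        grad /= np.linalg.norm(grad) + 1e-9  # assumed normalization: unit L2 norm
        logits -= lr * grad  # gradient w.r.t. the one-hot input applied to logits
    letters = "".join(AMINO_ACIDS[i] for i in best_seq.argmax(axis=1))
    return letters, history


# Toy usage: the target comes from a hidden sequence passed through the same toy model.
rng = np.random.default_rng(1)
weights = rng.normal(size=(20, 20, 8))
hidden = one_hot_argmax(rng.normal(size=(12, 20)))
target = toy_predict(hidden, weights)
designed, history = design(target, weights)
print(designed, history[0], min(history))
```

Because the forward pass sees only the discrete argmax sequence, the loss is not guaranteed to fall at every step; the sketch therefore keeps the best sequence seen so far rather than the last one.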
Fig. 2.
trRosetta predicts properties of the folding energy landscape. (A) trRosetta better predicts which designs will have high Boltzmann probabilities than Rosetta-based energy calculations, which only see the target conformation (classifications for P_near > 0.8, AUC_trRosetta = 0.81 vs. AUC_Rosetta = 0.65, n = 4,204 designs). (B) trRosetta correctly predicts dilution of probability for designs with multiple low-energy conformations (columns 2 to 4) compared with designs with a single global energy minimum (column 1). Structural decoys were binned, and the mean trRosetta score corrected for background (log P_c(struct.|seq.)) is represented by the color gradient from dark blue (high probability) to red (low probability). (C) Structures of the lowest-energy representatives (indicated by circles on the energy landscape). The designed structures and alternative states are shown on the left and right, respectively, of each column. (D) Selected examples of probability distributions (Cβ–Cβ distance prediction) for specific i,j pairs (numbering indicated on the top) demonstrating bimodality. The actual distances observed in the designed and alternative structures are indicated by vertical lines (blue and red, respectively) and shown as spheres on the corresponding structures. More examples are shown in SI Appendix, Fig. S4, and an analysis of the prediction of bimodality from distributions of individual i,j pairs can be found in SI Appendix, Fig. S5.
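
The two quantities in panel A can be illustrated with a short sketch. The caption does not define P_near, so the form below is an assumption: the commonly used Boltzmann-weighted measure in which each decoy's weight exp(-E/kT) is multiplied by a Gaussian "nearness" term exp(-(RMSD/λ)²). The values of `lam` and `kT`, the decoy energies, and the scores are all illustrative and are not taken from the paper. The AUC is computed as the fraction of (positive, negative) pairs that the score ranks correctly.

```python
import math


def pnear(energies, rmsds, lam=2.0, kT=0.62):
    """Boltzmann-weighted fraction of the landscape lying near the target.

    Assumed form: sum(exp(-(rmsd/lam)^2) * exp(-E/kT)) / sum(exp(-E/kT)).
    lam (in angstroms) and kT (in energy units) are illustrative defaults.
    """
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    near = [math.exp(-((r / lam) ** 2)) for r in rmsds]
    return sum(w * n for w, n in zip(weights, near)) / sum(weights)


def auc(scores, labels):
    """Probability that a positive example outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy landscapes: one deep minimum near the target vs. two equally deep minima.
funneled = pnear(energies=[-10.0, -5.0, -4.0], rmsds=[0.5, 6.0, 8.0])
two_state = pnear(energies=[-10.0, -10.0, -4.0], rmsds=[0.5, 6.0, 8.0])
print(funneled, two_state)

# Toy classification of designs by the threshold used in the caption (P_near > 0.8).
labels = [p > 0.8 for p in (funneled, two_state)]
print(auc(scores=[0.9, 0.2], labels=labels))
```

The second landscape shows the dilution described in panel B: an alternative state with the same energy as the target takes roughly half of the Boltzmann weight, so the assumed P_near drops below 0.5 even though the target energy is unchanged.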
Fig. 3.
(A and B) trRosetta predicts scaffold designability and experimental success. (A) Across different topologies, trRosetta predictions are better correlated with experimental protease stability, a measure of folding success, than Rosetta energy (R²_trRosetta = 0.79 vs. R²_Rosetta = 0.00, P value < 0.0001) (SI Appendix, Methods). Data points are the topology-specific mean values, and error bars represent SDs (eight topologies, N_tot = 30,159 designs). Without topological averaging, the correlation decreases (R²_trRosetta = 0.20 vs. R²_Rosetta = 0.03, P value < 0.0001) because intratopological differences are not well captured by trRosetta (SI Appendix, Fig. S6 has details). (B) trRosetta is significantly better at discriminating experimental success (expression, nonaggregation, and having correct secondary structure content) than Rosetta energy (AUC_trRosetta = 0.81 vs. AUC_Rosetta = 0.64). Data from 145 Foldit-generated designs (16). (C and D) Designing with a hybrid trRosetta–Rosetta protocol disfavors off-target states. (C) Examples of energy landscapes for two Foldit-generated backbones, each designed with trRosetta, Rosetta, and the trRosetta–Rosetta hybrid protocol. (D) The hybrid protocol improves the quality of the resulting energy landscapes, as determined by the P_near quantity. trRosetta on its own also improves funnels but only superficially (better performance than Rosetta in the lower P_near regime). It does not, however, generate a deep minimum in the vicinity of the designed state (poorer performance than Rosetta in the high P_near regime). (E) The local sequence–structure relationship is idealized in trRosetta designs compared with both native proteins and Rosetta designs. The native structures that were used for redesign were at most 30% sequence identical to any protein in the trRosetta training dataset. Local sequence–structure agreements were measured as the average RMSD between the designed structure and nine-residue fragments from the PDB that were selected based on the sequence of the design. (F) For the same set of native backbones, trRosetta redesigns have a more native-like distribution of hydrophobic residues (F, I, L, V, M, W, Y) on the protein surface than Rosetta redesigns. The degree of burial was assessed with the software DEPTH (33), which computes the distance in ångströms between each residue and bulk solvent. A breakdown by amino acid is in SI Appendix, Fig. S13.
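
The effect of topological averaging in panel A can be illustrated with a small sketch. The records below are made-up toy numbers, not data from the paper, and the topology labels and the choice of squared Pearson correlation for R² are assumptions. The point is only that a score can track between-topology differences well while missing within-topology differences, which lowers the correlation computed over individual designs.

```python
from collections import defaultdict
from statistics import mean


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def topology_means(records):
    """Average (score, stability) within each topology."""
    groups = defaultdict(list)
    for topology, score, stability in records:
        groups[topology].append((score, stability))
    return {
        t: (mean(s for s, _ in pairs), mean(y for _, y in pairs))
        for t, pairs in groups.items()
    }


# Hypothetical records: (topology, predicted score, measured stability).
records = [
    ("HHH", 1.0, 2.0), ("HHH", 3.0, 0.0),
    ("EHEE", 3.0, 4.0), ("EHEE", 5.0, 2.0),
    ("HEEH", 5.0, 6.0), ("HEEH", 7.0, 4.0),
]

means = topology_means(records)
r2_means = r_squared([m[0] for m in means.values()], [m[1] for m in means.values()])
r2_all = r_squared([r[1] for r in records], [r[2] for r in records])
print(r2_means, r2_all)
```

In this toy set the topology means lie on a straight line, while within each topology the score and the stability move in opposite directions, so the per-design correlation is much weaker than the correlation of the means.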

References

    1. Jones D. T., De novo protein design using pairwise potentials and a genetic algorithm. Protein Sci. 3, 567–574 (1994).
    2. Kuhlman B., et al., Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
    3. Dahiyat B. I., Mayo S. L., De novo protein design: Fully automated sequence selection. Science 278, 82–87 (1997).
    4. Ingraham J., Garg V., Barzilay R., Jaakkola T., Generative models for graph-based protein design. NeurIPS Proc. 32, 15820–15831 (2019).
    5. Greener J. G., Moffat L., Jones D. T., Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 16189 (2018).
