Proc Natl Acad Sci U S A. 2021 Mar 16;118(11):e2017228118.

doi: 10.1073/pnas.2017228118.

Protein sequence design by conformational landscape optimization

Christoffer Norn^{1,2}, Basile I M Wicky^{1,2}, David Juergens^{1,2,3}, Sirui Liu⁴, David Kim^{1,2}, Doug Tischer^{1,2}, Brian Koepnick^{1,2}, Ivan Anishchenko^{1,2}; Foldit Players; David Baker^{5,2,6}, Sergey Ovchinnikov^{7,8}

Collaborators

Foldit Players:
Alan Coral, Alex J Bubar, Alexander Boykov, Alexander Uriel Valle Pérez, Alison MacMillan, Allen Lubow, Andrea Mussini, Andrew Cai, Andrew John Ardill, Aniruddha Seal, Artak Kalantarian, Barbara Failer, Belinda Lackersteen, Benjamin Chagot, Beverly R Haight, Bora Taştan, Boris Uitham, Brandon G Roy, Breno Renan de Melo Cruz, Brian Echols, Brian Edward Lorenz, Bruce Blair, Bruno Kestemont, C D Eastlake, Callen Joseph Bragdon, Carl Vardeman, Carlo Salerno, Casey Comisky, Catherine Louise Hayman, Catherine R Landers, Cathy Zimov, Charles David Coleman, Charles Robert Painter, Christopher Ince Jr, Conor Lynagh, Dmitrii Malaniia, Douglas Craig Wheeler, Douglas Robertson Dr, Vera Simon, Emanuele Chisari, Eric Lim Jit Kai, Farah Rezae, Ferenc Lengyel, Flavian Tabotta, Franco Padelletti, Frisno Boström, Gary O Gross, George McIlvaine, Gil Beecher, Gregory T Hansen, Guido de Jong, Harald Feldmann, Jami Lynne Borman, Jamie Quinn, Jane Norrgard, Jason Truong, Jasper A Diderich, Jeffrey Michael Canfield, Jeffrey Photakis, Jesse David Slone, Joanna Madzio, Joanne Mitchell, John Charles Stomieroski, John H Mitch, Johnathan Robert Altenbeck, Jonas Schinkler, Jonathan Barak Weinberg, Joshua David Burbach, João Carlos Sequeira da Costa, Juan Francisco Bada Juarez, Jón Pétur Gunnarsson, Kathleen Diane Harper, Keehyoung Joo, Keith T Clayton, Kenneth E DeFord, Kevin F Scully, Kevin M Gildea, Kirk J Abbey, Kristen Lee Kohli, Kyle Stenner, Kálmán Takács, LaVerne L Poussaint, Larry C Manalo Jr, Larry C Withers, Lilium Carlson, Linda Wei, Luke Ryan Fisher, Lynn Carpenter, Ma Ji-Hwan, Manuel Ricci, Marcus Anthony Belcastro, Marek Leniec, Marie Hohmann, Mark Thompson, Matthew A Thayer, Matthias Gaebel, Michael D Cassidy, Michael Fagiola, Michael Lewis, Michael Pfützenreuter, Michael Simon, Moamen M Elmassry, Noah Benevides, Norah Kathleen Kerr, Nupur Verma, Oak Shannon, Owen Yin, Pascal Wolfteich, Paul Gummersall, Paweł Tłuścik, Peter Gajar, Peter John Triggiani 4th, Rajarshi Guha, Renton Braden Mathew Innes, Ricky Buchanan, Robert Gamble, Robert Leduc, Robert Spearing, Rodrigo Luccas Corrêa Dos Santos Gomes, Roger D Estep, Ryan DeWitt, Ryan Moore, Scott G Shnider, Scott J Zaccanelli, Sergey Kuznetsov, Sergio Burillo-Sanz, Seán Mooney, Sidoruk Vasiliy, Slava S Butkovich, Spencer Bruce Hudson, Spencer Len Pote, Stephen Phillip Denne, Steven A Schwegmann, Sumanth Ratna, Susan C Kleinfelter, Thomas Bausewein, Thomas J George, Tobias Scherf de Almeida, Ulas Yeginer, Walter Barmettler, Warwick Robert Pulley, William Scott Wright, Willyanto, Wyatt Lansford, Xavier Hochart, Yoan Anthony Skander Gaiji, Yuriy Lagodich, Vivier Christian

Affiliations

¹ Department of Biochemistry, University of Washington, Seattle, WA 98105.
² Institute for Protein Design, University of Washington, Seattle, WA 98105.
³ Graduate Program in Molecular Engineering, University of Washington, Seattle, WA 98105.
⁴ Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138.
⁵ Department of Biochemistry, University of Washington, Seattle, WA 98105; dabaker@uw.edu so@fas.harvard.edu.
⁶ Howard Hughes Medical Institute, University of Washington, Seattle, WA 98105.
⁷ Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138; dabaker@uw.edu so@fas.harvard.edu.
⁸ John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA 02138.

PMID: 33712545
PMCID: PMC7980421
DOI: 10.1073/pnas.2017228118

Christoffer Norn et al. Proc Natl Acad Sci U S A. 2021.

Abstract

The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen's thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the desired structure is the lowest-energy state. As this calculation involves not only all possible amino acid sequences but also all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest-energy amino acid sequence for the desired structure, often checking by protein structure prediction in a second step that the desired structure is indeed the lowest-energy conformation for the designed sequence, and typically discarding a large fraction of designed sequences for which this is not the case. Here, we show that by backpropagating gradients through the transform-restrained Rosetta (trRosetta) structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures in a single calculation. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by conformational landscape optimization with the standard energy-based sequence design methodology in Rosetta and show that the former can result in energy landscapes with fewer alternative energy minima. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.
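
The contrast drawn in the abstract between the two design objectives can be written compactly. The following is only a sketch in standard statistical-mechanics notation; the symbols $E$ (energy), $s$ (a conformation), $s^{*}$ (the target structure), $a$ (an amino acid sequence), and $k_{B}T$ are notational assumptions introduced here, not taken from the paper.

```latex
% Boltzmann probability of the target structure s^* given sequence a
P(s^{*} \mid a) = \frac{e^{-E(s^{*}, a)/k_{B}T}}{\sum_{s} e^{-E(s, a)/k_{B}T}}

% Standard fixed-backbone design: only the energy of the target state is minimized
a_{\mathrm{energy}} = \arg\min_{a} E(s^{*}, a)

% Conformational landscape optimization: the target state is favored over all others
a_{\mathrm{landscape}} = \arg\max_{a} P(s^{*} \mid a)
```

The first objective ignores the denominator, which is why a sequence that scores well on it can still have low-energy alternative states.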

Keywords: energy landscape; machine learning; protein design; sequence optimization; stability prediction.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interest.

Figures

**Fig. 1.**
Protein sequence design. (A) The goal of fixed backbone protein design is to find a sequence that best specifies the desired structure (P). Traditional energy-based methods have approached the problem heuristically, focusing solely on minimizing the energy of the target conformation in the hope that any stable alternative conformation is unlikely to arise by chance. However, this narrow focus on a single desired structure can produce solutions with low-energy alternative states, as suggested in the energy landscape of sequence α. An ideal method would instead find the sequence that maximizes the probability of the desired structure over all other states. Such a method would select sequence β. (B) Overview of trRosetta fixed backbone sequence design method. Starting with a random matrix of sequence length by number of amino acids (logits), the maximum value at each position is taken to generate a sequence, which is fed into the trRosetta model. The output is the predicted distribution of distances, angles, and dihedrals for every pair of residues (here, we only show distances). The loss is defined as the difference between the target and prediction, and the gradient is computed to minimize the loss. After normalization, the gradient is applied to the logits, and the process is repeated until convergence.
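
The loop described in panel B can be illustrated with a toy example. This is a sketch, not the authors' implementation: `toy_predict` is a hypothetical stand-in for the trRosetta network (it calls two residues "in contact" when both are hydrophobic), the squared-error loss and the L2 gradient normalization are assumptions, and the gradient with respect to the one-hot sequence is applied directly to the logits (a straight-through shortcut).

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("FILVMWY")  # hydrophobic set, chosen for this toy model


def toy_predict(seq):
    """Hypothetical stand-in for trRosetta: contact iff both residues are hydrophobic."""
    n = len(seq)
    return [[1.0 if i != j and seq[i] in HYDROPHOBIC and seq[j] in HYDROPHOBIC else 0.0
             for j in range(n)] for i in range(n)]


def loss_and_grad(seq, target):
    """Squared error over residue pairs and its gradient w.r.t. a one-hot sequence."""
    n = len(seq)
    pred = toy_predict(seq)
    loss = sum((pred[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))
    grad = [[0.0] * len(AMINO_ACIDS) for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or seq[j] not in HYDROPHOBIC:
                continue  # pair (i, j) only responds to position i if j is hydrophobic
            err = pred[i][j] - target[i][j]
            for a, aa in enumerate(AMINO_ACIDS):
                if aa in HYDROPHOBIC:
                    grad[i][a] += 2.0 * err
    return loss, grad


def design(target, steps=200, lr=1.0, seed=0):
    """Random logits -> argmax sequence -> loss -> normalized gradient step, repeated."""
    rng = random.Random(seed)
    n = len(target)
    k = len(AMINO_ACIDS)
    logits = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    best_seq, best_loss = None, math.inf
    for _ in range(steps):
        # The forward pass sees the discrete argmax sequence, as in the caption.
        seq = "".join(AMINO_ACIDS[max(range(k), key=row.__getitem__)] for row in logits)
        loss, grad = loss_and_grad(seq, target)
        if loss < best_loss:
            best_seq, best_loss = seq, loss
        norm = math.sqrt(sum(g * g for row in grad for g in row))
        if norm == 0.0:
            break  # zero gradient: nothing left to update
        for i in range(n):
            for a in range(k):
                logits[i][a] -= lr * grad[i][a] / norm
    return best_seq, best_loss


# Toy target: positions 0, 2, and 5 form a mutually contacting "core".
core = {0, 2, 5}
target = [[1.0 if i != j and i in core and j in core else 0.0 for j in range(6)]
          for i in range(6)]
seq, loss = design(target)
print(seq, loss)
```

The function keeps the best sequence seen rather than the last one, because with a discrete argmax in the forward pass the loss is not guaranteed to decrease at every step.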

**Fig. 2.**
trRosetta predicts properties of the folding energy landscape. (A) trRosetta better predicts which designs will have high Boltzmann probabilities than Rosetta-based energy calculations, which only see the target conformation (classifications for P_near > 0.8, *AUC*_trRosetta = 0.81 vs. *AUC*_Rosetta = 0.65, n = 4,204 designs). (B) trRosetta correctly predicts dilution of probability for designs with multiple low-energy conformations (columns 2 to 4) compared with designs with a single global energy minimum (column 1). Structural decoys were binned, and the mean trRosetta score corrected for background $(-\log P_{c}(\mathrm{struct.} \mid \mathrm{seq.}))$ is represented by the color gradient from dark blue (high probability) to red (low probability). (C) Structures of the lowest-energy representatives (indicated by circles on the energy landscape). The designed structures and alternative states are shown on the left and right, respectively, of each column. (D) Selected examples of probability distributions (*Cβ–Cβ* distance prediction) for specific i,j pairs (numbering indicated on the top) demonstrating bimodality. The actual distances observed in the designed and alternative structures are indicated by vertical lines (blue and red, respectively) and shown as spheres on the corresponding structures. More examples are shown in *SI Appendix*, Fig. S4, and an analysis of the prediction of bimodality from distributions of individual i,j pairs can be found in *SI Appendix*, Fig. S5.
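
The caption uses P_near without defining it. The sketch below uses one commonly used definition, a Boltzmann-weighted measure of how much of the landscape lies close to the design; the exact formula, the default `lam` and `kT` values, and the decoy numbers are assumptions for illustration and are not taken from this page.

```python
import math


def p_near(energies, rmsds, lam=2.0, kT=1.0):
    """Boltzmann-weighted fraction of decoys near the designed structure.

    energies: decoy energies (arbitrary units, same units as kT)
    rmsds:    decoy RMSD to the design, in angstroms
    lam:      RMSD scale below which a decoy counts as "near" (assumed value)
    """
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    nearness = [math.exp(-(r / lam) ** 2) for r in rmsds]
    return sum(w * c for w, c in zip(weights, nearness)) / sum(weights)


# Invented decoy sets: one funneled landscape, one with a competing far-away minimum.
funneled = p_near([-100.0, -90.0, -85.0], [0.5, 6.0, 8.0])
two_state = p_near([-100.0, -100.0, -85.0], [0.5, 7.0, 8.0])
print(round(funneled, 2), round(two_state, 2))
```

Under this definition a second minimum with the same energy as the design roughly halves P_near, which is the "dilution of probability" that panel B describes.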

**Fig. 3.**
(A and B) trRosetta predicts scaffold designability and experimental success. (A) Across different topologies, trRosetta predictions are better correlated with experimental protease stability—a measure of folding success—than Rosetta energy (R²_trRosetta = 0.79 vs. R²_Rosetta = 0.00, P value < 0.0001) (*SI Appendix*, *Methods*). Data points are the topology-specific mean values, and error bars represent SDs (eight topologies, N_tot = 30,159 designs). Without topological averaging, the correlation decreases (R²_trRosetta = 0.20 vs. R²_Rosetta = 0.03, P value < 0.0001) because intratopological differences are not well captured by trRosetta (*SI Appendix*, Fig. S6 has details). (B) trRosetta is significantly better at discriminating experimental success (expression, nonaggregation, and having correct secondary structure content) than Rosetta energy (*AUC*_trRosetta = 0.81 vs. *AUC*_Rosetta = 0.64). Data from 145 Foldit-generated designs (16). (C and D) Designing with a hybrid trRosetta–Rosetta protocol disfavors off-target states. (C) Examples of energy landscapes for two Foldit-generated backbones, each designed with trRosetta, Rosetta, and the trRosetta–Rosetta hybrid protocol. (D) The hybrid protocol improves the quality of the resulting energy landscapes, as determined by the P_near quantity. trRosetta on its own also improves funnels but only superficially (better performance than Rosetta in the lower P_near regime). It does not, however, generate a deep minimum in the vicinity of the designed state (poorer performance than Rosetta in the high P_near regime). (E) The local sequence–structure relationship is idealized in trRosetta designs compared with both native proteins and Rosetta designs. The native structures that were used for redesign were at most 30% sequence identical to any protein in the trRosetta training dataset. 
Local sequence–structure agreements were measured as the average *RMSD* between the designed structure and nine-residue fragments from the PDB that were selected based on the sequence of the design. (F) For the same set of native backbones, trRosetta redesigns have a more native-like distribution of hydrophobic residues (F, I, L, V, M, W, Y) on the protein surface than Rosetta redesigns. The degree of burial was assessed with the software DEPTH (33), which computes the distance in ångströms between each residue and bulk solvent. A breakdown by amino acid is in *SI Appendix*, Fig. S13.
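
The AUC values quoted in Figs. 2A and 3B summarize how well a score separates successful from unsuccessful designs. As a minimal sketch of that metric (the pairwise-rank formulation below is a standard way to compute it, not necessarily the one the authors used, and the scores and labels are invented):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve as the probability that a randomly chosen
    positive example scores higher than a randomly chosen negative one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count each (positive, negative) pair; ties contribute one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = design succeeded experimentally, 0 = it did not.
labels = [1, 1, 0, 0]
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], labels)
partial = roc_auc([0.9, 0.2, 0.8, 0.1], labels)
print(perfect, partial)
```

An AUC of 0.5 corresponds to a score that ranks successes and failures no better than chance, which gives a baseline for reading the reported values.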
