Nat Biotechnol. 2025 Aug;43(8):1288-1298.

doi: 10.1038/s41587-024-02395-w. Epub 2024 Sep 25.

Multistate and functional protein design using RoseTTAFold sequence space diffusion

Sidney Lyayuga Lisanza^# ¹ ² ³, Jacob Merle Gershon^# ¹ ² ⁴, Samuel W K Tipps^# ¹ ², Jeremiah Nelson Sims^# ² ⁵, Lucas Arnoldt^# ¹ ² ⁶, Samuel J Hendel¹ ², Miriam K Simma⁷, Ge Liu¹ ², Muna Yase¹ ² ⁴, Hongwei Wu⁷, Claire D Tharp⁷, Xinting Li¹ ², Alex Kang², Evans Brackenbrough², Asim K Bera², Stacey Gerben², Bruce J Wittmann⁸, Andrew C McShan⁷, David Baker⁹ ¹⁰ ¹¹

Affiliations

¹ Department of Biochemistry, University of Washington, Seattle, WA, USA.
² Institute for Protein Design, University of Washington, Seattle, WA, USA.
³ Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, WA, USA.
⁴ Department of Molecular Engineering, University of Washington, Seattle, WA, USA.
⁵ Molecular & Cellular Biology, Medical Scientist Training Program, University of Washington, Seattle, WA, USA.
⁶ Faculty of Engineering Sciences, Heidelberg University, Heidelberg, Germany.
⁷ School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, USA.
⁸ Office of the Chief Scientific Officer, Microsoft, Redmond, WA, USA.
⁹ Department of Biochemistry, University of Washington, Seattle, WA, USA. dabaker@uw.edu.
¹⁰ Institute for Protein Design, University of Washington, Seattle, WA, USA. dabaker@uw.edu.
¹¹ Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA. dabaker@uw.edu.

^# Contributed equally.

PMID: 39322764
PMCID: PMC12339374
DOI: 10.1038/s41587-024-02395-w

Sidney Lyayuga Lisanza et al. Nat Biotechnol. 2025 Aug.

Erratum in

Publisher Correction: Multistate and functional protein design using RoseTTAFold sequence space diffusion.
Lisanza SL, Gershon JM, Tipps SWK, Sims JN, Arnoldt L, Hendel SJ, Simma MK, Liu G, Yase M, Wu H, Tharp CD, Li X, Kang A, Brackenbrough E, Bera AK, Gerben S, Wittmann BJ, McShan AC, Baker D. Nat Biotechnol. 2025 Aug;43(8):1384. doi: 10.1038/s41587-024-02456-0. PMID: 39375454. Free PMC article. No abstract available.

Abstract

Protein denoising diffusion probabilistic models are used for the de novo generation of protein backbones but are limited in their ability to guide generation of proteins with sequence-specific attributes and functional properties. To overcome this limitation, we developed ProteinGenerator (PG), a sequence space diffusion model based on RoseTTAFold that simultaneously generates protein sequences and structures. Beginning from a noised sequence representation, PG generates sequence and structure pairs by iterative denoising, guided by desired sequence and structural protein attributes. We designed thermostable proteins with varying amino acid compositions and internal sequence repeats and cage bioactive peptides, such as melittin. By averaging sequence logits between diffusion trajectories with distinct structural constraints, we designed multistate parent-child protein triples in which the same sequence folds to different supersecondary structures when intact in the parent versus split into two child domains. PG design trajectories can be guided by experimental sequence-activity data, providing a general approach for integrated computational and experimental optimization of protein function.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Overview of PG.**
a, Comparison of diffusion in sequence and structure space. PG and RFdiffusion take as input noised sequence (PG) or structure (RFdiffusion) data and problem-specific sequence and structure constraints. At each denoising step, the RoseTTAFold architecture generates complete protein sequences and structures, and this is used to generate the next step in the trajectory in sequence (PG) or structure (RFdiffusion) space. Although specific structural or sequence features can be fixed in the input to RoseTTAFold in both approaches, biases toward particular sequence features during the diffusion update at each step are more readily incorporated in PG (as are biases toward structural features, such as symmetry, in RFdiffusion). b, Schematic of PG inference trajectory. At each step in the diffusion process, the sequence x₀ is predicted from sequence x_t by RF conditioned on any desired structural information, combined with any desired sequence bias, and noised to generate x_{t−1}. This process is repeated for T steps as the sequence–structure pair converges on a high-confidence solution shaped by the structural and sequence guidance information. c, Iterative design schematic demonstrating how PG can be used in an experimental feedback loop. Designs generated by the model are evaluated for activity; a surrogate function approximating sequence-to-function relationships is fit; and gradients from the surrogate can then be used to guide PG toward active design space. d, In silico demonstration of iterative design using the GB1 fitness landscape for binding and comparison with Bayesian optimization (BO). In round 0, not shown in the plot, 96 designs are generated with PG without guidance, and a surrogate function is trained to discriminate high- and low-activity designs. In rounds 1–3, gradient-based guidance is used to generate 96 designs for each method; a surrogate function is fit; and the process is repeated. Line plots show maximum activity sampled, and box plots show the distribution sampled over the batch of 96. Mean activities in each round differ significantly between the two populations (P < 0.05, two-sided Mann–Whitney U-test, n = 96 designs per round). Box plot boundaries indicate upper and lower quartiles, and whiskers indicate the nearest quartile + 1.5× interquartile range. seq, sequence; str, structure.
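The update loop described in panel b (predict x₀ from x_t, add any sequence bias, re-noise to x_{t−1}) can be illustrated with a small toy. This is only a sketch under stated assumptions, not the PG implementation: `toy_predict_x0` is a hypothetical stand-in for the RoseTTAFold prediction that simply returns fixed scores, `signal_fraction` is an assumed linear noise schedule, and the sequence "MKTAYIAK" is an arbitrary placeholder.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues


def toy_predict_x0(x_t, hidden_target="MKTAYIAK"):
    """HYPOTHETICAL stand-in for the RF prediction of x0 from x_t.

    A real network would read x_t (plus structural conditioning); this toy
    ignores it and returns fixed per-residue scores for one sequence.
    """
    return [[1.0 if aa == want else -1.0 for aa in AMINO_ACIDS]
            for want in hidden_target]


def signal_fraction(t, num_steps):
    """Assumed linear schedule: 1.0 at t = 0 (clean), 0.0 at t = T (noise)."""
    return 1.0 - t / num_steps


def run_trajectory(length, num_steps, predict_x0, bias=None, seed=0):
    rng = random.Random(seed)
    # x_T: pure Gaussian noise over a (length x 20) sequence representation.
    x_t = [[rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS] for _ in range(length)]
    for t in range(num_steps, 0, -1):
        x0_hat = predict_x0(x_t)                  # predict x0 from x_t
        if bias is not None:                      # optional sequence bias
            x0_hat = [[v + b for v, b in zip(row, bias)] for row in x0_hat]
        keep = signal_fraction(t - 1, num_steps)  # signal level of x_{t-1}
        x_t = [[math.sqrt(keep) * v
                + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
                for v in row] for row in x0_hat]
    # At t = 0 no noise remains; read out the top-scoring residue per position.
    return "".join(AMINO_ACIDS[max(range(len(row)), key=row.__getitem__)]
                   for row in x_t)


unbiased = run_trajectory(8, 10, toy_predict_x0)
trp_bias = [5.0 if aa == "W" else 0.0 for aa in AMINO_ACIDS]
biased = run_trajectory(8, 10, toy_predict_x0, bias=trp_bias)
```

In this toy, the unbiased run recovers the placeholder sequence, while a strong additive bias on tryptophan overrides the predictor at every position. The only point being illustrated is where a bias enters the update.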

**Fig. 2. Design of proteins with specified sequence composition.**
a, Amino acid compositional bias schematic. b, Comparison of amino acid frequency in unconditional (gray) and amino acid biased (purple) generation; separate PG trajectories were carried out for each enriched amino acid. Error bars are standard deviation. Biased distributions are significantly different from unconditional amino acid frequencies (P < 0.05, two-sided Mann–Whitney U-test, n = 200 designs per amino acid). Box plot boundaries indicate upper and lower quartiles; whiskers indicate the nearest quartile + 1.5× interquartile range; and the center line is the median. c, Multidimensional scaling of native and amino acid biased sequences shows that they occupy distinct regions of sequence space. d, Hydropathy guidance. Biasing the sequence toward or away from hydrophobic amino acids results in a shifted distribution of hydropathy scores compared to unconditional generation (P < 0.05, two-sided Mann–Whitney U-test, n = 122 designs per condition). e, Experimental validation of cysteine biased designs (design in gray, AF2 in purple). Proteins are monomeric by SEC and alpha helical by CD at 25 °C and 95 °C. Mass spectrometry indicates the presence of the designed number of disulfide bonds. f, Experimental validation of tryptophan biased designs (design in gray, AF2 in purple). Designs are monomeric by SEC, have considerably higher absorbance at 280 nm than unconditional designs and are alpha helical by CD. g, Experimental validation of histidine and methionine biased designs (design in gray, AF2 in purple). h, Experimental validation of valine biased designs (design in gray, AF2 in purple). Valines highlighted in pink on the designs are present in the beta-fold secondary structure. CD traces and melt curves at 222 nm are to the right of the designs. aa, amino acid.
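One simple way to picture a compositional bias like the one in panels a and b is as a constant added to one residue's logit before normalisation. The sketch below assumes a plain softmax over 20 residues and a hypothetical `strength` parameter; the caption does not specify how PG applies its bias, so this is an illustration of the general idea only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_probabilities(logits, residue, strength):
    """Add `strength` (assumed, illustrative) to one residue's logit, then renormalise."""
    shifted = [v + (strength if aa == residue else 0.0)
               for aa, v in zip(AMINO_ACIDS, logits)]
    return dict(zip(AMINO_ACIDS, softmax(shifted)))


flat = [0.0] * len(AMINO_ACIDS)                       # no preference
p_cys_before = biased_probabilities(flat, "C", 0.0)["C"]
p_cys_after = biased_probabilities(flat, "C", 2.0)["C"]
```

With flat logits, cysteine starts at probability 1/20, and a bias of 2.0 raises it to e²/(19 + e²), roughly 0.28. Larger biases push the composition further toward the enriched residue.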

**Fig. 3. Design of sequence repeat proteins with PG.**
a, Symmetric sequence diffusion to design proteins with sequence symmetry. b, Experimental validation of sequence repeat proteins. Designs in gray are overlaid with AF2 predictions in purple, and asymmetric units are highlighted in pink. SEC and CD traces and melting curves demonstrate stability of these designs. c, 3.70-Å crystal structure of designed repeat protein: AF2 model in gray, crystal structure in purple and asymmetric unit in pink. Box on the right highlights the accuracy of designed side chains in the asymmetric unit.
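Sequence symmetry can be imposed on a per-position score matrix by tying equivalent positions across repeat units. The sketch below averages the scores of equivalent positions and tiles the result. Averaging is an assumption chosen for illustration, since the caption states only that symmetric sequence diffusion is used; the function name `symmetrize` and the three-letter toy alphabet are likewise hypothetical.

```python
def symmetrize(logits, n_repeats):
    """Average per-position scores across repeat units, then tile the unit.

    `logits` is a list of rows (one row of scores per sequence position).
    """
    length = len(logits)
    if n_repeats <= 0 or length % n_repeats != 0:
        raise ValueError("sequence length must be a multiple of n_repeats")
    unit = length // n_repeats
    averaged = []
    for i in range(unit):
        # Rows at i, i + unit, i + 2*unit, ... are equivalent positions.
        rows = [logits[i + k * unit] for k in range(n_repeats)]
        averaged.append([sum(col) / n_repeats for col in zip(*rows)])
    # Copy rows so the repeats do not share list objects.
    return [row[:] for _ in range(n_repeats) for row in averaged]


toy_logits = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 0.0],
]
tied = symmetrize(toy_logits, n_repeats=2)
```

After tying, positions 0 and 2 carry identical scores, as do positions 1 and 3, so any readout of the matrix yields an exact two-fold sequence repeat.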

**Fig. 4. Scaffolding bioactive peptides and intrinsic barcodes with PG.**
a, Schematic overview of functional peptide scaffolding for downstream tasks such as protease cleavage for lysis and peptide barcoding. b, Sequence-only motif scaffolding and secondary structure conditioning to generate proteins with embedded functional sequences. Cleavage sites can be specified at the N or C terminus of the peptide to allow for protease cleavage. c, In silico design metrics for sequence-only bioactive peptide scaffolding. RMSD of AF2 predictions to designs on the top and AF2 pLDDT of designs on the bottom. Box plot boundaries indicate upper and lower quartiles; whiskers indicate the nearest quartile + 1.5× interquartile range; and the center line is the median. n = 2,000 designs per condition. d, Mass spec peptide barcoding assay. Scaffolding barcodes with PG results in soluble and monomeric designs by SEC. SEC traces for individual designs are in gray. When the same designs are expressed in a pooled library (black), and fractions are digested with trypsin, analytical mass spectrometry of each fraction is able to recapitulate the SEC trace shown in purple. e, Melittin scaffolded designs with furin cleavage site. Designs are shown in gray, and AF2-predicted structures are shown in purple, with melittin peptide highlighted in pink. Designs are soluble and monomeric by SEC and folded with helical secondary structure by CD. f, Melittin scaffolded design D12. D12 design model is in gray; AF2-predicted structure is overlaid in purple for scaffold; cyan is for the cleavage site; and pink is for melittin. SEC fraction of monomeric D12 used for downstream assays is highlighted with the purple bar. CD trace of D12 is consistent with the designed helical secondary structure. g, Representative SDS-PAGE of uncleaved D12 (18 kD), cleaved D12 (15 kD) and melittin peptide (3 kD) (n = 3 biological replicates). h, Mass spec of the cleavage reaction products confirms the presence of uncleaved D12, cleaved D12 and melittin. Melittin mass was calculated with an additional C-terminal ‘GS’ due to the expression vector used. i, Absorbance at 450 nm for six technical replicates of washed RBCs after incubation with design with and without furin protease. Positive controls Triton X-100 and melittin are shown to the left of the vertical bar. Design with furin lyses RBCs significantly more than samples without design (P = 0.002, two-sided Mann–Whitney U-test) or furin (P = 0.005, two-sided Mann–Whitney U-test) and is on par with positive controls Triton X-100 (P = 0.127, two-sided Mann–Whitney U-test) and melittin (P = 0.132, two-sided Mann–Whitney U-test).

**Fig. 5. Multistate design with PG.**
a, Multistate DSSP conditioning is used to generate a sequence with an alpha/beta fold in the parent state and all alpha in the child A and child B states. b, Implementation of multistate DSSP sequence conditioning. Different DSSP conditioning strings are applied to a full-length parent sequence and two subsequences (child A and child B). RoseTTAFold predictions and model logits are output for parent, child A and child B. A linear combination of output logits is used as a potential to guide the model toward finding one sequence that satisfies all DSSP conditioning strings for parent, child A and child B. c, MS1 family adopts distinct folds by CD. Top, high pLDDT design and AF2 models of family MS1. Bottom, CD spectra and deconvolution of family MS1 indicating 26% beta content in the parent compared to 4% beta content in child A and child B, respectively. d, ACS of ¹H_N and ¹⁵N chemical shift values obtained from MS1–MS4 HSQC spectra. Reference average ACS values of primarily α-helical proteins (red circle) and primarily β-sheet proteins (yellow square) are shown calculated from ¹H_N–¹⁵N correlations using chemical shift information obtained from the Biological Magnetic Resonance Bank. ACS values are compared for multistate sequences among parent (α/β mix fold), child A (α-helical fold) and child B (α-helical fold). MS1 in pink, MS2 in purple, MS3 in blue, MS4 in green. MS2 (e) and MS3 (f) families are designed by PG to adopt distinct folds in the parent and child states with high AF2 confidence (top row). HSQC overlays of MS2 and MS3 child A and B compared to parent (bottom row; ω indicates chemical shift). NMR structures of MS2 and MS3 parent fold into the intended secondary structures with atomic-level accuracy (bottom middle).
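The logit combination in panel b can be sketched as a position-wise weighted sum of the parent scores and the scores from whichever child covers that position. This is a minimal illustration, not the published implementation: the function name, the equal default weights (which reduce to the averaging mentioned in the abstract) and the assumption that child A and child B are contiguous and together span the parent are all hypothetical choices.

```python
def combine_multistate_logits(parent, child_a, child_b,
                              w_parent=0.5, w_child=0.5):
    """Linear combination of parent logits with the matching child logits.

    Assumes child A covers the first len(child_a) parent positions and
    child B covers the remainder (an illustrative layout).
    """
    if len(child_a) + len(child_b) != len(parent):
        raise ValueError("children must together span the parent sequence")
    children = child_a + child_b  # align child rows to parent positions
    return [[w_parent * p + w_child * c for p, c in zip(p_row, c_row)]
            for p_row, c_row in zip(parent, children)]


def argmax_indices(logits):
    """Index of the top-scoring residue class at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy two-letter alphabet, three positions; child A = position 0, child B = 1-2.
parent_logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
child_a_logits = [[0.0, 4.0]]
child_b_logits = [[0.0, 0.0], [0.0, 3.0]]

combined = combine_multistate_logits(parent_logits, child_a_logits, child_b_logits)
consensus = argmax_indices(combined)
```

In this toy, the parent alone prefers class 0 at the first and last positions, but the combined scores select class 1 throughout, showing how a single sequence is steered to satisfy the parent and child conditions at once.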

Cited by

Engineering Dehalogenase Enzymes Using Variational Autoencoder-Generated Latent Spaces and Microfluidics.
Kohout P, Vasina M, Majerova M, Novakova V, Damborsky J, Bednar D, Marek M, Prokop Z, Mazurenko S. JACS Au. 2025 Feb 13;5(2):838-850. doi: 10.1021/jacsau.4c01101. eCollection 2025 Feb 24. PMID: 40017771. Free PMC article.
Supervised fine-tuning of pre-trained antibody language models improves antigen specificity prediction.
Wang M, Patsenker J, Li H, Kluger Y, Kleinstein SH. PLoS Comput Biol. 2025 Mar 31;21(3):e1012153. doi: 10.1371/journal.pcbi.1012153. eCollection 2025 Mar. PMID: 40163503. Free PMC article.
Rational engineering of allosteric protein switches by in silico prediction of domain insertion sites.
Wolf B, Shehu P, Brenker L, von Bachmann AL, Kroell AS, Southern N, Holderbach S, Eigenmann J, Aschenbrenner S, Mathony J, Niopek D. Nat Methods. 2025 Aug;22(8):1698-1706. doi: 10.1038/s41592-025-02741-z. Epub 2025 Aug 4. PMID: 40759748. Free PMC article.
Discovery, design, and engineering of enzymes based on molecular retrobiosynthesis.
Chen A, Peng X, Shen T, Zheng L, Wu D, Wang S. mLife. 2025 Mar 28;4(2):107-125. doi: 10.1002/mlf2.70009. eCollection 2025 Apr. PMID: 40313979. Free PMC article. Review.
Ligand-Induced Biased Activation of GPCRs: Recent Advances and New Directions from In Silico Approaches.
Hashem S, Dougha A, Tufféry P. Molecules. 2025 Feb 25;30(5):1047. doi: 10.3390/molecules30051047. PMID: 40076272. Free PMC article. Review.
