PLoS One. 2022 Mar 14;17(3):e0265020.

doi: 10.1371/journal.pone.0265020. eCollection 2022.

Large-scale design and refinement of stable proteins using sequence-only models

Jedediah M Singer¹, Scott Novotney¹, Devin Strickland², Hugh K Haddox³, Nicholas Leiby¹, Gabriel J Rocklin⁴, Cameron M Chow³, Anindya Roy³, Asim K Bera³, Francis C Motta⁵, Longxing Cao³, Eva-Maria Strauch⁶, Tamuka M Chidyausiku³, Alex Ford³, Ethan Ho⁷, Alexander Zaitzeff¹, Craig O Mackenzie⁸, Hamed Eramian⁹, Frank DiMaio³, Gevorg Grigoryan¹⁰, Matthew Vaughn⁷, Lance J Stewart³, David Baker³, Eric Klavins²

Affiliations

¹ Two Six Technologies, Arlington, Virginia, United States of America.
² Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, United States of America.
³ Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America.
⁴ Department of Pharmacology and Center for Synthetic Biology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America.
⁵ Department of Mathematical Sciences, Florida Atlantic University, Boca Raton, Florida, United States of America.
⁶ Department of Pharmaceutical and Biomedical Sciences, University of Georgia, Athens, Georgia, United States of America.
⁷ Texas Advanced Computing Center, Austin, Texas, United States of America.
⁸ Quantitative Biomedical Sciences Graduate Program, Dartmouth College, Hanover, New Hampshire, United States of America.
⁹ Netrias, Cambridge, Massachusetts, United States of America.
¹⁰ Departments of Computer Science and Biological Sciences, Dartmouth College, Hanover, New Hampshire, United States of America.

PMID: 35286324
PMCID: PMC8920274
DOI: 10.1371/journal.pone.0265020

Jedediah M Singer et al. PLoS One. 2022.

Abstract

Engineered proteins generally must possess a stable structure in order to achieve their designed function. Stable designs, however, are astronomically rare within the space of all possible amino acid sequences. As a consequence, many designs must be tested computationally and experimentally in order to find stable ones, which is expensive in terms of time and resources. Here we use a high-throughput, low-fidelity assay to experimentally evaluate the stability of approximately 200,000 novel proteins. These include a wide range of sequence perturbations, providing a baseline for future work in the field. We build a neural network model that predicts protein stability given only sequences of amino acids, and compare its performance to the assayed values. We also report another network model that is able to generate the amino acid sequences of novel stable proteins given requested secondary sequences. Finally, we show that the predictive model, despite weaknesses including a noisy data set, can be used to substantially increase the stability of both expert-designed and model-generated proteins.

Conflict of interest statement

JMS and AZ are employed by Two Six Technologies, which has filed a patent on a portion of the technology described in this manuscript. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

**Fig 1. Overview of design and refinement.**
Proteins are designed, either by an expert using Rosetta or dTERMen software or by a neural network model that transforms secondary sequences into primary sequences. These designs are refined to maximize stability via an iterative procedure. At each step, the stability of all possible single-site substitutions is predicted by another neural network model. The mutants with the highest predicted stability are saved and used as seeds for the next round of optimization.

**Fig 2. Evaluator model and performance.**
(A) Architecture of Evaluator Model. (1) Input: one-hot encoding of protein’s primary sequence. (2) Three convolutional layers; the first flattens the one-hot encoding to a single dimension, successive filters span longer windows of sequence. Three dense layers (3) yield trypsin and chymotrypsin stability scores (4). The final stability score (5) is the minimum of the two. (6) A separate dense layer from the final convolution layer yields one-hot encoding of the protein’s secondary structure. (B) Success of EM predictions on a library of new designs. We used the EM to predict the stability of 45,840 new protein sequences that the model had not seen before (described later as “Corpus B”); the distribution of predictions is shown in pink. The blue curve shows the fraction of these designs that were empirically stable (stability score >1.0) as a function of the model’s a priori stability predictions (dotted black line: stability threshold for predicted stability). 281 outliers (predicted stability score <-1.0 or >3.0) excluded for clarity. (C) Predicted versus observed stability scores for the same data, with outliers included.
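
The input and output ends of the Evaluator Model described in panel (A) can be illustrated with a minimal sketch. It shows only the one-hot encoding of a primary sequence, the minimum rule that combines the two protease scores, and the empirical stability threshold of 1.0; the convolutional and dense layers are omitted. The residue ordering and the function names `one_hot`, `final_stability`, and `is_stable` are assumptions made for illustration, not the authors' code.

```python
# Assumed residue alphabet; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode a primary sequence as one 20-element 0/1 row per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


def final_stability(trypsin_score, chymotrypsin_score):
    """Combine the two protease stability scores by taking their minimum."""
    return min(trypsin_score, chymotrypsin_score)


def is_stable(score, threshold=1.0):
    """Apply the caption's criterion: stable means stability score > 1.0."""
    return score > threshold
```

Taking the minimum means a design counts as stable only if it resists both proteases, so a high score against one cannot mask a low score against the other.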

**Fig 3. Refinement and its effects.**
(A) Beam search refinement. Refinement begins with a protein’s amino acid sequence (left, green). All possible single-site substitutions are generated (bold red characters in middle sequences), and they are sorted according to the EM’s prediction of their stability (middle). The design with the highest predicted stability (middle, green) is reserved as the product of refinement at this stage. The k single-site substitutions with the highest predicted stability (middle, green and yellow; k = 2 in this illustration, though we used k = 50 to stabilize proteins) are then used as new bases. For each of the k new bases, the process was repeated, combining all single-site substitutions of all k new bases in the new sorted list (right). In this fashion, we predicted the best mutations of 1–5 amino acid substitutions for each of the base designs. (B) Effect of guided and random substitutions on expert-designed proteins. Guided substitutions (orange) raised the mean stability score from 0.23 in the base population (green) to 1.27 after five amino acid changes, as compared to random substitutions (blue) which dropped it to -0.06. Because stability score is logarithmic, the increase in stability is more than ten-fold after five guided substitutions. Annotated black bars indicate means, notches indicate bootstrapped 95% confidence intervals around the medians, boxes indicate upper and lower quartiles, and whiskers indicate 1.5 times the inter-quartile range.
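
The beam search in panel (A) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `score` stands in for the Evaluator Model and may be any function mapping a sequence to a predicted stability, ties are broken alphabetically, and the names `single_site_substitutions` and `beam_refine` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_site_substitutions(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield seq[:i] + aa + seq[i + 1:]


def beam_refine(seq, score, rounds=5, k=50):
    """Return the top-scoring candidate found at each round of refinement.

    score: stand-in for the EM (assumption), higher is better.
    k: beam width, i.e. how many candidates seed the next round.
    """
    bases = [seq]
    best_per_round = []
    for _ in range(rounds):
        # Pool the single-site substitutions of all current bases.
        candidates = set()
        for base in bases:
            candidates.update(single_site_substitutions(base))
        # Sort by predicted stability, best first; break ties alphabetically.
        ranked = sorted(candidates, key=lambda s: (-score(s), s))
        best_per_round.append(ranked[0])  # product of refinement at this stage
        bases = ranked[:k]                # seeds for the next round
    return best_per_round


# Toy usage: a made-up score that simply counts tryptophans.
refined = beam_refine("AAAA", lambda s: s.count("W"), rounds=2, k=2)
```

A sequence of length n has 19n single-site substitutions, so each round scores at most 19nk candidates. Note that this sketch does not stop a later round from re-mutating or reverting an earlier position; whether the original procedure guards against that is not stated in the caption.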

**Fig 4. Generator model and its performance.**
(A) Architecture of the GM. Adapted for use with protein secondary and primary sequences from [45]. (B) Density plot of experimental stability scores for training designs, designs from the GM, and scrambles of the GM designs. (C) Density plot of trypsin EC₅₀ values. (D) Density plot of chymotrypsin EC₅₀ values.

**Fig 5. Refinement of GM designs, overall and as a function of novelty.**
(A) Effect of guided and random substitutions on designs created by the GM. The base stability score was much higher for this population of designs than for the expert-designed proteins tested, with a mean of 0.67; EM-guided refinement further increased it to 1.67. As with the expert-designed proteins, this demonstrates a ten-fold increase in stability. Random substitutions again had a deleterious effect, dropping mean stability to 0.29. (B) Stability of GM designs, and guided and random substitutions within those designs, as novelty increases. We consider designs to be more novel when BLAST percent identity with the most-similar design in the training corpus is lower.
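
The fold-change claims in the Fig 3 and Fig 5 captions follow from the logarithmic scale of the stability score. The sketch below assumes the logarithm is base 10, which is consistent with the captions' statement that a gain of about one unit corresponds to a ten-fold increase; the function name `fold_change` is illustrative.

```python
def fold_change(before, after, base=10.0):
    """Convert a difference in log-scale stability scores to a fold change.

    base=10 is an assumption consistent with the captions' ten-fold claims.
    """
    return base ** (after - before)


# Mean stability scores quoted in the captions.
expert_guided = fold_change(0.23, 1.27)   # expert designs, guided: more than ten-fold
expert_random = fold_change(0.23, -0.06)  # expert designs, random: below 1, a loss
gm_guided = fold_change(0.67, 1.67)       # GM designs, guided: ten-fold
```

A fold change below 1 indicates a loss of stability, as with the random substitutions.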

**Fig 6. Differential effects on stability between guided and random single-site substitutions.**
For each original amino acid (indexed on the y-axis) and each replacement amino acid (indexed on the x-axis), the mean effect on stability when that substitution was guided by the EM is computed, as is the mean effect on stability when that substitution was applied randomly. The difference between these two effects is plotted for each from-to pair that was represented in the data; redder circles indicate that guided substitutions were more beneficial for stability, bluer circles indicate that random substitutions were more beneficial. Circles with heavy black outlines showed a significant difference (two-sample unpaired two-tailed t-test, p < 0.05 uncorrected) between guided and random effects. Bar graphs indicate mean differences in stability score (guided substitutions minus random substitutions) averaged across all replacement amino acids for each original amino acid (left) and averaged across all original amino acids for each replacement amino acid (bottom).
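
The quantity plotted in each circle can be computed along the following lines. This is a hedged sketch: the record layout `(original, replacement, change in stability score)` and the function names are assumptions, the numbers in the usage lines are made up for illustration only, and the significance test from the caption is not included.

```python
from collections import defaultdict
from statistics import mean


def mean_effect_by_pair(records):
    """Average the stability change for each (original, replacement) pair.

    records: iterable of (original_aa, replacement_aa, delta_stability).
    """
    grouped = defaultdict(list)
    for original, replacement, delta in records:
        grouped[(original, replacement)].append(delta)
    return {pair: mean(deltas) for pair, deltas in grouped.items()}


def guided_minus_random(guided_records, random_records):
    """Difference of mean effects for pairs present in both data sets."""
    guided = mean_effect_by_pair(guided_records)
    random_ = mean_effect_by_pair(random_records)
    return {pair: guided[pair] - random_[pair]
            for pair in guided.keys() & random_.keys()}


# Made-up illustrative records, not data from the paper.
guided = [("A", "W", 0.5), ("A", "W", 0.3), ("K", "E", 0.1)]
random_subs = [("A", "W", -0.1), ("A", "W", -0.3)]
differences = guided_minus_random(guided, random_subs)
```

Positive values correspond to the redder circles (guided substitutions more beneficial), negative values to the bluer ones. Pairs that appear in only one of the two data sets are skipped here, since no difference can be formed for them.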

**Fig 7. Laboratory analyses of GM proteins.**
(A) Results of targeted analyses of twelve GM proteins. All twelve proteins had less than 60% identity with respect to the entire set of training proteins, as calculated by BLAST. Reported topology was predicted by PSIPRED [47] and Rosetta (in that order, when predictions differ). (B) “Life cycle” of one refined protein, nmt_0994_guided_02. The design began with a requested secondary structure fed into the GM. The GM produced a primary sequence (nmt_0994) stochastically translated from that secondary structure; however, the Reverse GM correctly predicted that two of the requested helices were actually merged into one in the generated protein’s structure. EM-guided refinement then changed two residues to tryptophan, which raised the empirical stability score from -0.18 to 1.88. Green characters highlight differences from original sequences. (C) Crystal structure for nmt_0994_guided_02 (dark grey), showing that it also has the three helices predicted by the Reverse GM for its pre-refinement progenitor. It is shown aligned to the structure predicted by AlphaFold2 (cyan). The prediction and the crystal structure have a Cα RMSD of 3.4 Å.
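
For readers unfamiliar with the RMSD figure quoted in panel (C), the root-mean-square deviation over paired alpha-carbon positions can be sketched as below. The sketch assumes the two structures are already superimposed and that residues are matched one to one; it does not perform the alignment itself, and the function name `ca_rmsd` is illustrative.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) alpha-carbon positions.

    Assumes the structures are already superimposed (no fitting is done here).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Coordinates in ångströms give an RMSD in ångströms. Because the mean is taken over squared distances, a few badly displaced residues can dominate the value.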

References

1. Chevalier A, Silva DA, Rocklin GJ, Hicks DR, Vergara R, Murapa P, et al. Massively parallel de novo protein design for targeted therapeutics. Nature. 2017;550(7674):74–79. doi: 10.1038/nature23912
2. Jiang L, Althoff EA, Clemente FR, Doyle L, Röthlisberger D, Zanghellini A, et al. De Novo Computational Design of Retro-Aldol Enzymes. Science. 2008;319(5868):1387–1391. doi: 10.1126/science.1152692
3. King NP, Sheffler W, Sawaya MR, Vollmar BS, Sumida JP, André I, et al. Computational Design of Self-Assembling Protein Nanomaterials with Atomic Level Accuracy. Science. 2012;336(6085):1171–1174. doi: 10.1126/science.1219364
4. Alford RF, Leaver-Fay A, Jeliazkov JR, O’Meara MJ, DiMaio FP, Park H, et al. The Rosetta all-atom energy function for macromolecular modeling and design. Journal of Chemical Theory and Computation. 2017;13(6):3031–3048. doi: 10.1021/acs.jctc.7b00125
5. Magliery TJ. Protein stability: computation, sequence statistics, and new experimental methods. Current Opinion in Structural Biology. 2015;33:161–168. doi: 10.1016/j.sbi.2015.09.002
