Nat Commun. 2022 Feb 8;13(1):746.

doi: 10.1038/s41467-022-28313-9.

Protein sequence design with a learned potential

Namrata Anand¹, Raphael Eguchi², Irimpan I Mathews³, Carla P Perez⁴, Alexander Derry⁵, Russ B Altman¹,⁶, Po-Ssu Huang⁷

Affiliations

¹ Department of Bioengineering, Stanford University, Stanford, CA, USA.
² Department of Biochemistry, Stanford University, Stanford, CA, USA.
³ Stanford Synchrotron Radiation Lightsource, Menlo Park, CA, 94025, USA.
⁴ Biophysics Program, Stanford University, Stanford, CA, USA.
⁵ Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA.
⁶ Departments of Genetics and Medicine, Stanford University, Stanford, CA, USA.
⁷ Department of Bioengineering, Stanford University, Stanford, CA, USA. possu@stanford.edu.

PMID: 35136054
PMCID: PMC8826426
DOI: 10.1038/s41467-022-28313-9

Namrata Anand et al. Nat Commun. 2022.

Abstract

The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Fully learned sequence and rotamer design onto fixed protein backbones.**
A Sequences are designed onto fixed protein backbones by (1) iteratively selecting a candidate residue position, (2) using a neural network model to sample amino-acid type and conformation, and (3) optimizing the negative pseudo-log-likelihood of the sequence under the model via simulated annealing. (Inset, left) Given the local chemical environment around a residue position (box, dashed, not to scale), residue type and rotamer angles are sampled from network-predicted distributions. B The neural network model is trained to predict residue identity and rotamer angles in an autoregressive fashion, conditioning on ground-truth data (black). The trained classifier predicts amino-acid type as well as rotamer angles conditioned on the amino-acid type. Cross-entropy loss objectives are shown in pink.
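The design loop in panel A can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_conditional_probs` is a hypothetical stand-in for the trained network (it only penalizes repeating a neighbouring residue), rotamer sampling is omitted, and the Metropolis acceptance rule and geometric cooling schedule are generic simulated-annealing choices rather than details confirmed by the caption.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_conditional_probs(seq, i):
    """Hypothetical stand-in for the network: p(residue at i | rest of sequence).

    Toy rule (an assumption, not the paper's model): a residue identical to
    an immediate neighbour is down-weighted.
    """
    weights = {}
    for aa in AMINO_ACIDS:
        w = 1.0
        if i > 0 and aa == seq[i - 1]:
            w *= 0.2
        if i < len(seq) - 1 and aa == seq[i + 1]:
            w *= 0.2
        weights[aa] = w
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def negative_pll(seq):
    """Negative pseudo-log-likelihood: -sum_i log p(seq[i] | context of i)."""
    return -sum(
        math.log(toy_conditional_probs(seq, i)[seq[i]]) for i in range(len(seq))
    )


def anneal(seq, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over sequences; returns the best sequence seen."""
    rng = random.Random(seed)
    current = list(seq)
    current_score = negative_pll(current)
    best, best_score = current[:], current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # (1) Select a candidate residue position.
        i = rng.randrange(len(current))
        # (2) Sample a residue type from the model's conditional distribution.
        probs = toy_conditional_probs(current, i)
        proposal = current[:]
        proposal[i] = rng.choices(list(probs), weights=list(probs.values()))[0]
        # (3) Accept or reject based on the change in negative PLL.
        proposal_score = negative_pll(proposal)
        delta = proposal_score - current_score
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score < best_score:
                best, best_score = current[:], current_score
    return "".join(best), best_score


if __name__ == "__main__":
    designed, score = anneal("A" * 12, steps=500)
    print(designed, round(score, 3))
```

Because the function tracks the best sequence seen so far, the returned score is never higher than the negative PLL of the starting sequence.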

**Fig. 2. Generalization of model design to unseen topologies.**
Data are presented as mean values ± 95% CI or as box plots with a median center, bounds of boxes corresponding to interquartile range (IQR), whisker length 1.5 × IQR, and outliers rendered outside of this range. A The trained model is used to either repack rotamers or design entirely new sequences onto unseen test set structures with non-train-set CATH topologies. B, C Model-guided rotamer recovery for native test cases. B Rotamer repacking accuracy for buried core residues versus solvent-exposed residues as a function of degree cutoff. C Five models superimposed with side chains shown as black lines compared to the native conformation shown in purple outline for test case *3mx7*. D–H Performance of sequence design onto test case backbones. D Native sequence recovery rate across 50 designs for all residues vs. buried core residues. E Position-wise amino-acid distributions for test case *1cc8*. Columns are normalized. (Top) Native sequence and aligned homologous sequences from MSA (n = 670). (Bottom) Model designs (n = 50). F Cross-entropy of Psipred secondary structure prediction from a sequence with respect to DSSP assignments. G Fraction occurrence of glycines at positive ϕ backbone positions across test cases. H Fraction occurrence of N-terminal helical capping residues across designs for test cases with capping positions. I, J Far UV circular dichroism (CD) spectroscopy data for selected test case designs. I Mean residue ellipticity θ_MRW (10³ deg cm² dmol⁻¹) for CD wavelength scans at 20 °C for native structures (blue, dashed) vs. select model designs (orange, solid) *1acf d3*, *1bkr d2*, *1cc8 d2*, and *3mx7 d4*. Sequence identity to native reported within each panel. J Thermal melting curves for select model designs monitoring θ_MRW (10³ deg cm² dmol⁻¹) at 222 nm or 217 nm for *3mx7*.
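The native sequence recovery metric in panel D can be sketched as the fraction of positions at which a design matches the native sequence, optionally restricted to a subset such as buried core residues. The function names and the boolean `mask` argument are hypothetical; how core positions are actually identified is not specified here, so the mask is simply taken as given.

```python
def sequence_recovery(native, design, mask=None):
    """Fraction of (optionally masked) positions where design equals native."""
    if len(native) != len(design):
        raise ValueError("native and design must have the same length")
    positions = [i for i in range(len(native)) if mask is None or mask[i]]
    if not positions:
        raise ValueError("no positions selected")
    matches = sum(native[i] == design[i] for i in positions)
    return matches / len(positions)


def mean_recovery(native, designs, mask=None):
    """Average recovery over several designs for the same backbone."""
    return sum(sequence_recovery(native, d, mask) for d in designs) / len(designs)


if __name__ == "__main__":
    native = "ACDE"                          # toy sequence, not real data
    core = [True, True, False, False]        # assumed core mask
    print(sequence_recovery(native, "ACDF"))        # all residues
    print(sequence_recovery(native, "ACDF", core))  # core residues only
```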

**Fig. 3. Model captures sequence–structure relationship.**
A, B Decoy ranking by model negative pseudo-log-likelihood (PLL) of the native sequence. A Model negative PLL vs. alpha-carbon RMSD (Å) to the native structure for Rosetta ab initio decoys. Points are colored by average side-chain RMSD to native (Å). In some cases, the model assigns low negative PLL to high RMS backbones; for example, for *1cc8* an alternative pattern of beta-strand pairing is shown (Inset). B Model negative PLL of low backbone RMS structures (CA RMSD < 5 Å) vs. average side-chain RMSD (Å). Box highlights low model negative PLL assigned to low side-chain RMSD decoys. C Spearman rank correlation between model negative PLL or Rosetta energy vs. structure alpha-carbon RMSD (Å) as a function of increasing RMSD cutoff. In the low RMS regime (<5 Å), the model and Rosetta are able to rank low RMS structures to a similar extent.
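The analysis in panel C, rank correlation between a score and alpha-carbon RMSD as the RMSD cutoff grows, can be sketched in plain Python. The helper names and the toy numbers in the usage example are illustrative assumptions, not data from the paper; the minimum of three decoys per cutoff is likewise an arbitrary choice.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(list(x)), average_ranks(list(y)))


def spearman_by_cutoff(scores, rmsds, cutoffs, min_points=3):
    """Spearman correlation of score vs. RMSD among decoys below each cutoff."""
    result = {}
    for cutoff in cutoffs:
        kept = [(s, r) for s, r in zip(scores, rmsds) if r < cutoff]
        if len(kept) < min_points:
            continue  # too few decoys for a meaningful correlation
        s, r = zip(*kept)
        result[cutoff] = spearman(s, r)
    return result


if __name__ == "__main__":
    rmsds = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0]         # toy decoy RMSDs (Å)
    scores = [10.0, 11.0, 12.0, 13.0, 5.0, 4.0]    # toy negative PLL values
    print(spearman_by_cutoff(scores, rmsds, cutoffs=[5.0, 10.0]))
```

In the toy data the score tracks RMSD well below 5 Å but not beyond, so the correlation falls as the cutoff increases, mirroring the kind of cutoff dependence the panel examines.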

**Fig. 4. Model discovery of novel sequence features.**
A Overlay of crystal structures (blue) with template TIM-barrel backbone for F2C (pink) and F15C (yellow). Alpha-carbon RMSD (Å) and sequence identity to sTIM-11 sequence are given below structures. B Percent sequence identity (indicated by graph edge color and thickness) between TIM-barrel subunits for model TIM-barrel designs (orange) and previously characterized sequences for the same scaffold (blue), including sTIM-11 (*5bvl*, S11), DeNovoTIM15 (*6wvs*, D15), and DeNovoTIMs (N6, N13, N14a, N14b). N14a and N14b are two quarters of the two-fold symmetric DeNovoTIM14. C, D Investigation of sequence features for the symmetric subunit near the top of the barrel (cyan shadow) and the helix interface between symmetric subunits (orange shadow) for C F2C and D F15C. Crystal structures are shown in blue overlaid with the design template (pink: F2C, yellow: F15C). E–H Closer inspection of novel sequence features designed by the model.

Grants and funding

T32 GM120007/GM/NIGMS NIH HHS/United States
