Nat Commun. 2022 Feb 8;13(1):746.
doi: 10.1038/s41467-022-28313-9.

Protein sequence design with a learned potential

Namrata Anand et al. Nat Commun. 2022.

Abstract

The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Fully learned sequence and rotamer design onto fixed protein backbones.
A Sequences are designed onto fixed protein backbones by (1) iteratively selecting a candidate residue position, (2) using a neural network model to sample amino-acid type and conformation, and (3) optimizing the negative pseudo-log-likelihood of the sequence under the model via simulated annealing. (Inset, left) Given the local chemical environment around a residue position (box, dashed, not to scale), residue type and rotamer angles are sampled from network-predicted distributions. B The neural network model is trained to predict residue identity and rotamer angles in an autoregressive fashion, conditioning on ground-truth data (black). The trained classifier predicts amino-acid type as well as rotamer angles conditioned on the amino-acid type. Cross-entropy loss objectives are shown in pink.
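The design loop in panel A can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_conditional` is a hypothetical stand-in for the trained network, the `preferred` string is an invented per-position preference used only to give the toy model something to learn, rotamer sampling is omitted, and the geometric temperature schedule and Metropolis acceptance rule are common simulated-annealing choices that the caption does not specify.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_conditional(sequence, position, preferred):
    """Hypothetical stand-in for the network: p(residue type | environment).

    Favors an invented per-position preference and, weakly, the left
    neighbor's identity, so the distribution depends on sequence context.
    """
    weights = []
    for aa in AMINO_ACIDS:
        w = 1.0
        if aa == preferred[position]:
            w += 8.0
        if position > 0 and aa == sequence[position - 1]:
            w += 1.0
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def negative_pll(sequence, preferred):
    """Negative pseudo-log-likelihood: -sum_i log p(s_i | rest of sequence)."""
    total = 0.0
    for i, aa in enumerate(sequence):
        probs = toy_conditional(sequence, i, preferred)
        total -= math.log(probs[AMINO_ACIDS.index(aa)])
    return total


def anneal_design(preferred, n_steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Minimize the negative PLL by simulated annealing over single-site moves."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in preferred]
    energy = negative_pll(seq, preferred)
    initial_energy = energy
    best_seq, best_energy = list(seq), energy
    for step in range(n_steps):
        # Assumed geometric cooling from t_start to t_end.
        frac = step / max(n_steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        # (1) select a candidate residue position
        pos = rng.randrange(len(seq))
        # (2) sample an amino-acid type from the model's conditional distribution
        probs = toy_conditional(seq, pos, preferred)
        proposal = rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
        candidate = list(seq)
        candidate[pos] = proposal
        # (3) accept or reject based on the change in negative PLL
        cand_energy = negative_pll(candidate, preferred)
        delta = cand_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            seq, energy = candidate, cand_energy
            if energy < best_energy:
                best_seq, best_energy = list(seq), energy
    return "".join(best_seq), best_energy, initial_energy


if __name__ == "__main__":
    designed, final_e, start_e = anneal_design("MKVLAAGIVG")
    print(designed, round(start_e, 2), "->", round(final_e, 2))
```

Recomputing the full negative PLL at every step keeps the sketch short; a practical implementation would likely update only the terms whose local environment changed.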
Fig. 2. Generalization of model design to unseen topologies.
Data are presented as mean values ± 95% CI or as box plots with a median center, bounds of boxes corresponding to interquartile range (IQR), whisker length 1.5 × IQR, and outliers rendered outside of this range. A The trained model is used to either repack rotamers or design entirely new sequences onto unseen test set structures with non-train-set CATH topologies. B, C Model-guided rotamer recovery for native test cases. B Rotamer repacking accuracy for buried core residues versus solvent-exposed residues as a function of degree cutoff. C Five models superimposed with side chains shown as black lines compared to the native conformation shown in purple outline for test case 3mx7. D–H Performance of sequence design onto test case backbones. D Native sequence recovery rate across 50 designs for all residues vs. buried core residues. E Position-wise amino-acid distributions for test case 1cc8. Columns are normalized. (Top) Native sequence and aligned homologous sequences from MSA (n = 670). (Bottom) Model designs (n = 50). F Cross-entropy of Psipred secondary structure prediction from a sequence with respect to DSSP assignments. G Fraction occurrence of glycines at positive ϕ backbone positions across test cases. H Fraction occurrence of N-terminal helical capping residues across designs for test cases with capping positions. I, J Far UV circular dichroism (CD) spectroscopy data for selected test case designs. I Mean residue ellipticity θMRW (10³ deg cm² dmol⁻¹) for CD wavelength scans at 20 °C for native structures (blue, dashed) vs. select model designs (orange, solid) 1acf d3, 1bkr d2, 1cc8 d2, and 3mx7 d4. Sequence identity to native reported within each panel. J Thermal melting curves for select model designs monitoring θMRW (10³ deg cm² dmol⁻¹) at 222 nm or 217 nm for 3mx7.
Fig. 3. Model captures sequence–structure relationship.
A, B Decoy ranking by model negative pseudo-log-likelihood (PLL) of the native sequence. A Model negative PLL vs. alpha-carbon RMSD (Å) to the native structure for Rosetta ab initio decoys. Points are colored by average side-chain RMSD to native (Å). In some cases, the model assigns low negative PLL to high RMS backbones; for example, for 1cc8 an alternative pattern of beta-strand pairing is shown (Inset). B Model negative PLL of low backbone RMS structures (CA RMSD < 5 Å) vs. average side-chain RMSD (Å). Box highlights low model negative PLL assigned to low side-chain RMSD decoys. C Spearman rank correlation between model negative PLL or Rosetta energy vs. structure alpha-carbon RMSD (Å) as a function of increasing RMSD cutoff. In the low RMS regime (<5 Å), the model and Rosetta are able to rank low RMS structures to a similar extent.
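The analysis in panel C, a Spearman rank correlation recomputed over decoys below an increasing RMSD cutoff, can be sketched as below. This is an illustrative sketch only: the function names, the minimum of three decoys per cutoff, and the example numbers are assumptions, not values from the paper. Spearman correlation is computed here as the Pearson correlation of average ranks, which is the standard definition when ties are present.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(vx * vy)


def correlation_by_cutoff(rmsd, score, cutoffs, min_decoys=3):
    """Spearman correlation of score vs. RMSD over decoys with RMSD < cutoff."""
    results = {}
    for cutoff in cutoffs:
        kept = [(r, s) for r, s in zip(rmsd, score) if r < cutoff]
        if len(kept) < min_decoys:
            results[cutoff] = None  # too few decoys to rank meaningfully
            continue
        kept_rmsd, kept_score = zip(*kept)
        results[cutoff] = spearman(kept_rmsd, kept_score)
    return results


if __name__ == "__main__":
    # Invented numbers: scores track RMSD for near-native decoys, but one
    # distant decoy receives a misleadingly low score.
    rmsd = [1.0, 2.0, 3.0, 4.0, 10.0]
    score = [1.0, 2.0, 3.0, 4.0, 0.0]
    print(correlation_by_cutoff(rmsd, score, cutoffs=[5.0, 12.0]))
```

In the invented example, the correlation is perfect below 5 Å and drops once the high-RMSD decoy with a low score is included, which mirrors the kind of case the caption describes for 1cc8.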
Fig. 4. Model discovery of novel sequence features.
A Overlay of crystal structures (blue) with template TIM-barrel backbone for F2C (pink) and F15C (yellow). Alpha-carbon RMSD (Å) and sequence identity to sTIM-11 sequence are given below structures. B Percent sequence identity (indicated by graph edge color and thickness) between TIM-barrel subunits for model TIM-barrel designs (orange) and previously characterized sequences for the same scaffold (blue), including sTIM-11 (5bvl, S11), DeNovoTIM15 (6wvs, D15), and DeNovoTIMs (N6, N13, N14a, N14b). N14a and N14b are two quarters of the two-fold symmetric DeNovoTIM14. C, D Investigation of sequence features for the symmetric subunit near the top of the barrel (cyan shadow) and the helix interface between symmetric subunits (orange shadow) for C F2C and D F15C. Crystal structures are shown in blue overlaid with the design template (pink: F2C, yellow: F15C). E–H Closer inspection of novel sequence features designed by the model.
