[Preprint]. 2023 May 25:2023.05.24.542194.
doi: 10.1101/2023.05.24.542194.

An all-atom protein generative model


Alexander E Chu et al. bioRxiv.

Update in

  • Chu AE, Kim J, Cheng L, El Nesr G, Xu M, Shuai RW, Huang PS. An all-atom protein generative model. Proc Natl Acad Sci U S A. 2024 Jul 2;121(27):e2311500121. doi: 10.1073/pnas.2311500121. Epub 2024 Jun 25. PMID: 38916999. Free PMC article.

Abstract

Proteins mediate their functions through chemical interactions; modeling these interactions, which typically occur through sidechains, is an important need in protein design. However, constructing an all-atom generative model requires an appropriate scheme for managing the jointly continuous and discrete nature of proteins, encoded in the structure and sequence. We describe Protpardelle, an all-atom diffusion model of protein structure which instantiates a "superposition" over the possible sidechain states and collapses it to conduct reverse diffusion for sample generation. When combined with sequence design methods, our model is able to co-design all-atom protein structure and sequence. Generated proteins are of good quality under the typical quality, diversity, and novelty metrics, and their sidechains reproduce the chemical features and behavior of natural proteins. Finally, we explore the potential of our model to conduct all-atom protein design and scaffold functional motifs in a backbone- and rotamer-free way.


Figures

Figure 1. Superposition modeling approach and denoising scheme for Protpardelle.
(a) The basic idea of denoising protein structures by integrating an ODE. Given noisy data xt, we run the denoising network to predict the fully denoised data, x0. From xt, x0, and the noise level σt, we can estimate the score, the gradient that points in the direction of the data. We can then take a denoising step (integrating the ODE) by choosing a step size ∆σ and computing an update ∆x to xt, which yields slightly denoised data xt−1. Repeating this many times iteratively denoises the sample and produces protein samples. The noising process is defined by the marginal distributions, which noise protein structures by simply adding Gaussian noise to the atomic coordinates. The scale of these Gaussians increases linearly with time, thereby inducing mostly linear ODE solution trajectories. In our model, the forward noise process acts only on real proteins (with one sidechain per amino acid), whereas the reverse denoising process acts on the full “superposition” over all possible sidechains. (b) A visualization of the Protpardelle sampling routine for a single residue position. The vertical axis lists the structural elements being denoised (i.e., the atoms of the 20 sidechains in the superposition, plus the backbone atoms). The horizontal axis denotes progression in sampling time, with each letter denoting the amino acid predicted for this position at a given timestep. Note that this prediction can change from step to step. Briefly, at each timestep, we use the predicted amino acid to collapse the superposition and form a “real” yet noisy protein, predict denoised positions for each of the atoms in this protein, and then take a denoising step for selected atoms. The size of the denoising step for each atom or sidechain is determined by the last time that atom or sidechain took a denoising step. Each amino acid sidechain from the superposition is denoised only when it is selected by the sequence model. This means that the size of the denoising/integration step can vary depending on how frequently that amino acid is predicted. The backbone is denoised at every step, since these atoms are common to all amino acids. For more details and the pseudocode of the sampling algorithm, see the Methods section and Algorithm 1. (c) A visualization of the sidechain superposition idea and how it might be collapsed or updated at each denoising step. Sidechains for all 20 amino acids are modeled at once (here aligned on their N, CA, and C atoms). Given an amino acid type, we can collapse the superposition from all states to a single state, which yields a “valid” residue or protein. Alternatively, given an amino acid type and newly predicted coordinates for that sidechain, we can update the superposition with new information.
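The per-step update described in (a) can be sketched as a single Euler integration step of the denoising ODE. This is a minimal illustration using the caption's symbols (xt, x0, σt); the function name, array shapes, and the specific Karras-style parameterization are assumptions, not taken from the paper's code.

```python
import numpy as np

def euler_denoise_step(x_t, x0_hat, sigma_t, sigma_next):
    """One Euler step of the denoising ODE.

    x_t:        noisy coordinates at noise level sigma_t, shape (n_atoms, 3)
    x0_hat:     the network's prediction of the fully denoised coordinates
    sigma_next: the next (lower) noise level; sigma_next - sigma_t plays the
                role of the step size delta-sigma from the caption
    """
    # Estimated score direction: points from the noisy sample toward the
    # denoised prediction, scaled by 1/sigma_t.
    d = (x_t - x0_hat) / sigma_t
    # Take the step: delta_x = delta_sigma * d, yielding slightly denoised x_{t-1}.
    return x_t + (sigma_next - sigma_t) * d
```

Note that stepping all the way to sigma_next = 0 in one Euler step simply returns x0_hat; the iterative scheme instead takes many small steps, re-predicting x0 at each one.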
Figure 2. Evaluation of proteins sampled from the backbone-only model.
(a) Self-consistency performance. We show the RMSD = 2 threshold (dashed line) and the proportion of samples passing this threshold, smoothed with a sliding window of size 11 (solid line). Eight backbones were sampled at each length from 50 to 400. For each backbone, the best of 8 ProteinMPNN-designed sequences is selected, and ESMFold is used for all structure predictions. (b) The same samples and ESMFold predictions as in (a), but using the scTM metric. TM score is computed using the same alignment as for RMSD. The dashed line indicates TM = 0.5; the solid line indicates the proportion of samples with TM > 0.5, smoothed with a sliding window of size 11. (c) Example high-quality, novel backbone model samples (green), shown aligned to the ESMFold prediction (blue) and the nearest neighbor in the dataset (red). The length, scRMSD, pLDDT, and nnTM metrics for each sample are also shown. (d) Number of structure clusters versus number of samples drawn (left axis), and the ratio of clusters to samples (right axis). Samples drawn uniformly over each length from 50 to 256 are used for this plot; the first 1, 2, 3, … samples for each length are used (so the number of samples is 206, 412, 618, and so on, and there is always the same number of proteins at each length). (e) Secondary structure content of samples, computed by DSSP. (f) Nearest neighbor distances for model samples with scRMSD < 5. The nnTM is the TM score against the dataset member with the highest TM score to the sample, extracted with Foldseek.
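The smoothed "proportion passing" curves in (a) and (b) amount to a thresholded pass/fail signal run through a centered moving average. A minimal sketch; the function name and the edge handling are my own choices, not taken from the paper:

```python
import numpy as np

def smoothed_pass_rate(metric, threshold, window=11, pass_below=True):
    """Per-sample pass/fail smoothed with a centered sliding window.

    metric:    1D array of scRMSD (pass_below=True) or scTM (pass_below=False)
               values, ordered by protein length
    threshold: e.g. 2.0 for scRMSD, 0.5 for scTM
    """
    metric = np.asarray(metric, dtype=float)
    passed = (metric < threshold) if pass_below else (metric > threshold)
    kernel = np.ones(window) / window
    # mode="same" keeps the output aligned with the input; values near the
    # edges average over fewer real points and are biased toward zero.
    return np.convolve(passed.astype(float), kernel, mode="same")
```

Away from the edges, an all-passing stretch of samples yields a rate of exactly 1.0.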
Figure 3. Evaluation of proteins sampled from the all-atom model.
(a) Self-consistency performance computed as in Fig. 2, but for the all-atom model. Eight proteins were sampled at each length from 50 to 256. Each protein's own sequence is used for ESMFold, i.e., only one sequence is predicted per sample, rather than the 8 sequences per sample predicted for the backbone-only model samples. The success proportion line is smoothed with a sliding window of size 21. (b) The same samples and ESMFold predictions as in (a), but using the scTM metric. (c) Example high-quality, novel all-atom model samples. (d) Number of structure clusters versus number of samples drawn (left axis), and the ratio of clusters to samples (right axis). Samples are drawn uniformly over each length from 50 to 256, as in Fig. 2. (e) Secondary structure content of samples, computed by DSSP. (f) Nearest neighbor distances for model samples with scRMSD < 5. The nnTM is the TM score against the dataset member with the highest TM score to the sample, extracted with Foldseek.
Figure 4. Analysis of generated all-atom structures, including sidechains.
(a–c) Comparison of distributions of (a) bond lengths, (b) bond angles, and (c) chi angles for training data and model samples. Quantities for real data are computed from 100 random proteins from the training set; quantities for model samples are computed from 1 sample at each even-numbered protein length from 50 to 256. (d) Detailed views of raw all-atom model output with sidechains built. These are the same proteins as those shown in Fig. 3, with sidechains rendered. The bond length RMSE shown is computed as the deviation of each individual bond length from its idealized value, averaged over bonds, in angstroms. For comparison, unrelaxed structures of natural proteins typically have an average bond length RMSE of 0.01–0.02 Å. (e) Distribution of fa_dun energies for model samples and natural proteins. Statistics are computed from 5000 residues chosen at random (without regard to individual proteins) from each of the dataset and the set of model samples. The fa_dun energy is computed from the probability of a rotamer given the backbone torsions, plus a potential term for deviation from an ideal chi angle value. (f) The model sample data from (e) visualized on a Ramachandran plot. Each point is a pair of residue backbone torsions, colored by the fa_dun Rosetta energy.
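The bond length RMSE metric in (d) can be computed as the root-mean-square deviation of observed bond lengths from their idealized values. A sketch, assuming the bond list and idealized lengths are supplied by the caller (in practice they would come from a chemical dictionary):

```python
import numpy as np

def bond_length_rmse(coords, bonds, ideal_lengths):
    """RMSE (in angstroms) between observed and idealized bond lengths.

    coords:        (n_atoms, 3) atomic coordinates
    bonds:         sequence of (i, j) atom-index pairs, one per covalent bond
    ideal_lengths: idealized length for each bond, in the same order as bonds
    """
    i, j = np.asarray(bonds).T
    # Euclidean length of each bond vector.
    observed = np.linalg.norm(coords[i] - coords[j], axis=-1)
    return float(np.sqrt(np.mean((observed - np.asarray(ideal_lengths)) ** 2)))
```

By the figure's benchmark, a generated structure with an RMSE near 0.01–0.02 Å under this metric would be comparable to unrelaxed natural proteins.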
Figure 5. Towards all-atom protein design.
Potential applications of our model for new approaches to protein design. The conditioning portion is shown in gold on the model sample (green indicates the model-generated portion). These designs are generated with an initial crop-conditional model and reconstruction guidance. (a) An example design generated by scaffolding a TGF-β1 binding loop, including its sidechains. The original binder (gold) is a de novo designed monobody and thus is guaranteed not to be in the training set. The pink and cyan chains are TGF-β1 (PDB: 4KV5). (b) An example design generated by scaffolding only the functional groups of iron-binding Glu and His residues. The model is given only the atoms after the last chi angle: (CG, CD, OE1, OE2) for the Glu residues and (CB, CG, CD2, CE1, ND1, NE2) for the His residues. The original native fold is chain A of 1BCF.
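Reconstruction guidance of the kind mentioned here steers each denoising step with the gradient of a reconstruction loss between the predicted clean structure and the motif being scaffolded. A rough sketch under the simplifying assumption that the gradient is applied only at the motif atoms and treated as a direct gradient with respect to the noisy sample; the function name, weight, and Euler update are illustrative, not the paper's implementation:

```python
import numpy as np

def guided_denoise_step(x_t, x0_hat, sigma_t, sigma_next,
                        motif_idx, motif_xyz, weight=1.0):
    """Euler denoising step plus a simple reconstruction-guidance term.

    motif_idx: indices of the atoms constrained to the motif
    motif_xyz: target coordinates for those atoms, shape (len(motif_idx), 3)
    """
    # Ordinary denoising direction, as in the Fig. 1a scheme.
    d = (x_t - x0_hat) / sigma_t
    # Gradient of 0.5 * ||x0_hat[motif] - motif_xyz||^2, nonzero only on
    # motif atoms; subtracting it pulls the sample toward the motif.
    grad = np.zeros_like(x_t)
    grad[motif_idx] = x0_hat[motif_idx] - motif_xyz
    return x_t + (sigma_next - sigma_t) * d - weight * grad
```

Each step therefore trades off following the learned score against reconstructing the fixed motif coordinates; the weight controls how strongly the motif constraint is enforced.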
