Nature. 2023 Nov;623(7989):1070-1078.
doi: 10.1038/s41586-023-06728-8. Epub 2023 Nov 15.

Illuminating protein space with a programmable generative model

John B Ingraham et al. Nature. 2023 Nov.

Abstract

Three billion years of evolution has produced a tremendous diversity of protein molecules [1], but the full potential of proteins is likely to be much greater. Accessing this potential has been challenging for both computation and experiments because the space of possible protein molecules is much larger than the space of those likely to have functions. Here we introduce Chroma, a generative model for proteins and protein complexes that can directly sample novel protein structures and sequences, and that can be conditioned to steer the generative process towards desired properties and functions. To enable this, we introduce a diffusion process that respects the conformational statistics of polymer ensembles, an efficient neural architecture for molecular systems that enables long-range reasoning with sub-quadratic scaling, layers for efficiently synthesizing three-dimensional structures of proteins from predicted inter-residue geometries and a general low-temperature sampling algorithm for diffusion models. Chroma achieves protein design as Bayesian inference under external constraints, which can involve symmetries, substructure, shape, semantics and even natural-language prompts. The experimental characterization of 310 proteins shows that sampling from Chroma results in proteins that are highly expressed, fold and have favourable biophysical properties. The crystal structures of two designed proteins exhibit atomistic agreement with Chroma samples (a backbone root-mean-square deviation of around 1.0 Å). With this unified approach to protein design, we hope to accelerate the programming of protein matter to benefit human health, materials science and synthetic biology.


Conflict of interest statement

All authors are employees and shareholders of Generate Biomedicines.

Figures

Fig. 1. Chroma is a generative model for proteins and protein complexes that combines structured diffusion for protein backbones with scalable molecular neural networks for backbone synthesis and all-atom design.
a, A correlated diffusion process with chain and radius-of-gyration constraints gradually transforms protein structures into random collapsed polymers (right to left). The reverse process (left to right) can be expressed in terms of a time-dependent optimal denoiser $\hat{x}_\theta(x_t, t)$ that maps noisy coordinates $x_t$ at time $t$ to predicted denoised coordinates $x_0$. b, We parameterize this in terms of a random graph neural network with long-range connectivity inspired by efficient N-body algorithms (middle) and a fast method for solving for a global consensus structure given predicted inter-residue geometries (right). Another graph-based design network (a, top right) generates protein sequences and side-chain conformations conditionally based on the sampled backbone. c, The time-dependent protein prior learnt by the diffusion model can be combined with composable restraints and constraints for the programmable generation of protein systems.
Fig. 2. Analysis of unconditional samples reveals diverse geometries that exhibit new higher-order structures and refold in silico.
a, A representative set of Chroma-sampled proteins and protein complexes exhibits complex and diverse topologies with high secondary-structure content, including familiar TIM (triose-phosphate isomerase) barrel-like folds (top left), antibody–antigen-like complexes (centre right) and new arrangements of helical bundles and β-sheets. b,c, Despite these qualitative similarities, samples frequently have low nearest-neighbour similarity to structures in the PDB, as measured by nearest-neighbour TM score (b; Supplementary Appendix J.4), with structures demonstrating frequent novelty across length ranges (c). d,e, When we attempted to refold samples in silico using only a single sequence sample per structure, we observed widespread refolding with a high degree of superposition (d), including occasionally in the very high length range of more than 800 residues (e).
Fig. 3. Symmetry, substructure and shape conditioning enable geometric molecular programming.
a, Sampling oligomeric structures with arbitrary chain symmetries is possible by using a conditioner that tessellates an asymmetric subunit in the energy function. Cyclic (Cn), dihedral (Dn), tetrahedral (T), octahedral (O) and icosahedral (I) symmetry groups can produce a wide variety of possible homomeric complexes. The right-most protein complex contains 60 subunits and 60,000 total residues, which is enabled by leveraging symmetries and using our subquadratically scaling architecture. b, Conditioning on partial substructure (monochrome) enables protein infilling or outfilling. The top two rows illustrate regeneration (colour) of half a protein (the enzyme DHFR, first row) or complementarity-determining region loops of a VHH antibody (second row). The next three rows show conditioning on a predefined motif. The order and matching location of motif segments is not prespecified here. c, Conditioning on arbitrary volumetric shapes is exemplified by the complex geometries of the Latin alphabet and Arabic numerals. All structures were selected from protocols with high rates of in silico refolding (Supplementary Appendix K).
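
The tessellation idea in panel a can be illustrated with a minimal sketch for the cyclic case (Cn) only. This is not Chroma's conditioner: the function name `cyclic_copies`, the choice of the z-axis as the symmetry axis and the toy coordinates are all assumptions made for illustration. The sketch simply places n rotated copies of an asymmetric subunit around a common axis.

```python
import numpy as np

def cyclic_copies(subunit: np.ndarray, n: int) -> np.ndarray:
    """Return n copies of `subunit` (shape (L, 3)), each rotated about the
    z-axis by a multiple of 2*pi/n. Output shape is (n, L, 3).

    Hypothetical helper for illustration; the z-axis is an assumed choice.
    """
    copies = []
    for k in range(n):
        angle = 2.0 * np.pi * k / n
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]])
        # Coordinates are row vectors, so apply the rotation as x @ rot.T
        copies.append(subunit @ rot.T)
    return np.stack(copies)

# Toy asymmetric subunit: four points placed away from the symmetry axis
subunit = np.array([[5.0, 0.0, 0.0],
                    [6.0, 1.0, 0.5],
                    [7.0, 0.5, 1.0],
                    [8.0, 0.0, 1.5]])
complex_c3 = cyclic_copies(subunit, 3)  # C3 homotrimer, shape (3, 4, 3)
```

Because every chain is a deterministic function of one subunit, only the subunit's coordinates need to be sampled, which is consistent with the caption's remark that symmetry makes very large complexes tractable.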
Fig. 4. Protein structure classifiers and caption models can bias the sampling process towards user-specified properties.
a, Neural networks trained to predict protein properties can bias unconditional samples (top) towards states that optimize predicted properties, such as secondary-structure composition (bottom) indicated by CATH class level codes (C1, Mainly Alpha; C2, Mainly Beta; C3, Alpha Beta). b, A neural network trained to predict CATH topology annotations can routinely drive generation towards samples with high predicted probabilities of the intended class label, which sometimes aligns with our intended fold topology for highly abundant labels. Left, highly abundant Rossmann fold (CATH topology 3.40.50, 14.0% of training set); middle, highly abundant Ig fold (CATH topology 2.60.40, 9.8% of training set); right, a rare specific β-barrel fold (CATH topology 2.40.155, 0.07% of training set). c, Fine-tuning a multi-label predictor to bias a pretrained large language model into a structure caption predictor can enable natural language conditioning. We begin to see examples of semantic alignment between prompts and output structures for highly abundant classes of structures, although we do not always see this reflected in the time-zero caption perplexity (CP, lower is better). Left, ‘crystal structure of a Rossmann fold’; right, ‘crystal structure of a Fab antibody fragment’.
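
The biasing described in panel a can be related to a general identity for conditional diffusion models. It is given here as background under stated assumptions, not as the paper's exact formulation: $y$ is an assumed symbol for the target property, $p_t(x_t)$ denotes the noised prior at time $t$ and $p_t(y \mid x_t)$ a time-dependent property predictor.

```latex
\[
p_t(x_t \mid y) \;\propto\; p_t(x_t)\, p_t(y \mid x_t)
\quad\Longrightarrow\quad
\nabla_{x_t} \log p_t(x_t \mid y)
  = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t)
\]
```

Under this identity, the gradient of a classifier's log-probability is added to the score of the unconditional model, so several independent conditions contribute additive terms.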
Fig. 5. Experimental validation of Chroma-designed proteins.
a, Protocol for protein design and experimental validation. Unconditional designs: 268 proteins. Semantic conditioning: 12 α-conditioned, 13 β-conditioned, 11 α/β mixtures and 6 with β-barrel topology. See text for details. b, Rank-ordered unconditional Chroma protein solubility scores by the split-GFP assay for 172 tested proteins. Red dots and error bars denote means and standard deviations, respectively, from three biological replicates. c,d, X-ray crystal structures (rainbow) of UNC_079 (c, 1.1 Å resolution, PDB 8TNM, root-mean-square deviation (RMSD) = 1.1 Å) and UNC_239 (d, 2.4 Å resolution, PDB 8TNO, RMSD = 1.0 Å) overlaid with Chroma-generated models (grey). Insets compare each crystal structure (rainbow) with its nearest PDB match (4NH2 and 6AFV, respectively; grey). e, CD data for seven purified Chroma proteins. The fraction of α-helical and β-strand content was determined using BeStSel. Tm is the melting temperature determined by differential scanning calorimetry and SS designates secondary structure. f, CD data for three purified Chroma conditional designs: SEM_018 (α-conditioned), SEM_038 (β-barrel topology) and SEM_011 (α/β mixture). g,h, Correlation between predicted secondary-structure content in Chroma designs compared with the prediction from CD, for α-helical (g) and β-strand (h) content.
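
The RMSD values quoted in panels c and d compare two coordinate sets after superposition. A minimal sketch of that calculation, assuming the standard Kabsch algorithm and matched atom ordering, is shown here. The function name `kabsch_rmsd` and the toy data are hypothetical, and the source does not state which software the authors used.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid
    superposition of P onto Q (Kabsch algorithm). Rows must correspond."""
    # Remove translation by centring both sets on their centroids
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # 3x3 covariance matrix and its singular value decomposition
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Flip one axis if needed so the result is a proper rotation (det = +1)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P (row vectors) and measure the residual deviation
    P_rot = P0 @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q0) ** 2, axis=1))))

# Toy check: a rotated and translated copy should superpose almost exactly
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])
Q = P @ rot_z90.T + np.array([1.0, -2.0, 3.0])
rmsd_rigid = kabsch_rmsd(P, Q)  # expected to be close to zero
```

For a backbone RMSD, `P` and `Q` would hold matched backbone atom coordinates (for example Cα atoms) from the designed model and the crystal structure.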

References

    1. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–D489. doi: 10.1093/nar/gkaa1100.
    2. Kuhlman B, Bradley P. Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 2019;20:681–697. doi: 10.1038/s41580-019-0163-x.
    3. Huang P-S, Boyken SE, Baker D. The coming of age of de novo protein design. Nature. 2016;537:320–327. doi: 10.1038/nature19946.
    4. Koga N, et al. Principles for designing ideal protein structures. Nature. 2012;491:222–227. doi: 10.1038/nature11600.
    5. Cao L, et al. Design of protein-binding proteins from the target structure alone. Nature. 2022;605:551–560. doi: 10.1038/s41586-022-04654-9.
