Science. 2022 Jul 22;377(6604):387-394.
doi: 10.1126/science.abn2100. Epub 2022 Jul 21.

Scaffolding protein functional sites using deep learning

Jue Wang et al. Science.

Abstract

The binding and catalytic functions of proteins are generally mediated by a small number of functional residues held in place by the overall protein structure. Here, we describe deep learning approaches for scaffolding such functional sites without needing to prespecify the fold or secondary structure of the scaffold. The first approach, "constrained hallucination," optimizes sequences such that their predicted structures contain the desired functional site. The second approach, "inpainting," starts from the functional site and fills in additional sequence and structure to create a viable protein scaffold in a single forward pass through a specifically trained RoseTTAFold network. We use these two methods to design candidate immunogens, receptor traps, metalloproteins, enzymes, and protein-binding proteins and validate the designs using a combination of in silico and experimental tests.
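
The iterative idea behind constrained hallucination (propose a sequence, score it, keep improvements) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the greedy single-mutation search, the names `hallucinate` and `toy_loss`, and the hypothetical motif string "HEH". The stand-in loss only counts sequence mismatches; in the work described above the score comes from a structure predicted by a neural network, and the abstract does not specify the optimizer.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hallucinate(loss_fn, length, steps=2000, seed=0):
    """Greedy random-mutation search (an assumed, simplified optimizer).

    A single-residue mutation is kept only if the loss does not increase.
    """
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    best = loss_fn("".join(seq))
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        new = loss_fn("".join(seq))
        if new <= best:
            best = new          # accept the mutation
        else:
            seq[pos] = old      # reject and restore the previous residue
    return "".join(seq), best


def toy_loss(seq, motif="HEH", start=5):
    """Hypothetical stand-in loss: motif positions that do not match.

    A real loss would score a predicted structure, not the raw sequence.
    """
    window = seq[start:start + len(motif)]
    return sum(a != b for a, b in zip(window, motif))


designed, final_loss = hallucinate(toy_loss, length=20)
print(designed, final_loss)
```

Because mutations are only accepted when the loss does not rise, the returned loss is never worse than that of the random starting sequence.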

Conflict of interest statement

Competing interests

Authors declare that they have no competing interests.

Figures

Figure 1. Methods for protein function design
(A) Applications of functional-site scaffolding. (B-C) Design methods. (B) Constrained hallucination. At each iteration, a sequence is passed to the trRosetta or RoseTTAFold neural network, which predicts 3D coordinates and residue-residue distances and orientations (Fig. S2). These predictions are scored by a loss function that rewards certainty of the predicted structure along with motif recapitulation and other task-specific functions. (C) Missing information recovery (“inpainting”). Partial sequence and/or structural information is input into a modified RoseTTAFold network (termed RFjoint), and complete sequence and structure are output. (D) Protein design challenges formulated as missing information recovery problems. (E) Joint RoseTTAFold (RFjoint) can simultaneously recover the structure and sequence of a masked region of a protein. 2KL8 was fed into RFjoint with a continuous (length 30) window of sequence and structure masked out, with the network tasked with predicting the missing region of the protein. Outputs (inpainted region in gray) closely resemble the original protein (2KL8, left) and are confidently predicted by AlphaFold (pLDDT/motif RMSD of models shown: 91.6/0.91, 92.0/0.69, 90.4/0.82, respectively). (F-G) Motif scaffolding benchmarking data comparing RFjoint with constrained hallucination. A set of 28 de novo designed proteins, published since RoseTTAFold was trained, were used. For each protein, 20 random masks of length 30 were generated, and RFjoint and hallucination were tasked with filling in the missing sequence and structure to “scaffold” the unmasked “motif”. For this mask length, RFjoint typically modestly outperforms hallucination, both in terms of the RMSD of the unmasked protein (the “motif”) to the original structure (F) and in AlphaFold confidence (pLDDT in the replaced region) (G). Circles: average of 20 outputs for each of the benchmarking proteins. Triangle: 2KL8.
Colors in all panels: native functional motif (orange); hallucinated/inpainted scaffold (gray); constrained motif (purple); binding partner (blue); non-masked region (green); masked region (light gray, dotted lines).
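
The benchmark setup in panels F-G (20 random masks of length 30 per protein) can be sketched as follows. This is a minimal illustration under stated assumptions: the caption does not say how mask positions were sampled, so uniform sampling of the window start is assumed, and the function names, the seed, and the 100-residue placeholder sequence are hypothetical.

```python
import random


def random_masks(seq_len, mask_len=30, n_masks=20, seed=0):
    """Return n_masks half-open (start, end) windows of mask_len residues.

    Assumes window starts are drawn uniformly over all valid positions.
    """
    if mask_len > seq_len:
        raise ValueError("mask_len cannot exceed seq_len")
    rng = random.Random(seed)
    masks = []
    for _ in range(n_masks):
        start = rng.randint(0, seq_len - mask_len)  # both bounds inclusive
        masks.append((start, start + mask_len))
    return masks


def apply_mask(sequence, mask):
    """Replace the masked window with '_' and leave the unmasked 'motif' intact."""
    start, end = mask
    return "".join("_" if start <= i < end else aa
                   for i, aa in enumerate(sequence))


# Hypothetical 100-residue placeholder sequence, not a real protein.
placeholder = "A" * 100
masks = random_masks(len(placeholder))
masked = apply_mask(placeholder, masks[0])
print(masks[0], masked)
```

Each masked input would then be handed to the design method under test, which fills in the 30 missing positions while the unmasked residues serve as the motif.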
Figure 2. Design of epitope scaffolds and receptor traps.
(A) Design of proteins scaffolding immunogenic epitopes on RSV protein F (site II: PDB 3IXT chain P residues 254–277; site V: 5TPN chain A residues 163–181). Comparisons of the RF hallucinated models to AF2 structure predictions from the design sequence are in Fig. S9; here because of space constraints we show only the AF2 model; the two are very close in all cases. Here and in the following figures, we assess the extent of success in designing sequences which fold to structures harboring the desired motif through two metrics computed on the AF2 predictions: prediction confidence (AF pLDDT), and the accuracy of recapitulation of the original scaffolded motif (motif RMSD AF versus native). For RSV-F designs, these metrics are rsvf_ii_141 (85.0, 0.53 Å), rsvf_ii_158 (82.9, 0.51 Å), rsvf_ii_171 (88.4, 0.69 Å); rsvfv_hal_1 (82, 0.7 Å); rsvfv_hal_2 (88, 0.64 Å); rsvfv_hal_3 (86, 0.65 Å). (B) Design of COVID-19 receptor trap based on ACE2 interface helix (6VW1 chain A residues 24–42). Design metrics: ace2_76 (89.1, 0.55 Å); ace2_1157 (80.4, 0.47 Å); ace2_1007 (83.3, 0.57 Å). Colors: native protein scaffold (light yellow); native functional motif (orange); hallucinated scaffold (gray); hallucinated motif (purple); binding partner (blue). See Table S2 for additional metrics on each design. (C) Normalized maximum SPR signal (response units) of purified RSV-F epitope scaffolds and point mutants at various concentrations of hRSV90 antibody, with sigmoid fits. RSV-F refers to purified trimeric native F protein. KD values for each design are shown in legend. (D) Mean residue ellipticity (MRE) versus wavelength, from CD spectroscopy, for the 3 RSV-F site V hallucinations with binding activity.
Figure 3. Design of metal binding
(A) Di-iron binding site from E. coli cytochrome b1 (1BCF chain A residues 18–25, 27–54, 94–97, 123–130). Colors: native protein scaffold (light yellow); native functional motif (orange); hallucinated scaffold (gray); hallucinated motif (purple); bound metal (blue). Active site residues shown in boxes for di-iron and EF-hand respectively. (B) Absorbance spectra of dife_inp_1 (or mutant) in the presence or absence of an 8-fold molar excess of Co2+. Note the peaks at 520 nm, 555 nm and 600 nm, consistent with Co2+ binding to the desired scaffolded motif (33). The mutant design was the same sequence but with the 6 coordinating residues (sidechains shown in (A)) mutated to alanine (E16A, E55A, H58A, E89A, H92A, E115A). Protein concentration was 200 μM. (C) Titration analysis of Co2+ against the design (protein concentration = 200 μM). Quantification of the absorbance at 550 nm, using a predicted extinction coefficient of 155 for Co2+ binding the motif (33), is consistent with both binding sites being recapitulated in the dife_inp_1 design. (D) CD spectra of design in the presence and absence of Co2+. Both spectra are consistent with the predicted helical structure. (E) CD melt curve in the presence and absence of Co2+. Note that the coordination of Co2+ in the protein core significantly stabilizes dife_inp_1 (protein concentration in CD experiments = 6.7 μM, Co2+ concentration = 53.3 μM). (F) AF2 prediction of inpainted design EFhand_inp_1 scaffolding the double EF-hand motif with input motif residues in purple, input non-motif residues in green, and overlaid with the native motif from 1PRW (orange). (G) Tryptophan-enhanced terbium fluorescence spectra of EFhand_inp_1 match known spectra (57) and suggest the design can bind terbium. (H) CD spectra of EFhand_inp_1 incubated with CaCl2 (4X protein concentration) and without CaCl2 suggest stabilization of the protein upon binding calcium.
Design metrics (AF pLDDT, motif RMSD AF versus native): dife_inp_1 (92, 0.65 Å), EFhand_inp_1 (84, 0.7 Å).
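
The quantification in panel C can be read as a Beer-Lambert calculation, sketched below. Several things are assumptions rather than statements from the caption: that the extinction coefficient of 155 is in M^-1 cm^-1 and applies per bound Co2+, that the path length is 1 cm, and the function names. The absorbance value in the usage line is a hypothetical input, not a measured result.

```python
def bound_cobalt_molar(absorbance, epsilon=155.0, path_cm=1.0):
    """Beer-Lambert law, A = epsilon * c * l, solved for concentration c.

    epsilon is assumed to be in M^-1 cm^-1; path_cm is an assumed 1 cm cuvette.
    """
    return absorbance / (epsilon * path_cm)


def sites_occupied(absorbance, protein_molar=200e-6, epsilon=155.0, path_cm=1.0):
    """Bound Co2+ per protein molecule at the stated 200 uM protein concentration."""
    return bound_cobalt_molar(absorbance, epsilon, path_cm) / protein_molar


# Hypothetical absorbance reading, chosen only to illustrate the arithmetic.
print(sites_occupied(0.062))
```

Under these assumptions, an absorbance near 0.062 at saturation would correspond to about two bound Co2+ ions per protein, which is the kind of reading that would support both binding sites being occupied.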
Figure 4. In silico design of enzyme active sites.
(A-B) Hallucinations using backbone description of site using RF. (C-D) Hallucination using sidechain description of site using AF2 augmented with trRosetta (Materials and Methods). (A) Carbonic anhydrase II active site (5YUI chain A residues 62–65, 93–97, 118–120). (B) Δ5-3-ketosteroid isomerase active site (1QJG chain A residues 14, 38, 99). Colors: native protein scaffold (light yellow); native functional motif (orange); hallucinated scaffold (gray); hallucinated motif (purple); bound metal (blue). Active site residues shown for boxed designs in panel B and for carbonic anhydrase II and Δ5-3-ketosteroid isomerase, respectively. Design metrics (AF pLDDT, motif RMSD AF versus native): hcA_1 (73, 1.04 Å), hcA_2 (71, 0.62 Å), KSI_1 (84, 0.30 Å Cb), KSI_2 (72, 0.53 Å Cb).
Figure 5. Design of protein-binding proteins.
Designs containing target-binding interfaces built around native-complex-derived binding motifs. Targets are in blue, native scaffolds in yellow or pink, native motifs in orange, designed scaffolds in gray and designed motifs in purple. (A) Crystal structure of high-affinity consensus (HAC) PD-1 in complex with PD-L1. (B) Inpainted PD-L1 binder superimposed on PD-1 interface motif. (C) Max BLI binding signal versus PD-L1 concentration. (D) Crystal structure of previously designed TrkA minibinder in complex with TrkA, superimposed on TrkA receptor dimer. (E) Hallucinated bivalent TrkA binder. Protein topologies of (D-E) are shown to the right. (F) Max BLI binding signal versus TrkA concentration, showing that both binding sites bind TrkA. (G) Hallucinated Mdm2 binder designs superimposed on native p53 helix in complex with Mdm2 (see also Fig. S17D–E). New binding interactions (hallucinated residues within 5 Å of the target) are in green. Inset: Overlay of mdm2_hal_1 and native p53 helix showing key sidechains for binding.

References

    1. Khersonsky O, Wollacott AM, Jiang L, Dechancie J, Betker J, Gallaher JL, Althoff EA, Zanghellini A, Dym O, Albeck S, Houk KN, Tawfik DS, Baker D, Kemp elimination catalysts by computational enzyme design. 453 (2008), doi:10.1038/nature06879.
    2. Jiang L, Althoff EA, Clemente FR, Doyle L, Röthlisberger D, Zanghellini A, Gallaher JL, Betker JL, Tanaka F, Barbas CF, Hilvert D, Houk KN, Stoddard BL, Baker D, De Novo Computational Design of Retro-Aldol Enzymes. Science. 319, 1387–1391 (2008).
    3. Siegel JB, Zanghellini A, Lovick HM, Kiss G, Lambert AR, St. Clair JL, Gallaher J, Hilvert D, Gelb MH, Stoddard BL, Houk KN, Michael FE, Baker D, Computational Design of an Enzyme Catalyst for a Stereoselective Bimolecular Diels-Alder Reaction. Science. 329 (2010), doi:10.1126/science.1190239.
    4. Cao L, Coventry B, Goreshnik I, Huang B, Park JS, Jude KM, Marković I, Kadam RU, Verschueren KHG, Verstraete K, Walsh STR, Bennett N, Phal A, Yang A, Kozodoy L, DeWitt M, Picton L, Miller L, Strauch E-M, DeBouver ND, Pires A, Bera AK, Halabiya S, Hammerson B, Yang W, Bernard S, Stewart L, Wilson IA, Ruohola-Baker H, Schlessinger J, Lee S, Savvides SN, Garcia KC, Baker D, Design of protein binding proteins from target structure alone. Nature (2022), doi:10.1038/s41586-022-04654-9.
    5. Chevalier AA, Silva D, Rocklin GJ, Derrick R, Vergara R, Murapa P, Bernard SM, Zhang L, Yao G, Bahl CD, Miyashita S, Goreshnik I, James T, Bryan M, Fernández-velasco DA, Stewart L, Dong M, Huang X, Massively parallel de novo protein design for targeted therapeutics. Nat. Publ. Group (2017), doi:10.1038/nature23912.