Nat Methods. 2025 Aug;22(8):1698-1706.
doi: 10.1038/s41592-025-02741-z. Epub 2025 Aug 4.

Rational engineering of allosteric protein switches by in silico prediction of domain insertion sites

Benedict Wolf et al. Nat Methods. 2025 Aug.

Abstract

Domain insertion engineering is a powerful approach to juxtapose otherwise separate biological functions, resulting in proteins with new-to-nature activities. Prominent examples are switchable protein variants, created by receptor domain insertion into effector proteins. Identifying suitable, allosteric sites for domain insertion, however, typically requires extensive screening and optimization. We present ProDomino, a machine learning pipeline to rationalize domain recombination, trained on a semisynthetic protein sequence dataset derived from naturally occurring intradomain insertion events. ProDomino robustly identifies domain insertion sites in proteins of biotechnological relevance, which we experimentally validated in Escherichia coli and human cells. Finally, we used light- and chemically regulated receptor domains as inserts and demonstrate the rapid, model-guided creation of potent, single-component opto- and chemogenetic protein switches. These include novel CRISPR-Cas9 and -Cas12a variants for inducible genome engineering in human cells. Our work enables one-shot domain insertion engineering and substantially accelerates the design of customized allosteric proteins.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Intradomain insertions are common in natural proteins.
a, Novel proteins can arise from the insertion of one domain into another. b, Strategy for generating a large domain insertion dataset of natural proteins. c, The number of unique domain superfamily combinations is shown for a given parent (left panel) or insert domain (right panel). Only the top five most promiscuous domains are shown. d, AlphaFold2-generated structures of example proteins from the dataset with PDZ domain insertions. Insert domains are in green and parent proteins in blue. e, Length distribution of insert domains, parent domains (without the respective insert) and parent proteins in the dataset. Parent domains are defined as the annotated InterPro domains originally carrying the insert, while parent protein refers to the full-length protein without the insert. aa, amino acids. f, Distribution of relative insertion site positions within parent domains and parent proteins. g–j, The frequency of different CATH-GENE3D domain types at the ‘class’ (g) and ‘architecture’ (h, mainly alpha; i, mainly beta; j, alpha beta) levels (according to the CATH hierarchy) within the whole CATH database are compared with the corresponding distribution of the insert and parent domains in our dataset.
Fig. 2. A machine learning model to infer domain insertion sites in proteins.
a, Schematic of the machine learning pipeline for protein insertion site prediction. b–d, Boxplots showing prediction scores for true positive (green) and negative labels (other positions, unknown; blue) on a test set. The performance of different models trained with different encoding strategies (b), on different dataset splits (random; InterPro, based on domain classes; single, one representative example per class) (c) or using positional masking (d) is shown (see Supplementary Note 1 for details). e, Boxplot of insertion scores predicted by the model variant trained on the ‘single’ representative protein dataset split, grouped by secondary (Sec.) structures. The calculation is based on secondary structure predictions for the entire test set. b–e, Boxes represent the interquartile range (IQR) and the median is represented by a horizontal line. Whiskers extend to the 1.5-fold IQR or to the value of the smallest or largest predicted value. n = 1,382 protein sequences with 1,382 known positive insertion sites and 325,510 unknown sites. f,g, Exemplary predictions from the test set. The natural insertion sites are marked in green and the insert domain is colored accordingly in the protein structures. f, Phosphoglycerate kinase (PDB ID 4NG4); g, Rvb1/Rvb2 heterohexamer (RuvB-like 1, PDB ID 5OAF). h, The insertion score for the bacterial transcription factor AraC is indicated for each amino acid position by a black line. Green positions indicate experimentally validated insertion-tolerant sites. The domains and secondary structure elements of AraC are annotated. i, AUROC plot of the insertion site prediction for AraC. j, Insertion scores mapped onto the AlphaFold2-predicted AraC protein structure. In h,j, allosteric insertion sites, previously validated in experiments (I113 and S170), are indicated. AUC, area under the curve.
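The AUROC in panel i summarizes how well per-residue insertion scores rank validated insertion-tolerant positions above all other positions. The following is a minimal sketch of that calculation using the rank-based (Mann-Whitney) formulation; it is not the authors' evaluation code, and the function name `auroc`, the toy scores and the 0-based position indices are hypothetical values chosen for illustration only.

```python
def auroc(scores, positive_positions):
    """Rank-based AUROC: probability that a randomly chosen positive
    position scores higher than a randomly chosen other position
    (ties count as one half)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    positives = set(positive_positions)
    n_pos = len(positives)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one other position")

    # Mann-Whitney U statistic for the positive group.
    rank_sum = sum(ranks[p] for p in positives)
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical per-residue scores for a five-residue toy protein.
toy_scores = [0.1, 0.9, 0.2, 0.8, 0.3]
# Hypothetical validated insertion-tolerant positions (0-based indices).
toy_positives = [1, 3]
toy_auc = auroc(toy_scores, toy_positives)
```

An AUROC of 0.5 corresponds to random ranking, and 1.0 means every validated site scores above every other position. Note that the "other" positions are unlabeled rather than confirmed negatives, so such a value is best read as a lower-bound style estimate of ranking quality.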
Fig. 3. ProDomino informs the engineering of light-controlled antibiotic resistances.
a,b, The insertion scores predicted by ProDomino are mapped onto the primary sequences of PAC (a) and CAT (b). Sites selected for experimental testing by domain insertion are marked with a purple line. c,d, Insertion scores are mapped onto the crystal structures of PAC (c) and CAT (d). The selected insertion sites are indicated (PDB ID 7K0A and 1PD5). e, Scheme of light-regulated PAC/CAT function. f, Light control of puromycin resistance. HEK293T cells transfected with vectors encoding the respective PAC variants or a negative control expressing enhanced green fluorescent protein (eGFP, control) were treated with 10 µg ml−1 puromycin, starting 24 h posttransfection. Illumination (or incubation in the dark) began concurrently with the start of puromycin treatment and continued for 48 h, followed by microscopy. The experiment was independently replicated three times under similar conditions and a representative image is shown. g, Light-controlled E. coli culture growth. Bacteria were transformed with plasmids expressing the indicated CAT variant or an empty control plasmid. Liquid cultures were grown in the presence of 25 µg ml−1 chloramphenicol and exposed to blue light for 7 h or kept in the dark, followed by assessment of cell density at 600 nm. Bars indicate means, error bars the standard deviation and black dots individual data points from n = 3 independent experiments. Amino acid sequences of the symmetric linkers at the receptor domain boundaries are indicated. GS, glycine-serine linker; GPG, glycine-proline-glycine linker; WT, wild type. h, Spatial regulation of bacterial growth. E. coli expressing CAT-K136-LOV and monomeric red fluorescent protein (mRFP) were plated in top agar supplemented with 25 µg ml−1 chloramphenicol. During incubation at 37 °C, the plates were illuminated through a photomask (top) and fluorescent cells were imaged under UV light (bottom).
Fig. 4. ProDomino confidently predicts potent opto- and chemogenetic Cas9 and Cas12a variants.
a,d, Insertion scores predicted by ProDomino are mapped onto the primary sequences of Cas9 (a) or Cas12a (d). Selected high- and low-scoring sites are marked in purple and gray, respectively. b,e, The insertion scores predicted by ProDomino are mapped onto experimentally resolved structures of Cas9 (b) and Cas12a (e). Insertion sites selected for experimental validation are indicated (PDB ID 4UN3 and 6IV6). f, Zoomed-in views of the insertion sites of the two Cas12a lead candidates. c, HEK293T cells were transfected with vectors encoding (1) the indicated Cas9/VPR-LOV hybrid variant (or Cas9/VPR as control), (2) a TetO-targeting sgRNA together with a Renilla luciferase and (3) a firefly luciferase preceded by multiple TetO repeats. Samples were incubated under blue light or in the dark for 48 h, followed by luciferase assay. g–i, HEK293T cells were transfected with vectors encoding (1) the indicated Cas12a–GR2 hybrid variant (or wild-type Cas12a as control), and (2) a gRNA targeting the endogenous RUNX (g), GRIN2B (h) or VEGFA (i) locus. Samples were treated with cortisol or DMSO as indicated. At 72 h posttransfection, InDel frequencies were assessed by next-generation sequencing. c,g–i, Bars indicate means, error bars the standard deviation and black dots individual data points from n = 2 (h) or n = 3 (c,g,i) independent experiments. Cor, cortisol.
Extended Data Fig. 1. The number of training steps affects model sensitivity.
Model prediction scores for true insertion sites and other (unknown) positions are shown as box plots. Numbers above each plot indicate model training duration in steps. Boxes represent the interquartile range (IQR) and the median is represented by a horizontal line. Whiskers extend to the 1.5-fold IQR or to the value of the smallest or largest predicted value. n = 1,382 protein sequences with 1,382 known positive insertion sites and 325,510 unknown sites.
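The box plot convention stated in this caption (box spanning the IQR, median line, whiskers reaching 1.5-fold IQR or the most extreme value, whichever is closer to the box) can be written out as a short calculation. The sketch below is one reading of that convention, not the authors' plotting code; the function name `box_stats`, the choice of the "inclusive" quartile method and the toy score lists are assumptions made for illustration.

```python
import statistics


def box_stats(values):
    """Return box plot summary values for a list of prediction scores.

    Whiskers are capped at 1.5 x IQR beyond the box, or at the smallest
    or largest value if that lies closer to the box.
    """
    # Quartiles via linear interpolation over the sorted data (assumed method).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_whisker = max(min(values), q1 - 1.5 * iqr)
    upper_whisker = min(max(values), q3 + 1.5 * iqr)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
    }


# Hypothetical scores without and with an extreme value.
plain = box_stats([0, 1, 2, 3, 4, 5, 6, 7, 8])
with_outlier = box_stats([0, 1, 2, 3, 4, 5, 6, 7, 100])
```

In the first toy list both whiskers stop at the data extremes (0 and 8); in the second, the upper whisker is capped at Q3 + 1.5 × IQR = 12 rather than extending to 100. Many plotting libraries instead end whiskers at the last data point inside that limit, so rendered plots may differ slightly from this calculation.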
Extended Data Fig. 2. ProDomino correctly identifies insertion-tolerant regions in AraC.
a, The insertion score for the bacterial transcription factor AraC is shown for each amino acid position. The scores of models trained for different numbers of steps are shown in different shades of gray. Five individual subplots are presented for clarity. Green regions indicate experimentally validated insertion-tolerant sites. The two sites previously used to engineer light-regulated AraC variants, I113 and S170, are indicated in dark green. b, ROC curves based on the predictions in a are shown. The area under the curve (AUC) is given for each model variant.
Extended Data Fig. 3. Domain insertion screening of PAC and CAT confirms ProDomino predictions.
a,c, ProDomino-inferred insertion scores are mapped onto the primary sequence of PAC (a) and CAT (c). Insertion sites selected for experimental testing are marked by vertical lines and color coded as indicated. b, Assessment of insertion tolerance in PAC. HEK293T cells were transfected with vectors encoding the respective PAC variants carrying PDZ insertions after the indicated residue or a negative control expressing enhanced green fluorescent protein (eGFP). Cells were treated with 5 µg/mL puromycin and incubated for 48 hours before cell viability was assessed by MTT assay. Cells treated with different concentrations of toxic Triton X-100 served as controls for the assay itself. Bars indicate means, error bars the standard deviation, and black dots individual data points from n = 3 independent experiments. d, Assessment of the CAT insertion permissibility. Bacteria were transformed with plasmids expressing the indicated CAT variant or an empty control plasmid. Liquid cultures were grown in the presence of 25 µg/mL chloramphenicol for 7 hours and cell density was assessed by measuring OD at 600 nm. Light gray bars represent PDZ insertions after the indicated residue and dark gray bars correspond to LOV2 insertions at the same position. Bars indicate means, error bars the standard deviation, and black dots individual data points from n = 3 independent experiments.
Extended Data Fig. 4. ProDomino prediction scores for SpyCas9.
ProDomino-inferred insertion scores are mapped onto the primary sequence (a) and the structure (b) of SpyCas9. Insertion scores correspond to the 1,500-step model in Supplementary Fig. 13. Green indicates insertion sites selected for experimental validation in Fig. 4c. b, PDB ID: 4UN3.
Extended Data Fig. 5. Experimental assessment of insertion tolerance in Cas12a.
a, Insertion scores are mapped onto a cryo-electron microscopy (cryo-EM) structure of MbCas12a. PDB ID: 6IV6. b,c, HEK293T cells were transfected with vectors encoding (i) the indicated Cas12a-PDZ insertion variant, (ii) a firefly luciferase-targeting gRNA and (iii) a luciferase reporter. Samples were incubated for 48 hours, and luciferase activity was measured in a plate reader. The activity of insertion variants predicted to be active (b) or inactive (c) is shown. Bars indicate means, error bars the standard deviation, and black dots individual data points from n = 3 independent experiments. nt, non-targeting gRNA.

