Nat Biotechnol. 2023 Aug;41(8):1151-1159.
doi: 10.1038/s41587-022-01613-7. Epub 2023 Jan 16.

Predicting prime editing efficiency and product purity by deep learning

Nicolas Mathis et al. Nat Biotechnol. 2023 Aug.

Abstract

Prime editing is a versatile genome editing tool but requires experimental optimization of the prime editing guide RNA (pegRNA) to achieve high editing efficiency. Here we conducted a high-throughput screen to analyze prime editing outcomes of 92,423 pegRNAs on a highly diverse set of 13,349 human pathogenic mutations that include base substitutions, insertions and deletions. Based on this dataset, we identified sequence context features that influence prime editing and trained PRIDICT (prime editing guide prediction), an attention-based bidirectional recurrent neural network. PRIDICT reliably predicts editing rates for all small-sized genetic changes with a Spearman's R of 0.85 and 0.78 for intended and unintended edits, respectively. We validated PRIDICT on endogenous editing sites as well as an external dataset and showed that pegRNAs with high (>70) versus low (<70) PRIDICT scores showed substantially increased prime editing efficiencies in different cell types in vitro (12-fold) and in hepatocytes in vivo (tenfold), highlighting the value of PRIDICT for basic and for translational research applications.
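The two summary statistics quoted in the abstract, a Spearman rank correlation between predicted and measured editing and a fold change between pegRNAs scoring above and below 70, can be illustrated with a minimal sketch. The function names and the toy numbers in the usage note are hypothetical and not taken from the study; pegRNAs scoring exactly 70 are left out because the abstract does not say how they were treated.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def fold_change_by_threshold(scores, efficiencies, threshold=70):
    """Mean efficiency of high-scoring pegRNAs divided by that of low-scoring ones."""
    high = [e for s, e in zip(scores, efficiencies) if s > threshold]
    low = [e for s, e in zip(scores, efficiencies) if s < threshold]
    return mean(high) / mean(low)
```

For example, `fold_change_by_threshold([80, 90, 50, 60], [40, 60, 5, 15])` divides a mean of 50 by a mean of 10.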

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Extended Data Fig. 1. Self-targeting screen characteristics.
(a) Visualization of library design (library 1) and numbers before and after filtering results. (b) Distribution of edit positions for single base replacement edits in library 1. (c) Distribution of edit positions for insertion edits in library 1. (d) Distribution of edit positions for deletion edits in library 1. (e) Distribution of insertion lengths in library 1. (f) Distribution of deletion lengths in library 1. (g) Distribution of edit types in library 1 (number of design variants and percentage of the total library). (h-i) Editing rates of a test self-targeting locus with a forward (Fw) or reverse (Rv) orientation of the target sequence, either at the plasmid level or integrated by lentiviral transduction in HEK293T cells. Data points for bars (from left) 2, 3 and 5, 6 correspond to 2 technical replicates (simultaneous transfection of two separate wells). Only one data point was used for the plasmid controls (bars 1 and 4). (h) pegRNA with TAG to TGG edit. (i) pegRNA with TAG to TAC edit. The observed editing in the forward direction in the absence of PE2 could be caused by lentiviral reshuffling or ADAR-mediated A to I (G) RNA editing. The latter could occur during lentiviral packaging in HEK293T cells: HEK293T cells endogenously express ADAR, and the target site is present as RNA on the lentiviral vector and targeted by the complementary pegRNA with a mismatch, providing an ideal template for ADAR-dependent RNA editing. The observation that primarily TAG to TGG (but not TAG to TAC) showed background editing is in line with this hypothesis, as previous studies showed ADAR preference for UAG sequences.
Extended Data Fig. 2. Additional validation of the DeepPE model.
(a) Predicted (PRIDICT) and measured intended editing efficiency for G to C edits at position 5 of RTT in the dataset from this study. Data from all 5 test sets (five-fold cross-validation) were combined for this visualization. n = 540. (b) Evaluation of PRIDICT AttnBiRNN (trained on the dataset from this study) by testing on pegRNAs from Kim et al. 2021 HT dataset (only G to C at Position 5). n = 4,457. (c) Evaluation of DeepPE model (original, trained on Kim et al. 2021 HT dataset) by testing on the dataset from this study (only G to C at Position 5). n = 540. (d,e) SHAP analysis of XGBoost models trained and tested (d) on DeepPE dataset (n = 43,149) or (e) on G-to-C Position 5 edits from library 1. Feature descriptions are listed in Supplementary Table 1. (f) Editing efficiency with different RTT overhang lengths (5, 7, 10, 15 bp) in DeepPE (Kim et al.) dataset. n for each bar (left to right) = 10,746, 10,828, 10,921, 10,654. (g) Editing efficiency with different RTT overhang lengths (3, 7, 10, 15 bp) in G to C Pos. 5 edits of library 1 for a direct comparison to identical edits in the DeepPE dataset. n for each bar (left to right) = 135, 135, 137, 133. (f,g) Error bars = mean +/- SD. (h,i) Evaluation of DeepPE model (n = 18) on 18/45 endogenous edits from this study in HEK293T (h) and K562 (i).
Extended Data Fig. 3. Additional validation of the Easy-Prime PE2 model.
(a) Edit type count distribution in the original Easy-Prime test dataset. (b) Evaluation of Easy-Prime PE2 model by testing this XGBoost model on the original Easy-Prime test dataset, filtered against 1bp edits at position 5 of the RTT to eliminate the bias towards this edit type. n = 585. (c-g) Evaluation of Easy-Prime PE2 by testing the model on datasets generated in this study. (c) Library 1 in HEK293T, n = 92,423. (d) Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in HEK293T, n = 915. (e) Library 2 (editing with PE2 and pegRNAs without tevopreQ1) in K562, n = 876. (f,g) Endogenous loci from Fig. 4a,b in HEK293T (f) and K562 (g), n = 45. (h) Intended editing efficiency rank of the best-predicted pegRNA for each pathogenic locus in library 1 (PRIDICT and Easy-Prime). Pathogenic loci with multiple pegRNAs on rank 1 (identical efficiency) and loci with less than 3 pegRNAs were excluded from this analysis. Predictions from PRIDICT were taken from 5 different cross-validations to ensure none of the predictions are included in the training set. n = 12,189. (i) Intended editing efficiency rank of the best-predicted pegRNA for each endogenous locus (PRIDICT and Easy-Prime). n = 15.
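The ranking analysis in panels (h) and (i) can be sketched as follows: for each locus, take the pegRNA with the highest predicted score and report where it falls when all pegRNAs for that locus are ordered by measured efficiency. This is a hypothetical illustration, not the authors' code. The exclusion rules follow the caption (fewer than 3 pegRNAs, or several pegRNAs tied at rank 1); how ties in the predicted score were handled is not stated, so the sketch simply takes the first maximum.

```python
def best_predicted_rank(pegrnas, min_pegrnas=3):
    """Measured-efficiency rank (1 = best) of the top-predicted pegRNA.

    pegrnas: list of (predicted_score, measured_efficiency) for one locus.
    Returns None when the locus is excluded from the analysis.
    """
    if len(pegrnas) < min_pegrnas:
        return None  # too few pegRNAs for a meaningful rank
    measured = [m for _, m in pegrnas]
    if measured.count(max(measured)) > 1:
        return None  # several pegRNAs share rank 1 (identical efficiency)
    # Assumption: on tied predictions, the first pegRNA listed is picked.
    _, picked = max(pegrnas, key=lambda p: p[0])
    # Rank = 1 + number of pegRNAs that were measured as more efficient.
    return 1 + sum(m > picked for m in measured)
```

A rank of 1 means the model's top pick was also the most efficient pegRNA measured at that locus.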
Extended Data Fig. 4. Additional library 2 evaluation with PEmax.
(a) Mean editing efficiencies of each replicate, including all pegRNAs in library 2 with different experimental conditions in U2OS and K562 cells. Error bars indicate the mean +/- SD of three biologically independent replicates. n = 3. Mean editing of library 2 for each of the 3 replicates is based on the following number of pegRNAs for each data point (bars left to right) = 916, 922, 917, 924, 879, 869, 877, 866. Note that absolute levels of editing efficiency for PEmax cannot be directly compared to PE2 in this study due to the use of different selection agents (Blasticidin for PEmax screens compared to Zeocin for PE2 screens). Previous studies showed that in identical setups, PEmax surpasses the performance of PE2. (b) Spearman correlation for PEmax editing efficiencies in library 2 between different experimental conditions (MLH1dn, tevopreQ1) and cell lines (K562, U2OS). (c) Editing efficiency rank correlations (Spearman) in library 2 between editing performed with PE2 vs. editing performed with PEmax.
Fig. 1. High-throughput screen for determinants of prime-editing efficiency.
(a) pegRNA visualization with different pegRNA domains. Example edit is depicted in red (A:T) at position 5 of the RTT domain, leading to a G to T base change in the target DNA. (b) Self-targeting construct with the promoter (hU6) and different pegRNA domains, target sequence, and primer location for NGS-PCR (Fw and Rv primer). (c) Visualization of the workflow of the self-targeting screen in HEK293T cells. (d) Effect of the maximum number of consecutive 'T' in spacer sequence or pegRNA extension on editing efficiency. (e-f) Comparison of edits with different insertion- and deletion-length on editing efficiency. (g) Effect of replacement bases in single base replacement edits on editing efficiency. (h) Heatmap visualizing editing efficiency of pegRNAs (single base replacements) with different RTT overhang lengths (3, 7, 10, 15 bp) and edit positions (1-15); PAM position highlighted with black rectangle. (i) Heatmap visualizing unintended editing rate of pegRNAs (single base replacements) with different RTT overhang lengths (3, 7, 10, 15 bp) and edit positions (1-15); PAM position highlighted with black rectangle. The number of analyzed pegRNA-target combinations are as follows: (d) n = 12,005, 42,611, 24,574, 9,223, 2,918, 1,092. (e) n = 21,882, 2,934, 1,103, 2,501. (f) n = 5,457, 459, 115, 52. (g) n = 4,482, 5,892, 6,576, 3,622, 5,252, 3,768, 4,868, 4,644, 4,864, 5,767, 4,832, 3,353. (h-i) n = 57,920. Boxplots in (d-g) represent the 25th, 50th, and 75th percentiles. Whiskers indicate 5 and 95 percentiles.
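The boxplot convention described in this caption (box at the 25th, 50th, and 75th percentiles, whiskers at the 5th and 95th) differs from the common 1.5 × IQR whisker rule, so a short sketch may help. The helper names are hypothetical, and linear interpolation between neighbouring values is an assumption; the caption does not state which percentile method was used.

```python
def percentile(sorted_vals, p):
    """p-th percentile of pre-sorted data, with linear interpolation."""
    pos = p * (len(sorted_vals) - 1) / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def box_stats(values):
    """Five numbers needed to draw one box in the style described above."""
    s = sorted(values)
    return {
        "whisker_low": percentile(s, 5),
        "q1": percentile(s, 25),
        "median": percentile(s, 50),
        "q3": percentile(s, 75),
        "whisker_high": percentile(s, 95),
    }
```

With this convention, roughly 10% of data points lie beyond the whiskers by construction.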
Fig. 2. Prediction of pegRNA editing rates by an attention-based bi-directional recurrent neural network.
(a) Illustration of the attention-based bi-directional recurrent neural network. Rectangles with circles represent vectors. (b) Comparison of ML model performances on editing efficiency prediction (mean of five-fold cross-validation). (c) Comparison of ML model performances on unintended editing rate prediction (five-fold cross-validation). (b,c) Each of the 5 cross-validations is visualized as an individual data point (n = 5). Error bars = mean +/- SD. (d) Evaluation of PRIDICT-AttnBiRNN model by comparing measured editing efficiency of one cross-validation data set with predicted efficiency. (e) Evaluation of PRIDICT-AttnBiRNN model by comparing measured unintended editing rate of one cross-validation data set with predicted unintended editing efficiency. (d-e) The black line corresponds to the least-squares polynomial fit. n = 18,485.
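The n = 18,485 reported for panels (d-e) is consistent with one test fold of a five-fold split over the 92,423 pegRNAs in library 1. The sketch below shows only that arithmetic and assumes an even split by pegRNA count. Whether the authors grouped pegRNAs by target locus when assigning folds is not stated here, and `fold_sizes` is a hypothetical helper.

```python
def fold_sizes(n_items, k=5):
    """Sizes of k near-equal folds; the first (n_items % k) folds get one extra item."""
    base, extra = divmod(n_items, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


# 92,423 pegRNAs split five ways: three folds of 18,485 and two of 18,484.
sizes = fold_sizes(92423, k=5)
```

Each pegRNA appears in exactly one test fold, so predictions pooled across the five folds cover the whole library without overlapping the corresponding training data.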
Fig. 3. Feature importance overview for editing prediction.
(a) SHAP analysis on the test dataset of the XGBoost model visualizes the top 10 features influencing the outcome of the editing prediction. A high SHAP value is associated with a higher editing prediction. Feature values correspond to the values of each individual feature. A detailed list of all features is provided in Supplementary Table 1. (b) Analysis of the importance of individual sequence positions in the PRIDICT AttnBiRNN model. A high IG score is associated with a high impact on editing prediction. (c) Average positive/negative IG contribution of individual positions. Large bases have a stronger effect on prediction.
Fig. 4. Validation of PRIDICT on endogenous loci and external datasets.
(a,b) Editing on endogenous loci with 45 pegRNAs, targeting 15 loci (3 pegRNAs per locus). (a) Endogenous editing in HEK293T cells compared to PRIDICT score. (b) Endogenous editing in K562 cells compared to PRIDICT score. (c) PRIDICT prediction performance on unintended edits. (d) Comparison of pegRNAs (mean of 3 biological replicates) with PRIDICT score lower or higher than 70 in self-targeting library 1 test-dataset (n = 15,119 (<70), n = 2,059 (>70)), on endogenous edits in HEK293T/K562 (n = 30 (<70), n = 15 (>70)) and the Anzalone et al. dataset (n = 112 (<70), n = 69 (>70)). Mann-Whitney U rank test (two-sided): p < 0.001 (<1×10^-16, 2.3×10^-6, 5.6×10^-6, 1.2×10^-16 (from left to right)). Boxplots represent the 25th, 50th, and 75th percentiles. Whiskers indicate 5 and 95 percentiles. Individual data points are illustrated for endogenous experiments as grey dots. (e) Performance of the different prediction models on diverse edit types. The Spearman correlation for PRIDICT was calculated from data plotted in Fig. 2b (PRIDICT on test dataset of library 1). The Spearman correlations for the Position and Type models were retrieved from Kim et al. (Position model on the Position-test data and Type model on the Type-test data). The Spearman correlation of Easy-Prime PE2 was calculated by analyzing the model on the Easy-Prime test dataset filtered against edits at position 5 of the RTT to eliminate the strong bias towards this edit type in their dataset (see Extended Data Fig. 3).
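The group comparison in panel (d) uses a two-sided Mann-Whitney U rank test. A minimal sketch of that test is given below, using the normal approximation with a tie correction and no continuity correction. Which variant or software the authors used is not stated, so p-values from this sketch may differ slightly from the reported ones, particularly for small groups. The function name is hypothetical.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, p-value). U counts the pairs (a, b) with
    a from x and b from y where a > b; tied pairs count as one half.
    """
    n1, n2 = len(x), len(y)
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of identical values.
    tie_term = sum(t ** 3 - t for t in Counter(list(x) + list(y)).values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u1, 1.0  # all observations identical: no evidence of a shift
    z = (u1 - n1 * n2 / 2) / sigma
    # Two-sided p-value from the standard normal distribution.
    return u1, math.erfc(abs(z) / math.sqrt(2))
```

In this setting, `x` would hold the editing efficiencies of pegRNAs with a PRIDICT score above 70 and `y` those of pegRNAs scoring below 70.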
Fig. 5. Evaluation of MLH1dn and tevopreQ1 effect on PE2 editing efficiency and PRIDICT performance in library 2.
(a) Mean editing efficiencies of each replicate, including all pegRNAs in library 2 with different experimental conditions. Error bars indicate the mean +/- SD of three biologically independent replicates. n = 3. Mean editing of the library for each of the 3 replicates is based on the following number of pegRNAs for each data point (bars left to right) = 927, 925, 925, 926, 887, 887, 918, 926, 875, 866, 874, 866, 880, 928. (b) Spearman correlation between different experimental conditions and cell lines (HEK293T, K562, and U2OS). (c) Comparison of the editing efficiency in the mouse liver to other experimental conditions and the PRIDICT score. (d) Comparison of pegRNAs (mean of 3 biological replicates) with PRIDICT score lower or higher than 70 in the library 2 test-dataset (unmodified pegRNAs n(PRIDICT score < 70) = 771, n(PRIDICT score > 70) = 98, tevopreQ1 pegRNAs n(PRIDICT score < 70) = 803, n(PRIDICT score > 70) = 108). Mann-Whitney U rank test (two-sided): p = 3.7×10^-11 and 2.7×10^-13. Boxplots represent the 25th, 50th, and 75th percentiles. Whiskers indicate 5 and 95 percentiles. (e) Ratio of mean editing efficiency of pegRNAs in library 2 if the base after the edit is an “A” vs. a “T”, “C”, or “G”. (f) Ratio of mean editing efficiency of pegRNAs in library 2 if the edited base is a “G” or “C” vs. an “A” or “T”.

References

    1. Anzalone AV, et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature. 2019;576:149–157.
    2. Hsu JY, et al. PrimeDesign software for rapid and simplified design of prime editing guide RNAs. Nat Commun. 2021;12:1034.
    3. Hwang G-H, et al. PE-Designer and PE-Analyzer: web-based design and analysis tools for CRISPR prime editing. Nucleic Acids Res. 2021;49:W499–W504.
    4. Kim HK, et al. Predicting the efficiency of prime editing guide RNAs in human cells. Nat Biotechnol. 2021;39:198–206.
    5. Li Y, Chen J, Tsai SQ, Cheng Y. Easy-Prime: a machine learning–based prime editor design tool. Genome Biol. 2021;22:235.
