Nat Biotechnol. 2025 May;43(5):712-719.
doi: 10.1038/s41587-024-02268-2. Epub 2024 Jun 21.

Machine learning prediction of prime editing efficiency across diverse chromatin contexts

Nicolas Mathis et al. Nat Biotechnol. 2025 May.

Abstract

The success of prime editing depends on the prime editing guide RNA (pegRNA) design and target locus. Here, we developed machine learning models that reliably predict prime editing efficiency. PRIDICT2.0 assesses the performance of pegRNAs for all edit types up to 15 bp in length in mismatch repair-deficient and mismatch repair-proficient cell lines and in vivo in primary cells. With ePRIDICT, we further developed a model that quantifies how local chromatin environments impact prime editing rates.

Conflict of interest statement

Competing interests: G.S. is a scientific advisor to Prime Medicine. The other authors declare no competing interests.

Figures

Extended Data Fig. 1. Library-Diverse characteristics.
(a) Edit type distribution in Library 1 from Mathis et al. 2023, which had a focus on 1bp replacements and short insertions and deletions. (b) Edit type distribution in 'Library-Diverse' screened in this study. (c) Self-targeting construct with the promoter (hU6) and different pegRNA domains (spacer, scaffold, reverse transcription template/RTT, primer binding sequence/PBS, tevopreQ1 motif, and poly T stop signal), target sequence (including position of Protospacer), and primer location for NGS-PCR (forward (Fw) and reverse (Rv) primer). (d-g) Correlation of background-subtracted individual replicates of 'Library-Diverse' prime editing screens in (d) HEK293T (n = 22,619), (e) K562 (n = 22,752), (f) K562 cells with MMR inhibition through MLH1dn expression (n = 20,477), and (g) in vivo (mouse liver hepatocytes; n = 17,775). Color gradient from dark purple to yellow indicates increasing point density, per Gaussian KDE.
Extended Data Fig. 2. Machine learning metrics for training models on 'Library-Diverse'.
(a-d) Comparison of 7 different machine learning model performances on editing efficiency prediction in HEK293T (Spearman (a), Pearson (b)) and K562 (Spearman (c), Pearson (d)). Bars show the mean of fivefold cross-validation, and each of the five cross-validations is visualized as individual data points (n=5). Error bars indicate the mean +/- s.d. (e) Prediction Pearson correlations of PRIDICT and PRIDICT2.0, tested on different edit types and cell types. n for rows from top to bottom: 5,957, 4,455, 6,283, 5,924, 5,969, 4,508, 6,302, 5,973. (f) Prediction Pearson correlations of PRIDICT and PRIDICT2.0, tested on insertions and deletions of different lengths in HEK293T and K562 cells. n for different edit lengths combined are as follows: HEK293T insertions: 6,283, HEK293T deletions: 5,924, K562 insertions: 6,302, K562 deletions: 5,973. (g) Prediction Spearman correlations of PRIDICT, the updated attention-based bi-directional RNN architecture trained on Library 1 only (see Fig. 1n), or on Library 1 and Library-ClinVar, and PRIDICT2.0 (includes fine-tuning on Library-Diverse); tested on different edit and cell types. n for rows from top to bottom: 22,619, 5,957, 4,455, 6,283, 5,924, 22,752, 5,969, 4,508, 6,302, 5,973.
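
The per-fold correlation summary described in this caption can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' evaluation code: the function names and the input layout (one pair of observed and predicted efficiencies per held-out fold) are assumptions, and the sample standard deviation is assumed for the s.d.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def summarize_folds(folds):
    """Summarize correlations over folds.

    folds: list of (observed, predicted) pairs, one per held-out fold
    (hypothetical input layout). Returns per-fold values, mean, and sample s.d.
    """
    summary = {}
    for name, metric in (("spearman", spearman), ("pearson", pearson)):
        per_fold = [metric(obs, pred) for obs, pred in folds]
        summary[name] = {
            "per_fold": per_fold,
            "mean": mean(per_fold),
            "sd": stdev(per_fold),
        }
    return summary
```

Passing five (observed, predicted) pairs to `summarize_folds` yields the five individual data points plus the mean and s.d. that a bar with an error bar would display.
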
Extended Data Fig. 3. Editing characteristics in K562 with MMR inhibition (MLH1dn) and in vivo (mouse liver).
(a-f) Editing efficiencies of different edit/pegRNA features in K562 with MMR inhibition (MLH1dn). (g-l) Editing efficiencies of different edit/pegRNA features in the in vivo (mouse liver) setting. Editing efficiency for different edit lengths of (a,g) insertions and (b,h) deletions. (c,i) Heatmap visualizing editing efficiency of pegRNAs (1bp replacements) in Library-Diverse with different RTT overhang lengths (3, 7, 10, 15 bp) and edit positions (1–15). PAM position is highlighted with black dotted rectangle. (d,j) Editing efficiency of replacements with edit lengths of 1 to 5 bp. (e,k) Editing efficiency of single and double 1 bp replacements with or without editing of at least 1 base within the GG PAM sequence. (f,l) Editing efficiencies for double edits where 2 separated 1 bp replacements were installed. Intended editing means that both replacements were installed, whereas intermediate editing means that only 1 of the 2 replacements was installed. Distance of 0 corresponds to single 1 bp edits. (a-l) Bars show mean with error bar indicating mean +/- s.e.m. The numbers of analyzed pegRNA-target combinations are as follows. a, n = 371, 492, 504, 409, 412, 382, 406, 405, 405, 397, 323, 340, 333, 295, 320. b, n = 384, 444, 365, 345, 370, 385, 347, 379, 379, 369, 354, 339, 318, 314, 320. c, n = 2,815. d, n = 4,520, 764, 588, 497, 526. e, n = 4,092, 428, 2,528, 761. f, n = 4,520, 764, 130, 65, 151, 117, 25, 119, 133, 25, 149. g, n = 315, 415, 439, 351, 338, 321, 339, 356, 340, 337, 277, 303, 286, 243, 277. h, n = 317, 367, 294, 275, 288, 308, 289, 324, 304, 302, 273, 285, 255, 259, 249. i, n = 2,362. j, n = 4,265, 653, 499, 408, 459. k, n = 3,889, 376, 2,289, 663. l, n = 4,265, 653, 134, 59, 155, 119, 24, 119, 138, 26, 159.
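
The "mean +/- s.e.m." summary used for the bars can be sketched as below. The example data and the grouping by insertion length are hypothetical, and the s.e.m. is assumed to be the sample standard deviation divided by the square root of n.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(values):
    """Return (mean, s.e.m.), where s.e.m. = sample s.d. / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the s.e.m.")
    return mean(values), stdev(values) / sqrt(n)


# Hypothetical editing efficiencies (%) grouped by insertion length (bp).
efficiency_by_length = {
    1: [12.0, 15.0, 18.0, 15.0],
    2: [8.0, 10.0, 12.0, 10.0],
}

for length, values in efficiency_by_length.items():
    m, sem = mean_and_sem(values)
    print(f"{length} bp insertion: {m:.1f} +/- {sem:.2f} (n = {len(values)})")
```
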
Extended Data Fig. 4. TRIP screen characteristics for different edit modalities.
(a-d) Distribution of editing efficiency across different TRIP reporter integrations for (a) PE, (b) ABE8e, (c) BE4max, and (d) SpCas9 genome editing. Dotted vertical line indicates mean editing efficiency. (e-h) Correlation (Pearson) of individual TRIP screening replicates for PE (n = 1,182) (e), ABE8e (n = 1,169) (f), BE4max (n = 1,194) (g), and SpCas9 (n = 1,196) (h) genome editing. (i,j) Correlation of replicate means between different edit modalities: Spearman (i), Pearson (j). Only barcode integrations available from all editors are used for analysis (n = 1,165).
Extended Data Fig. 5. Additional analysis of TRIP screens and predictive modeling of editing rates.
(a-c) UMAP projection based on chromatin characteristics of genomic locations in the TRIP library (n=1,165; corresponding to integrations with mappings to all editors), with editing efficiency overlay of (a) ABE8e, (b) BE4max, and (c) Cas9. (d-g) Visualization of chromatin characteristics of clusters defined in Fig. 3i. For each target/dataset type, we selected the averaging window with the largest deviation from the library mean. The relative difference to the library mean, calculated as the absolute difference between the cluster average and the library mean divided by the library mean, is shown. (h) Evaluation of the ePRIDICT-light XGBoost model trained on a subset of 6 features. Predictions from 5 different cross-validation runs were combined (n=1,182). (i) Spearman and Pearson correlation of XGBoost model prediction to editing efficiencies in TRIP library for prime editing, adenine base editing (ABE8e), cytosine base editing (BE4max), and SpCas9 genome editing. Bars show the mean of fivefold cross-validation, and each of the five cross-validations is visualized as individual data points (n=5). Error bars indicate the mean +/- s.d. (j) Validation of ePRIDICT on an independent dataset from Li et al., where one sequence was integrated and edited (CTT insertion) at 4,144 genomic locations. (h,j) Color gradient from dark purple to yellow indicates increasing point density, per Gaussian KDE.
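
The relative difference defined for panels d-g, and the choice of the averaging window with the largest deviation, can be sketched as follows. The function names and the dictionary layout (window size in bp mapped to a list of feature values) are assumptions for illustration, and feature values are assumed to be non-negative signal averages.

```python
from statistics import mean


def relative_difference(cluster_values, library_values):
    """|cluster mean - library mean| / library mean, as the caption defines it."""
    library_mean = mean(library_values)
    if library_mean == 0:
        raise ValueError("library mean is zero; relative difference is undefined")
    return abs(mean(cluster_values) - library_mean) / library_mean


def most_deviating_window(cluster_by_window, library_by_window):
    """Return (window_bp, relative_difference) for the window that deviates most.

    Both arguments map an averaging-window size (bp) to the feature values
    of the cluster members and of the whole library, respectively
    (hypothetical layout).
    """
    scores = {
        window: relative_difference(cluster_by_window[window], library_by_window[window])
        for window in cluster_by_window
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

For example, a cluster whose 100 bp average is 3 against a library mean of 2 has a relative difference of 0.5, so the 100 bp window would be chosen over a window where the difference is only 0.25.
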
Extended Data Fig. 6. Validation of ePRIDICT at endogenous loci in K562, HEK293T and HepG2 cells.
(a) Spearman correlation analysis of ENCODE feature values for 19 selected endogenous loci, comparing datasets from K562 and HEK293T cells. (b) Validation of prime editing efficiency in HEK293T cells on endogenous loci with high (>50) or low (<35) ePRIDICT scores, normalized to editing on the reporter sequence. 1bp replacements (n-high: 9, n-low: 10), 4bp insertions (n-high: 9, n-low: 9), and 4bp deletions (n-high: 9, n-low: 10). (c-d) Validation of genome editing efficiency on endogenous loci normalized to editing on the integrated reporter in K562 (c) and HEK293T (d) with high (>50, n = 8 (K562) and 9 (HEK293T)) and low (<35, n = 9 (K562) and 10 (HEK293T)) ePRIDICT values for ABE8e, BE4max, and Cas9. (e) Spearman correlation analysis of ENCODE feature values for 19 selected endogenous loci, comparing datasets from K562 and HepG2 cells. (f) Validation in HepG2 cells as described in b. 1bp replacements (n-high: 9, n-low: 10), 4bp insertions (n-high: 8, n-low: 8), and 4bp deletions (n-high: 9, n-low: 10). (g, h) Binning editing efficiency and predicted score from Fig. 3n,o into 3 categories each. Editing efficiency is binned into “Low” (n=92), “Middle” (n=27), and “High” (n=27) categories based on the cutoffs <5%, 5-20%, and >20%. The prediction score is binned into three even-sized tertiles. (g) PRIDICT2.0 K562 value as prediction score. (h) Combined PRIDICT2.0 K562 and ePRIDICT value (average of both scores) as prediction score. (i,j) Performance of PRIDICT2.0 HEK293T (i) or PRIDICT2.0 HEK293T in combination with ePRIDICT (j) in predicting the editing efficiency of 56 pegRNAs targeting endogenous loci in HEK293T. (k) Additional visualization of the performance of PRIDICT2.0 HEK293T alone or in combination with ePRIDICT on 56 pegRNAs targeting endogenous loci in HEK293T, including highly and poorly accessible loci. (l-n) Performance of PRIDICT2.0 K562 or PRIDICT2.0 K562 in combination with ePRIDICT in HepG2 (54 pegRNAs), as described for i-k. 
(b-d, f) Boxplots represent the 25th, 50th and 75th percentiles. Whiskers extend to points within 1.5 times the interquartile range from the quartiles.
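
The binning described for panels g and h can be sketched as below. This is an illustrative reading of the caption, not the authors' code: the function names are hypothetical, values exactly at 5% or 20% are assumed to fall in the "Middle" category, and tertiles are assumed to be rank-based so that the three bins are as even-sized as possible.

```python
def efficiency_category(efficiency_percent):
    """Low (<5%), Middle (5-20%), High (>20%); boundary handling is an assumption."""
    if efficiency_percent < 5:
        return "Low"
    if efficiency_percent <= 20:
        return "Middle"
    return "High"


def combined_score(pridict2_score, epridict_score):
    """Average of the PRIDICT2.0 and ePRIDICT prediction scores."""
    return (pridict2_score + epridict_score) / 2


def tertile_labels(scores):
    """Assign each score to a rank-based tertile: 0 (lowest), 1, or 2 (highest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, index in enumerate(order):
        # Integer arithmetic splits the ranks into three near-equal groups.
        labels[index] = rank * 3 // len(scores)
    return labels


# Hypothetical scores for six pegRNAs.
pridict2 = [62.0, 30.0, 45.0, 80.0, 20.0, 55.0]
epridict = [40.0, 70.0, 35.0, 60.0, 20.0, 65.0]
combined = [combined_score(p, e) for p, e in zip(pridict2, epridict)]
print(tertile_labels(combined))
```

Cross-tabulating `efficiency_category` against `tertile_labels` for the same pegRNAs gives the kind of 3 x 3 comparison the caption describes.
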
Fig. 1. Characterization and prediction of pegRNA efficiencies based on sequence context.
(a) Schematic overview of the screen with the target-matched pegRNA library 'Library Diverse'. (b-g) Editing efficiency for (b,c) insertions, for (d,e) 1-5bp replacements, and (f,g) 1-15 bp deletions in HEK293T and K562, respectively. (h,i) Editing efficiencies in HEK293T (h) or K562 (i) cells for double edits where 2 separated 1 bp replacements were installed. Intended editing means that both replacements were installed, whereas intermediate editing means that only 1 of the 2 replacements was installed. Distance of 0 corresponds to single 1 bp edits. (j,k) Editing efficiency of single and double 1 bp replacements with or without editing of at least 1 base within the GG PAM sequence in HEK293T (j) and K562 (k) cells. (d,e,h-k) Bars include only pegRNAs with 7, 10, or 15 bp RTT overhang to ensure similar RTT overhang distributions between conditions. (b-k) Bars show mean with error bar indicating mean +/- s.e.m. (l,m) Heatmap visualizing editing efficiency of pegRNAs (single base replacements) in Library-Diverse with different RTT overhang lengths and edit positions in HEK293T (l; n = 3,079) and K562 (m, n = 3,091). PAM position is highlighted with black dotted rectangle. (n) Schematic illustration of PRIDICT2.0, which is an ensemble model based on the prediction average of two models: (Model A), base trained on Library 1 and fine-tuned on Library-Diverse (HEK293T and K562), and (Model B), base trained on Library 1 and Library-ClinVar and again fine-tuned on Library-Diverse. The number of pegRNAs in each dataset is indicated above or below the datasets. (o,p) Performance of PRIDICT2.0 on Library-Diverse (5-fold cross-validation) for (o) HEK293T (n = 22,619) and (p) K562 (n = 22,752) cells. Color gradient from dark purple to yellow indicates increasing point density, per Gaussian KDE. The black line corresponds to the least-squares polynomial fit. 
(q) Spearman correlations of editing efficiency in Library-Diverse in different contexts (HEK293T, K562, K562-MLH1dn, and in vivo mouse liver), and with PRIDICT2.0 HEK293T and K562 prediction scores. (r,s) SHAP analysis on a Library-Diverse test dataset of an XGBoost model (top fifteen features). A high SHAP value is associated with a higher predicted editing efficiency. Feature values correspond to the values of each individual feature. A detailed list of all features is provided in Supplementary Table 1. The numbers of analyzed pegRNA-target combinations are as follows. b, n = 408, 519, 541, 446, 447, 416, 433, 446, 432, 424, 361, 373, 359, 332, 346. c, n = 410, 525, 546, 445, 448, 414, 434, 444, 434, 426, 364, 375, 360, 331, 346. d, n = 5,146, 834, 638, 538, 576. e, n = 5,158, 844, 644, 554, 575. f, n = 424, 475, 388, 367, 406, 420, 380, 418, 405, 408, 391, 375, 365, 340, 362. g, n = 420, 476, 394, 373, 409, 420, 385, 416, 411, 409, 391, 385, 370, 348, 366. h, n = 5,146, 834, 153, 71, 191, 136, 27, 141, 155, 26, 185. i, n = 5,158, 844, 150, 71, 186, 137, 31, 142, 162, 28, 193. j, n = 4,606, 474, 2,818, 820. k, n = 4,606, 474, 2,818, 820.
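
The ensemble step in panel n, where the final prediction is the average of two models' predictions, can be sketched as follows. The two predictor functions are hypothetical stand-ins that return fixed values; they are not the trained PRIDICT2.0 models, and the string input is only a placeholder for whatever features a real model would consume.

```python
from typing import Callable, Sequence

# A predictor maps a pegRNA description to a predicted editing efficiency (%).
Predictor = Callable[[str], float]


def ensemble_predict(models: Sequence[Predictor], pegrna: str) -> float:
    """Average the predictions of all models for one pegRNA."""
    if not models:
        raise ValueError("at least one model is required")
    return sum(model(pegrna) for model in models) / len(models)


def model_a(pegrna: str) -> float:
    """Hypothetical stand-in for Model A; returns a fixed value."""
    return 30.0


def model_b(pegrna: str) -> float:
    """Hypothetical stand-in for Model B; returns a fixed value."""
    return 50.0


print(ensemble_predict([model_a, model_b], "example-pegRNA"))
```
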
Fig. 2. Validation of PRIDICT2.0 predictions in different contexts and in comparison to existing models.
(a-d) Performance of editing efficiency prediction by PRIDICT and PRIDICT2.0. PRIDICT2.0 HEK293T prediction is used for HEK293T, and K562 prediction is used for K562. (a) Overall performance in HEK293T (n=22,619) and K562 (n=22,752). (b) Performance split by different edit types and cell types (n from top to bottom: 5,957, 4,455, 6,283, 5,924, 5,969, 4,508, 6,302, 5,973). (c) Performance on insertions and deletions with different lengths. n for different edit lengths combined are as follows: HEK293T insertions: 6,283, HEK293T deletions: 5,924, K562 insertions: 6,302, K562 deletions: 5,973. (d) Performance on endogenous editing datasets. n for Mathis et al. 2023 = 45, n for Anzalone et al. 2019 = 181, n for Brooks et al. 2023 = 59. The tested dataset from Brooks et al. consists of all editing efficiencies from their “insertion set” and “correction set #1, #2a, #2b, and #3” combined. All Brooks et al. pegRNAs were used in a PE5 (PEmax + nicking guide + MLH1 inhibition) setting. (e) Overview of the capability to predict different edit types of the prime editing efficiency prediction models PRIDICT2.0, DeepPrime, and MinsePIE. Green = prediction possible, Bright-green = prediction possible with limitations, Red = prediction not possible. *1: PRIDICT2.0 was trained on insertions <= 15 bp. *2: MinsePIE prediction is restricted to insertions at the nick position, and the model was originally built to predict relative insertion efficiencies for different insertion sequences within a specific target with constant PBS/RTT overhang. (f) Prediction performance of PRIDICT2.0 (HEK293T and K562) on Library-Small filtered for NGG PAM in various editing contexts (n for each column from left to right: 2,181, 2,109, 1,637, 2,200, 2,161, 1,926, 2,133, 2,152, 2,182, 1,909, 1,972, 2,023, 2,208, 2,178, 2,066, 2,040, 2,181, 2,128). 
(g) Prediction performance of DeepPrime models on Library-Diverse (this study) filtered for edits <= 3bp (n for each column from left to right: 10,715, 10,761, 9,653, 8,609) (h,i) Comparison of different prediction models by predicting endogenous loci. DePr = DeepPrime. (h) Editing datasets in HEK293T. Mathis: Endogenous editing dataset (PE2) from Mathis et al., filtered for <= 3bp edits. Yu-1: Endogenous dataset (PE2max) from Figure S2E in Yu et al. Yu-2: Endogenous dataset from BRCA2 pegRNAs (PE2max) in Figure 6F,G from Yu et al. Anza: Endogenous dataset with PE2 editing from Anzalone et al. n (for each column from left to right): 42, 39, 24, 181. (i) Editing datasets in K562 (Mathis et al. 2023, PE2, filtered for <=3 bp edits; n = 42) and HuH-7 cells. Brooks et al. dataset consists of all editing efficiencies from their “insertion set” and “correction set #1, #2a, #2b, and #3” combined (n = 59; PE5 setting).
Fig. 3. Characterization and prediction of prime editing efficiency based on chromatin context.
(a) Schematic illustration of the TRIP library integrated by PiggyBac. TR: Terminal Repeats. (b) Schematic illustration of the TRIP screen in K562 cells. (c) Overview of TRIP reporter insertion locations with mapped prime editing efficiencies in the K562 genome. n = 1,182. (d) Context of the TRIP reporter integration sites. (e) Schematic illustration of all ENCODE datasets of K562 used in this study. TF: Transcription Factors. The number of different features is indicated, with the total number of datasets (accounting for multiple ENCODE contributions per feature) given in brackets. Total number of datasets: 455. (f) Illustration of averaging windows (100, 1,000, 2,000, and 5,000 bp) around mapped integrations over which chromatin datasets (ENCODE) are averaged for further analysis. (g) Overall Pearson correlation of a selection of chromatin characteristics (25/455) to editing efficiency (PE/prime editing, ABE8e, BE4max, and Cas9) across the TRIP library. The averaging window with the highest absolute correlation value with PE editing is shown for each feature. (h) UMAP projection based on all 455 ENCODE datasets and averaging windows of the TRIP library. Prime editing efficiency is shown via color scale. n = 1,165. (i) KMeans clustering on UMAP projection to cluster integrations into 4 groups (A to D). n = 1,165. (j) Average editing efficiency of integrations in each KMeans cluster for PE, ABE8e, BE4max, and Cas9. n per cluster: 380 (A), 267 (B), 349 (C), 169 (D). (k) Comparison of machine learning model performances on editing efficiency prediction with ePRIDICT on the TRIP library in K562 cells. Bars show the mean of fivefold cross-validation, and each of the five cross-validations is visualized as individual data points (n=5). Error bar indicates the mean +/- s.d. (l) Visualization of ePRIDICT XGBoost model predictions on PE TRIP library dataset (n = 1,182). Predictions from 5 cross-validations were combined for visualization. 
Color gradient from dark purple to yellow indicates increasing point density, per Gaussian KDE. Dotted line: least-squares polynomial fit. (m) Validation of prime editing efficiency on endogenous loci with a high (>50) or low (<35) ePRIDICT score normalized to editing on the reporter sequence for 1bp replacements (in green; n-high: 9, n-low: 10), 4bp insertions (in blue; n-high: 9, n-low: 9) and 4bp deletions (in orange; n-high: 9, n-low: 10). Boxplots represent the 25th, 50th and 75th percentiles. Whiskers extend to points within 1.5 times the interquartile range from the quartiles. (n) Endogenous editing in K562 with a total of 146 pegRNAs (56 pegRNAs from m and 90 additional pegRNAs) compared to PRIDICT2.0 K562 score. Dotted line: least-squares polynomial fit. (o) Comparison of pegRNAs in n to combination (mean) of the PRIDICT2.0 K562 score and ePRIDICT score. Dotted line: least-squares polynomial fit. (p) Additional visualization of the prediction performance of PRIDICT2.0 alone or in combination with ePRIDICT (average of prediction values from both models) on 41 different endogenous loci with highly variable chromatin characteristics, targeted with 146 pegRNAs in K562 cells.
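
The averaging windows in panel f (100, 1,000, 2,000, and 5,000 bp around each mapped integration) can be sketched as below. This is a simplified illustration: the chromatin signal is assumed to be available as a per-base list of values for one chromosome, the window is assumed to be centered on the integration site and clipped at the ends of the track, and the function names are hypothetical.

```python
from statistics import mean

WINDOWS_BP = (100, 1_000, 2_000, 5_000)


def window_average(signal, site, window_bp):
    """Mean of a per-base signal over a window centered on `site` (0-based index).

    Centering the window on the site and clipping it at the track ends are
    assumptions of this sketch.
    """
    half = window_bp // 2
    start = max(0, site - half)
    end = min(len(signal), site + half)
    return mean(signal[start:end])


def window_features(signal, site, windows=WINDOWS_BP):
    """One averaged feature value per window size for a single integration site."""
    return {window: window_average(signal, site, window) for window in windows}


# Hypothetical track: a 100 bp peak of signal 1.0 around position 5,000.
track = [0.0] * 10_000
track[4_950:5_050] = [1.0] * 100
print(window_features(track, 5_000))
```

In this toy track the narrow peak dominates the 100 bp window and is diluted in the larger ones, which is why the same dataset can correlate differently with editing efficiency depending on the window size.
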

References

    1. Mathis N, et al. Predicting prime editing efficiency and product purity by deep learning. Nat Biotechnol. 2023;41:1151–1159. doi: 10.1038/s41587-022-01613-7.
    2. Kim HK, et al. Predicting the efficiency of prime editing guide RNAs in human cells. Nat Biotechnol. 2021;39:198–206.
    3. Koeppel J, et al. Prediction of prime editing insertion efficiencies using sequence features and DNA repair determinants. Nat Biotechnol. 2023;2023:1–11. doi: 10.1038/s41587-023-01678-y.
    4. Yu G, et al. Prediction of efficiencies for diverse prime editing systems in multiple cell types. Cell. 2023;186:1–17.
    5. Chen PJ, et al. Enhanced prime editing systems by manipulating cellular determinants of editing outcomes. Cell. 2021;184:5635–5652.e29. doi: 10.1016/j.cell.2021.09.018.