Fig. 1. Characterization and prediction of pegRNA efficiencies based on sequence context.
(a) Schematic overview of the screen with the target-matched pegRNA library 'Library-Diverse'. (b-g) Editing efficiency for (b,c) insertions, (d,e) 1-5 bp replacements, and (f,g) 1-15 bp deletions in HEK293T and K562 cells, respectively. (h,i) Editing efficiencies in HEK293T (h) or K562 (i) cells for double edits in which two separated 1 bp replacements were installed. Intended editing means that both replacements were installed, whereas intermediate editing means that only one of the two replacements was installed. A distance of 0 corresponds to single 1 bp edits. (j,k) Editing efficiency of single and double 1 bp replacements with or without editing of at least one base within the GG PAM sequence in HEK293T (j) and K562 (k) cells. (d,e,h-k) Bars include only pegRNAs with a 7, 10, or 15 bp RTT overhang to ensure similar RTT overhang distributions between conditions. (b-k) Bars show the mean, with error bars indicating mean ± s.e.m. (l,m) Heatmaps visualizing the editing efficiency of pegRNAs (single base replacements) in Library-Diverse with different RTT overhang lengths and edit positions in HEK293T (l; n = 3,079) and K562 (m; n = 3,091) cells. The PAM position is highlighted with a black dotted rectangle. (n) Schematic illustration of PRIDICT2.0, an ensemble model based on the prediction average of two models: Model A, base trained on Library 1 and fine-tuned on Library-Diverse (HEK293T and K562), and Model B, base trained on Library 1 and Library-ClinVar and likewise fine-tuned on Library-Diverse. The number of pegRNAs in each dataset is indicated above or below the datasets. (o,p) Performance of PRIDICT2.0 on Library-Diverse (5-fold cross-validation) for (o) HEK293T (n = 22,619) and (p) K562 (n = 22,752) cells. The color gradient from dark purple to yellow indicates increasing point density, estimated with a Gaussian kernel density estimate (KDE). The black line corresponds to the least-squares polynomial fit.
(q) Spearman correlations of editing efficiency in Library-Diverse in different contexts (HEK293T, K562, K562-MLH1dn, and in vivo mouse liver), and with PRIDICT2.0 HEK293T and K562 prediction scores. (r,s) SHAP analysis of an XGBoost model on a Library-Diverse test dataset (top 15 features). A high SHAP value is associated with a higher predicted editing efficiency. Feature values correspond to the values of each individual feature. A detailed list of all features is provided in Supplementary Table 1. The numbers of analyzed pegRNA-target combinations are as follows. b, n = 408, 519, 541, 446, 447, 416, 433, 446, 432, 424, 361, 373, 359, 332, 346. c, n = 410, 525, 546, 445, 448, 414, 434, 444, 434, 426, 364, 375, 360, 331, 346. d, n = 5,146, 834, 638, 538, 576. e, n = 5,158, 844, 644, 554, 575. f, n = 424, 475, 388, 367, 406, 420, 380, 418, 405, 408, 391, 375, 365, 340, 362. g, n = 420, 476, 394, 373, 409, 420, 385, 416, 411, 409, 391, 385, 370, 348, 366. h, n = 5,146, 834, 153, 71, 191, 136, 27, 141, 155, 26, 185. i, n = 5,158, 844, 150, 71, 186, 137, 31, 142, 162, 28, 193. j, n = 4,606, 474, 2,818, 820. k, n = 4,606, 474, 2,818, 820.
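
The summary statistics named in this legend (mean ± s.e.m. for the bars, the prediction average of two models for the ensemble, and Spearman correlation between measured and predicted efficiencies) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`mean_sem`, `ensemble_average`, `average_ranks`, `spearman`) and all numbers in the usage example are hypothetical, and the s.e.m. is assumed to be the sample standard deviation divided by the square root of n.

```python
import math
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, s.e.m.), assuming s.e.m. = sample SD / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to compute the s.e.m.")
    return mean(values), stdev(values) / math.sqrt(n)


def ensemble_average(pred_a, pred_b):
    """Average the per-pegRNA predictions of two models (hypothetical A and B)."""
    if len(pred_a) != len(pred_b):
        raise ValueError("both models must score the same pegRNAs")
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up editing efficiencies (%) and model scores, for illustration only.
    measured = [12.0, 35.5, 8.2, 51.0, 27.3]
    model_a = [15.0, 30.0, 10.0, 48.0, 22.0]
    model_b = [11.0, 38.0, 6.0, 55.0, 30.0]
    predicted = ensemble_average(model_a, model_b)
    print("mean, s.e.m.:", mean_sem(measured))
    print("Spearman R:", spearman(measured, predicted))
```

Rank-based correlation is a natural choice for this kind of comparison because it depends only on the ordering of pegRNAs, not on the absolute scale of editing efficiencies, which can differ between cell contexts.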