Nat Biotechnol. 2023 Oct;41(10):1446-1456.

doi: 10.1038/s41587-023-01678-y. Epub 2023 Feb 16.

Prediction of prime editing insertion efficiencies using sequence features and DNA repair determinants

Affiliations

¹ Wellcome Sanger Institute, Hinxton, UK.
² Department of Computer Science, University of Tartu, Tartu, Estonia.
³ Wellcome Sanger Institute, Hinxton, UK. leopold.parts@sanger.ac.uk.
⁴ Department of Computer Science, University of Tartu, Tartu, Estonia. leopold.parts@sanger.ac.uk.

^# Contributed equally.

PMID: 36797492
PMCID: PMC10567557
DOI: 10.1038/s41587-023-01678-y

Jonas Koeppel et al. Nat Biotechnol. 2023 Oct.

Abstract

Most short sequences can be precisely written into a selected genomic target using prime editing; however, it remains unclear what factors govern insertion. We design a library of 3,604 sequences of various lengths and measure the frequency of their insertion into four genomic sites in three human cell lines, using different prime editor systems in varying DNA repair contexts. We find that length, nucleotide composition and secondary structure of the insertion sequence all affect insertion rates. We also discover that the 3' flap nucleases TREX1 and TREX2 suppress the insertion of longer sequences. Combining the sequence and repair features into a machine learning model, we can predict relative frequency of insertions into a site with R = 0.70. Finally, we demonstrate how our accurate prediction and user-friendly software help choose codon variants of common fusion tags that insert at high efficiency, and provide a catalog of empirically determined insertion rates for over a hundred useful sequences.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. High-throughput measurement of prime insertion efficiencies.**
a, Screen setup. Set 1 and Set 2 libraries were screened separately and data merged (Methods); panels d–f reflect Set 1 results only. b, Library composition. The number of sequences in the library (y axis) with different insert sequence lengths (x axis, top panel) and %GC content (x axis, bottom panel). c, Experimental design. NGS, next generation sequencing. d, Editing frequencies. Average mutation frequency (y axis) for different screens (x axis) stratified by mutation type (blue, insertions; gray, unintended outcomes). Markers represent one replicate and bars the average across n = 3 biological replicates. e, Replicate concordance. Pearson’s R between insertion rates in two screens (x axis) for different comparisons (y axis, colors). Markers, correlation value of one pair of screens (for replicate correlations, mean of pairwise comparison across n = 3 biological replicates); line and whiskers, mean and s.e.m. f, Representative examples of categories from e. Percentage insertion in the *HEK3* locus in HEK293T cells (y axis) compared with values (x axis) in other contexts (panels, colors) for insertion sequences (markers). Left panel, comparison of biological replicates; other panels, comparison of replicate averages. Label, R of values in linear scale. Colors as in e.

**Fig. 2. Prime insertion efficiency depends on insert length and MMR.**
a, Insertion rate in HEK293T cells. Percentage of reads with insertion (y axis, cut-off at 3 s.d. above mean) for different insert sizes (x axis) of individual sequences (blue markers) and averages for lengths with at least 30 measured sequences (dark blue line and markers) at different target sites (panels). Data represent the average of n = 3 biological replicates. b, As a, but for HAP1 cells. c, As a, but for HAP1 *∆MLH1* cells. d, Insertion rate in one cell context (y axis) compared with in another context (x axis) at the *HEK3* target of individual sequences (markers), comparing HEK293T with HAP1 cells (left panel) and HEK293T cells with HAP1 *∆MLH1* cells (middle panel). Red, short sequences (up to 4 nt); blue, medium sequences (5–13 nt); teal, longer sequences (>13 nt). Label, R between rates. The data are an average from n = 3 biological replicates (HEK293T) or n = 2 biological replicates (HAP1). e, Average insertion rates (y axis) across insert lengths (x axis) with at least 30 measured sequences in various cell line contexts (colors). Data are presented as mean ± s.e.m. n = 3 biological replicates (HEK293T) or n = 2 biological replicates (HAP1). f, The ratio of relative insertion rates (Methods) at the *HEK3* locus between HAP1 *∆MLH1* and HAP1 cells (y axis) for different lengths (x axis) stratified by colors as in d. Box, median and quartiles; whiskers, least extreme of 1.5 times the interquartile range from the quartile and most extreme values. Line, fit from an exponential model (ratio ≈ a × exp(−b × length) + 1). n = 2 biological replicates.
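The exponential model named in panel f, ratio ≈ a × exp(−b × length) + 1, can be illustrated with a minimal Python sketch. The function names, the log-linear fitting shortcut and the synthetic values of a and b are assumptions for illustration; the caption does not state how the model was fitted or which parameter values were obtained.

```python
import math

def mmr_ratio(length, a, b):
    """Evaluate ratio = a * exp(-b * length) + 1 for an insert length in nt."""
    return a * math.exp(-b * length) + 1.0

def fit_exponential(lengths, ratios):
    """Estimate (a, b) by least squares on log(ratio - 1) = log(a) - b * length.

    Assumes every ratio is > 1. This log-linear shortcut is an illustrative
    choice, not necessarily the fitting procedure used in the paper.
    """
    ys = [math.log(r - 1.0) for r in ratios]
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated with made-up parameters (a = 3.0, b = 0.2).
lengths = [1, 2, 4, 8, 16, 32]
ratios = [mmr_ratio(x, 3.0, 0.2) for x in lengths]
a_hat, b_hat = fit_exponential(lengths, ratios)
```

For b > 0, the modelled ratio decays towards 1 as insert length increases. Real measurements are noisy and may include ratios at or below 1, in which case a nonlinear least-squares fit would be a more suitable choice than this shortcut.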

**Fig. 3. Effects of prime editing steps.**
a, Schematic of molecular steps involved in prime editing. b, Normalized pegRNA count derived from sequencing of PCR amplicons from genomic DNA (x axis) or PCR amplicons from RNA (y axis) for the *HEK3* site in HEK293T cells for individual pegRNAs (markers). Pink, inserts with four or more consecutive adenines. Data represent the average of n = 3 biological replicates. c, Top panel, average insertion rate relative to length bin median (y axis) for inserts stratified by the longest consecutive run of adenines (x axis). Bottom panel, instead showing transcription rate (read counts from RNA/read counts from DNA) on the y axis. Data are presented as mean ± s.e.m. n = 3 biological replicates. d, Insertion frequencies at the *HEK3* site in HEK293T using the standard MMLV reverse transcriptase (PE2, x axis) and the FeLV reverse transcriptase (PE-FeLV, y axis) for different insertion sequences (markers). Colors, number of neighboring points. n = 3 biological replicates. e, As d, but comparing PE3 and PE2 at the *EMX1* site. f, Schematic of screens with overexpression constructs. g, Insertion frequencies for different overexpressions (y axis and panels) compared with no overexpression (x axis) for three biological replicate screens (markers) stratified by insertion sequence lengths (colors). h, Average insertion rates (y axis) across insert lengths (x axis) with at least 30 measured sequences for overexpression constructs (colors). Data are presented as mean ± s.e.m. n = 3 biological replicates. i, As h, but instead displaying the insertion rate fold changes of screens with overexpressions compared with no overexpression (y axis), calculated from the ratio of sums of all sequences (lines) or of ten randomly sampled sequences. j, Top, average insertion frequency (grayscale) of four sequences with varying lengths (x axis) when overexpressing eGFP stratified by homology arm lengths (panels). 
Bottom, insertion rate fold changes compared with eGFP (y axis) when overexpressing TREX1 and TREX2 (colors). n = 2 biological replicates. k, Fraction of the nontemplated adenine allele at the +9 position (y axis) for cells with overexpression constructs (x axis and colors) stratified by experiment and homology arm lengths (panels). Markers show screen averages from three biological replicates for the pooled screen or from separate pegRNAs for the individual validation experiment.
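Two quantities used in panels b and c, the longest consecutive run of adenines in an insert and the transcription rate (read counts from RNA divided by read counts from DNA), can be computed as in the sketch below. The function names, the example sequence and the read counts are hypothetical and are not taken from the study.

```python
def longest_run(seq, base="A"):
    """Return the length of the longest consecutive run of `base` in `seq`."""
    best = current = 0
    for nt in seq.upper():
        current = current + 1 if nt == base else 0
        best = max(best, current)
    return best

def transcription_rate(rna_reads, dna_reads):
    """Ratio of read counts from RNA to read counts from DNA, as in panel c."""
    if dna_reads == 0:
        raise ValueError("DNA read count must be non-zero")
    return rna_reads / dna_reads

# Hypothetical insert sequence and read counts, for illustration only.
insert = "GCAAAATTAAC"
run = longest_run(insert)            # longest adenine run in the insert
rate = transcription_rate(120, 480)  # RNA reads / DNA reads
```

An insert such as the one above, with a run of four adenines, would fall into the group highlighted in pink in panel b.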

**Fig. 4. Cytosine content and secondary structure of the insert sequence are positively correlated with the insertion rate.**
a, Correlation of length-normalized insertion rate with nucleotide frequency in the insert (colors) for each nucleotide (y axis) in each screen (x axis). Data represent the average of n = 3 (HEK293T) or n = 2 (HAP1) biological replicates. b, As a, but for a new set of screens with 18-nt inserts and 15-nt homology arms targeting five novel sites within 1 kb of the *HEK3* site. c, Insertion rate at the *HEK3* site in HEK293T cells relative to length bin median (y axis) for inserts (markers) with different cytosine content (x axis). Line, linear regression fit; shaded area, 95% posterior confidence interval of the fit. Data represent the average of n = 3 biological replicates. d, Insertion rates at the *HEK3* site in HEK293T cells relative to length bin median (y axis) for inserts (markers) with calculated Gibbs free energy (∆G) from ViennaFold (x axis). Line, linear regression fit; shaded area, 95% posterior confidence interval of the fit. Data represent the average of n = 3 biological replicates. e, Correlation (x axis) between insertion rates and insert sequence free energy calculated from different parts of the 3′ extension (y axis). Box, median and quartiles; whiskers, least extreme of 1.5 times the interquartile range from the quartile and most extreme values. n = 3 (HEK293T) or n = 2 (HAP1) biological replicates. f, Insertion rates for sequences (markers) at the *HEK3* site in HEK293T for pegRNAs (x axis) and epegRNAs (y axis). Data represent the average of n = 3 biological replicates. g, Percentage increase in insertion rate with each standard deviation increase in structure strength (colors) for different overexpression constructs (x axis) and insertion sequence lengths (y axis). h, Insertion rates relative to length bin median (y axis) for sequences that disrupt or preserve (x axis) scaffold loops (panels). Colored lines show screen medians and the thicker black lines and dots show the median across all screens. 
i, The predicted secondary structure of a 66-nt insert sequence (ELMI003108) with the *HEK3* homology arm.
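Several panels report insertion rates "relative to length bin median" and relate them to cytosine content. The sketch below shows one way such a normalization could be computed. Binning by exact insert length, the function names and the example data are assumptions; the captions do not define the bin width, and the values are made up.

```python
from collections import defaultdict
from statistics import median

def relative_to_length_bin_median(records):
    """Divide each insertion rate by the median rate of inserts of the same length.

    `records` is a list of (sequence, insertion_rate) pairs. Binning by exact
    length is an assumption; the captions do not define the bin width.
    """
    by_length = defaultdict(list)
    for seq, rate in records:
        by_length[len(seq)].append(rate)
    medians = {length: median(rates) for length, rates in by_length.items()}
    return [(seq, rate / medians[len(seq)]) for seq, rate in records]

def cytosine_fraction(seq):
    """Fraction of cytosines in the insert sequence."""
    return seq.upper().count("C") / len(seq)

# Made-up sequences and insertion rates (percent of reads), for illustration only.
records = [("ACC", 12.0), ("GTA", 6.0), ("CCC", 9.0), ("ACGTAC", 2.0), ("TTTTTT", 1.0)]
normalized = relative_to_length_bin_median(records)
```

Dividing by the bin median removes the dominant effect of insert length, so that the remaining variation can be compared with composition features such as cytosine fraction. A bin whose median is zero would need separate handling.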

**Fig. 5. Predicting prime insertion efficiencies.**
a, Schematic representation of model features. b, Tenfold cross-validation model performance on the training set (y axis) using different feature sets. System: MMR proficiency and Oligo(A) length. Sequence effects: length, reverse transcriptase template (RTT) structure, nucleotide composition and all of them combined (‘Total’). Model: combination of ten features. Extra: 53 features. Dashed line, median of ‘Model’. Box, median and quartiles. Whiskers, 1.5 times interquartile range. c, Feature importance. Left, distribution of SHAP values (x axis) for each feature (y axis, colors). Right, respective mean absolute SHAP values (x axis). d, Concordance of predicted (y axis) and observed (x axis) insertion efficiencies on the held-out test set (markers). Solid line, y = x. Label, Pearson’s R. An additional 18 points are beyond the plot limits (Supplementary Fig. 12). e, Concordance of predicted and observed values at new sites. Pearson’s R between predicted and observed normalized insertion efficiencies (y axis) for 356–388 18-nt sequences inserted into six different sites within the HEK3 locus (left bars) and 66 codon variants of six protein tags into nine sites in HEK293T cells (right bars). Line, performance on the dataset from d. f, Mean replicate correlation (light gray) ±s.e.m. and concordance of predicted and observed rates (yellow) on 6- and 9-nt insertions (63 and 1,908 sequences, respectively) at the TAPE-1 target from (ref. ). g, Distribution of Pearson’s R between observed and predicted insertion rates (x axis) of seven insertions into 134 loci from (ref. ). Dashed line, median. h–j, Measured insertion rates of predicted high- and low-inserting codon versions of six protein tags into nine sites. h, Measurements of insertion rate relative to mean insertion rate of codon sequences (y axis, colors) separated into predicted to be highly and lowly inserting (x axis). 
i, Insertion rates (x axis) of codon variants (markers) of six protein tags (y axis) into the NOLC1 site in HEK293T cells. Red, large predicted rate; blue, low predicted rate. Bar and whiskers, mean ± s.e.m. j, Concordance of observed and predicted insertion rates of all sequences for all target sites and codon variants. k, Effect of padding. Insertion rates (y axis) of three sequences (x axis) inserted without modification (gray) and padded with optimally predicted sequences to 18 nt (green).

Grants and funding

WT_/Wellcome Trust/United Kingdom
