Genome Biol. 2021 Aug 19;22(1):235.
doi: 10.1186/s13059-021-02458-0.

Easy-Prime: a machine learning-based prime editor design tool

Yichao Li et al. Genome Biol. 2021.

Abstract

Prime editing is a revolutionary genome-editing technology that can make a wide range of precise edits in DNA. However, designing highly efficient prime editors (PEs) remains challenging. We develop Easy-Prime, a machine learning-based program trained with multiple published data sources. Easy-Prime captures both known and novel features, such as RNA folding structure, and optimizes feature combinations to improve editing efficiency. We provide optimized PE design for installation of 89.5% of 152,351 GWAS variants. Easy-Prime is available both as a command line tool and an interactive PE design server at: http://easy-prime.cc/.

Keywords: Machine learning; Prime editor; pegRNA design.

Conflict of interest statement

S.Q.T. is a member of the scientific advisory board of Kromatid and Twelve Bio.

Figures

Fig. 1
Overview of Easy-Prime design and machine learning model evaluation. a (1) Cas9 activity feature is predicted by DeepSpCas9 score (purple box). (2) Oligo features (yellow box) are the GC content and sequence length of the PBS and RTT. (3) Target mutation features (cyan box) are whether the target mutation disrupts the PAM sequence, whether the ngRNA spacer sequence matches the edited protospacer sequence, and the numbers of mismatches, deletions, and insertions. (4) Position features (pink box) are the distance between the ngRNA and the sgRNA (ngRNA_pos), the distance between the target mutation and the sgRNA (Target_pos), and the number of nucleotides downstream of the desired edit (target_end_flank). (5) RNA folding features are the maximal pairing probability between each of the first 10 bp of the RTT and the scaffold sequence based on RNAplfold [29]. b A machine learning workflow for data preprocessing, feature extraction, and model training and evaluation. c, d Correlation scatter plots of the true PE efficiency (x-axis) and the predicted efficiency (y-axis). c Train-test-split evaluation for the PE2 model and nested cross-validation evaluation for the PE3 model. d An independent PE dataset used for a third-party data evaluation of the PE3 model. “R” is the Spearman correlation coefficient. “r” is the Pearson correlation coefficient
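The oligo features in panel a and the two correlation coefficients in panels c and d can be illustrated with a minimal Python sketch. This is not Easy-Prime's own code: the function names, the dictionary keys, and the example sequences are hypothetical, and the sketch assumes that GC content means the fraction of G or C bases and that Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import math


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def oligo_features(pbs: str, rtt: str) -> dict:
    """GC content and length of the PBS and RTT (key names are hypothetical)."""
    return {
        "PBS_length": len(pbs),
        "PBS_GC": gc_content(pbs),
        "RTT_length": len(rtt),
        "RTT_GC": gc_content(rtt),
    }


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def ranks(values) -> list:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Made-up sequences and efficiencies, for illustration only.
    print(oligo_features(pbs="GGCCATCA", rtt="ATGCTTAGGC"))
    observed = [0.10, 0.25, 0.40, 0.55]
    predicted = [0.12, 0.20, 0.45, 0.50]
    print(pearson(observed, predicted), spearman(observed, predicted))
```

Reporting both coefficients is complementary: Pearson measures linear agreement between predicted and observed efficiency, while Spearman only checks whether the designs are ranked in the same order.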
Fig. 2
Features associated with PE efficiency. a, b Feature importance plots of the XGBoost regression models. Feature rankings are based on the mean absolute SHAP value for the PE2 and PE3 models. RNA folding features are combined for simplified visualization. Target_end_flank: number of nucleotides from target mutation to the end of RTT sequence. Target_pos: distance between target mutation and sgRNA nick site. ngRNA_pos: distance between ngRNA nick site and sgRNA nick site. c Schematic view of RNA-folding disruption score formulation. On the left, a pegRNA sequence consisting of an sgRNA (red), a scaffold sequence (orange), and an RTT sequence (green) is labeled with positions and nucleotides, such as 81G. The pairing probability between 81G and the first position in the RTT sequence is denoted as P(1,81). On the right is a heatmap of the pairing probability between each position in the scaffold and the 3′ extension sequence (i.e., RTT + PBS). P(1,81) is highlighted by a red dashed box. At bottom left, the formula to calculate D(i) is shown, where i represents the position in the 3′ extension. d Line plot showing the trend of correlations between the first 16 positions in the 3′ extension and the targeted editing frequency
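Ranking features by mean absolute SHAP value, as in panels a and b, can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the SHAP matrix is filled with made-up numbers that do not reflect the paper's results, and in practice the per-sample SHAP values would come from an explainer applied to the trained model.

```python
def rank_features_by_mean_abs(shap_rows, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important.

    shap_rows: one list of per-feature SHAP values per sample.
    feature_names: column names, in the same order as the values in each row.
    """
    n_samples = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Feature names are taken from the caption; the values are toy numbers.
    names = ["Target_pos", "ngRNA_pos", "Target_end_flank"]
    toy_shap = [
        [0.5, -0.1, 0.2],
        [-0.3, 0.1, -0.4],
    ]
    for feature, score in rank_features_by_mean_abs(toy_shap, names):
        print(feature, round(score, 3))
```

Taking the absolute value before averaging keeps positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks as important.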
Fig. 3
Experimental validation of PE designs by Easy-Prime. a Barplot showing the observed editing efficiencies to install a positive control (HEK3 + 1TtoG, blue bar) and 7 blood variants predicted by Easy-Prime (pink bars). Replicates are represented by grey dots. b Barplot showing paired PE design comparison between Easy-Prime prediction (pink bars) and PrimeDesign recommendation (cyan bars)
Fig. 4
The web portal interface for Easy-Prime. a Screenshot of the Easy-Prime web portal (based on DASH [33]). Easy-Prime takes a file in vcf or fasta format as input. It searches and optimizes all individual sgRNA-PBS-RTT-ngRNA combinations and visualizes the gRNAs with the highest predicted efficiency for each input variant. b An interactive PE design visualization based on the ProteinPaint genome browser

References

    1. Pickar-Oliver A, Gersbach CA. The next generation of CRISPR–Cas technologies and applications. Nat. Rev. Mol. Cell Biol. 2019;20(8):490–507. doi: 10.1038/s41580-019-0131-5.
    2. Yin H, Xue W, Anderson DG. CRISPR-Cas: a tool for cancer research and therapeutics. Nat. Rev. Clin. Oncol. 2019;16(5):281–295. doi: 10.1038/s41571-019-0166-8.
    3. High KA, Roncarolo MG. Gene therapy. N. Engl. J. Med. 2019;381:455–464. doi: 10.1056/NEJMra1706910.
    4. Anzalone AV, Koblan LW, Liu DR. Genome editing with CRISPR–Cas nucleases, base editors, transposases and prime editors. Nat. Biotechnol. 2020;38:824–844.
    5. Gaudelli NM, Komor AC, Rees HA, Packer MS, Badran AH, Bryson DI, Liu DR. Programmable base editing of A•T to G•C in genomic DNA without DNA cleavage. Nature. 2017;551(7681):464–471. doi: 10.1038/nature24644.