Bioinformatics. 2025 Jul 1;41(7):btaf385.
doi: 10.1093/bioinformatics/btaf385.

Leveraging protein language models for cross-variant CRISPR/Cas9 sgRNA activity prediction

Yalin Hou et al. Bioinformatics.

Abstract

Motivation: Accurate prediction of single-guide RNA (sgRNA) activity is crucial for optimizing the CRISPR/Cas9 gene-editing system, because it directly influences the efficiency and accuracy of genome modifications. However, existing prediction methods mainly train variant-specific models on large-scale experimental data from a single Cas9 variant. This limits their generalization and prediction performance across different Cas9 proteins and variants, as well as their scalability to newly discovered variants.

Results: We propose PLM-CRISPR, a deep learning model that uses protein language models to represent Cas9 proteins and their variants for cross-variant sgRNA activity prediction. PLM-CRISPR uses tailored feature extraction modules for the sgRNA and protein sequences, and combines a cross-variant training strategy with a dynamic feature fusion mechanism to model their interactions. In extensive experiments on datasets spanning seven Cas9 proteins (wild type and variants), PLM-CRISPR outperforms existing methods in three real-world scenarios, including data-scarce cases with few or no samples for novel variants. Comparative analyses with traditional machine learning and deep learning models further confirm its effectiveness. Motif analysis also shows that PLM-CRISPR accurately identifies high-activity sgRNA sequence patterns across the Cas9 variants. Overall, PLM-CRISPR provides a robust, scalable, and generalizable solution for sgRNA activity prediction across diverse Cas9 variants.

Availability and implementation: The source code can be obtained from https://github.com/CSUBioGroup/PLM-CRISPR.

Figures

Figure 1.
Overview of the PLM-CRISPR framework for predicting sgRNA activity across Cas9 proteins and variants. The framework takes two inputs: the sgRNA sequence and the protein variant sequence. Each input passes through its own preprocessing module. sgRNA sequences are one-hot encoded to generate initial representations, which multi-layer CNNs then process for feature extraction. Cas9 protein and variant sequences are encoded with the protein language model ESM2 and further processed by a TextCNN. The features from both paths are dynamically weighted and integrated for the final classification. The label “SpCas9-HF1_WT mutations” in the left panel refers to the mutation sites of the Cas9 variant SpCas9-HF1 relative to the wild-type Streptococcus pyogenes Cas9 (WT-SpCas9); the labels for the other variants follow the same convention.
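As a minimal sketch of two steps named in this caption, the Python below one-hot encodes an sgRNA sequence and combines two feature vectors with softmax-normalized weights. It is illustrative only and is not taken from the PLM-CRISPR source code: the channel order `ACGT`, the function names `one_hot_sgrna` and `fuse`, and the use of a softmax over two gate scores are all assumptions, and in a trained model the gate scores would be learned rather than passed in by hand.

```python
import math

NUCLEOTIDES = "ACGT"  # channel order is an assumption, not taken from the paper


def one_hot_sgrna(seq):
    """Return a len(seq) x 4 one-hot matrix; U is treated as T."""
    matrix = []
    for base in seq.upper().replace("U", "T"):
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected base: {base!r}")
        matrix.append([1.0 if base == n else 0.0 for n in NUCLEOTIDES])
    return matrix


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(sgrna_feat, protein_feat, gate_scores):
    """Weighted sum of two equal-length feature vectors.

    gate_scores holds two raw scores (sgRNA path, protein path); their
    softmax gives the mixing weights. This is a hypothetical stand-in
    for the dynamic weighting described in the caption.
    """
    if len(sgrna_feat) != len(protein_feat):
        raise ValueError("feature vectors must have the same length")
    w_sgrna, w_protein = softmax(gate_scores)
    return [w_sgrna * s + w_protein * p
            for s, p in zip(sgrna_feat, protein_feat)]
```

For example, `one_hot_sgrna("ACGU")` yields a 4 x 4 matrix with a single 1.0 per row, and `fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])` weights both paths equally.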
Figure 2.
Comparison of Spearman correlation coefficients for the variant-specific and cross-variant training strategies on each dataset.
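The Spearman correlation coefficient used throughout these comparisons is the Pearson correlation of the ranks of two lists, here measured and predicted activity scores. The following is a small standard-library sketch of that definition, with tied values given their average rank. The function names are illustrative and the paper's own evaluation code may differ.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(measured, predicted):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two lists of equal length >= 2")
    return pearson(average_ranks(measured), average_ranks(predicted))
```

Because only ranks matter, a model whose predictions order the sgRNAs exactly as the experiment does scores 1.0 even if the predicted values are on a different scale.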
Figure 3.
Heatmap of Spearman correlation coefficients for PLM-CRISPR compared with traditional machine learning baselines (top) and classical deep learning baselines (bottom).
Figure 4.
Schematic illustration and performance comparison of PLM-CRISPR with existing sgRNA activity prediction methods across three application scenarios. (a) Schematic of the simulated well-established variant scenario, in which models are built for variants with sufficient training data. (b) Spearman correlation coefficient comparisons in the well-established variant scenario. (c) Schematic of the simulated newly identified variant scenario, involving variants with limited training data. (d) Spearman correlation coefficient comparisons in the newly identified variant scenario. (e) Schematic of the simulated newly discovered variant scenario, where testing is conducted on variants with no training data. (f) Spearman correlation coefficient comparisons in the newly discovered variant scenario.
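One way to simulate the limited-data and no-data scenarios is to hold out a target variant and allow only a fixed number of its samples into training. The sketch below assumes each record is a `(variant, sgrna, activity)` tuple. The helper name `holdout_variant` and the `n_shots` parameter are hypothetical, and the paper's actual splitting protocol may differ.

```python
import random


def holdout_variant(records, target, n_shots=0, seed=0):
    """Split records into (train, test) around one held-out variant.

    records: list of (variant, sgrna, activity) tuples (assumed layout).
    n_shots=0 mimics a variant with no training data; a small positive
    n_shots mimics a variant with limited training data.
    """
    target_rows = [r for r in records if r[0] == target]
    other_rows = [r for r in records if r[0] != target]
    if not target_rows:
        raise ValueError(f"no records for variant {target!r}")
    if not 0 <= n_shots <= len(target_rows):
        raise ValueError("n_shots must be between 0 and the target count")
    random.Random(seed).shuffle(target_rows)  # reproducible sampling
    train = other_rows + target_rows[:n_shots]
    test = target_rows[n_shots:]
    return train, test
```

With `n_shots=0` the training set contains no rows from the target variant, so the test score reflects generalization to an unseen variant. The well-established scenario would instead use an ordinary train/test split within each variant.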
Figure 5.
Motif enrichment analysis of high-activity and low-activity sgRNAs based on experimentally measured sgRNA activity scores (top) and PLM-CRISPR-predicted sgRNA activity scores (bottom) across different Cas9 proteins and variants. (a) evoCas9. (b) HypaCas9. (c) xCas9. (d) SniperCas9. (e) eSpCas9(1.1). (f) SpCas9-HF1. (g) WT-SpCas9.
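A simple way to look for position-specific sequence preferences is to compare per-position nucleotide frequencies between the high-activity and low-activity groups. The sketch below does only that and is an assumption-laden illustration: the abstract does not state which motif enrichment method the paper uses, and the function names are hypothetical.

```python
BASES = "ACGT"


def position_frequencies(seqs):
    """Per-position base frequencies for equal-length DNA sequences."""
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for pos in range(length):
        counts = {b: 0 for b in BASES}
        for s in seqs:
            base = s[pos].upper()
            if base not in counts:
                raise ValueError(f"unexpected base: {base!r}")
            counts[base] += 1
        freqs.append({b: c / len(seqs) for b, c in counts.items()})
    return freqs


def enrichment(high_seqs, low_seqs):
    """Frequency difference (high minus low) per position and base."""
    high = position_frequencies(high_seqs)
    low = position_frequencies(low_seqs)
    if len(high) != len(low):
        raise ValueError("groups must share one sequence length")
    return [{b: h[b] - l[b] for b in BASES} for h, l in zip(high, low)]
```

Positive values mark bases that are more common in the high-activity group at that position. Running the same comparison on measured and on predicted scores allows the two sets of patterns to be compared, which is the kind of comparison the top and bottom panels present.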
