Bioinformatics. 2025 Jul 1;41(7):btaf385.

doi: 10.1093/bioinformatics/btaf385.

Leveraging protein language models for cross-variant CRISPR/Cas9 sgRNA activity prediction

Yalin Hou¹, Yiming Li¹, Ruiqing Zheng¹, Fuhao Zhang², Fei Guo¹, Min Li¹, Min Zeng¹

Affiliations

¹ School of Computer Science and Engineering, Central South University, Changsha 410083, China.
² College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China.

PMID: 40600900
PMCID: PMC12254127
DOI: 10.1093/bioinformatics/btaf385

Yalin Hou et al. Bioinformatics. 2025.

Abstract

Motivation: Accurate prediction of single-guide RNA (sgRNA) activity is crucial for optimizing the CRISPR/Cas9 gene-editing system, as it directly influences the efficiency and accuracy of genome modifications. However, existing prediction methods mainly rely on large-scale experimental data from a single Cas9 variant to build variant-specific sgRNA activity prediction models. This limits their generalization ability and prediction performance across different Cas9 proteins and variants, as well as their scalability to newly discovered variants.

Results: In this study, we propose PLM-CRISPR, a deep learning model that leverages protein language models to capture representations of Cas9 proteins and their variants for cross-variant sgRNA activity prediction. PLM-CRISPR uses tailored feature extraction modules for sgRNA and protein sequences, and combines a cross-variant training strategy with a dynamic feature fusion mechanism to model their interactions. Extensive experiments show that PLM-CRISPR outperforms existing methods on datasets spanning seven Cas9 proteins and variants in three real-world scenarios, including data-scarce situations with few or no samples for novel variants. Comparative analyses with traditional machine learning and deep learning models further confirm the effectiveness of PLM-CRISPR. Additionally, motif analysis shows that PLM-CRISPR accurately identifies high-activity sgRNA sequence patterns across the different variants. Overall, PLM-CRISPR provides a robust, scalable, and generalizable solution for sgRNA activity prediction across diverse Cas9 proteins and variants.
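
The abstract does not spell out how the dynamic feature fusion is computed. As a rough illustration of the general idea only, the sketch below weights an sgRNA feature vector and a protein feature vector with a softmax over two relevance scores and sums them. The function names, the two-score gating, and the plain-list vectors are assumptions made for this example; they are not taken from the PLM-CRISPR source code.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(
    sgrna_features: Sequence[float],
    protein_features: Sequence[float],
    sgrna_score: float,
    protein_score: float,
) -> List[float]:
    """Hypothetical softmax-gated fusion of two equal-length feature vectors.

    In a trained network the two scores would come from a learned layer and
    change per input; here they are passed in directly to keep the sketch small.
    """
    if len(sgrna_features) != len(protein_features):
        raise ValueError("feature vectors must have the same length")
    w_sgrna, w_protein = softmax([sgrna_score, protein_score])
    return [
        w_sgrna * s + w_protein * p
        for s, p in zip(sgrna_features, protein_features)
    ]


if __name__ == "__main__":
    # Equal scores give equal weights, so the result is the element-wise mean.
    print(fuse_features([1.0, 0.0], [0.0, 1.0], 0.0, 0.0))
```

Because the weights sum to one, the fused vector stays on the same scale as its inputs, which is one common reason to gate with a softmax rather than with unconstrained weights.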

Availability and implementation: The source code can be obtained from https://github.com/CSUBioGroup/PLM-CRISPR.

Figures

**Figure 1.**
Overview of the PLM-CRISPR framework for predicting sgRNA activity across Cas9 protein (variants). The framework takes two types of input: the sgRNA sequence and the protein variant sequence. Each type of biological data undergoes specialized preprocessing modules. For sgRNA sequences, one-hot encoding is used to generate initial representations, which are then processed by multi-layer CNNs for feature extraction. Cas9 protein (variants) sequences are encoded using PLM ESM2 and further processed through a TextCNN. The features from both paths are dynamically weighted and integrated for the final classification. The “SpCas9-HF1_WT mutations” in the left panel refers to the mutation sites of the Cas9 variant SpCas9-HF1 relative to the wild-type *Streptococcus pyogenes* Cas9 (WT-SpCas9), as is the case for the other variants.
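
The one-hot step in the sgRNA path can be pictured with a small sketch. This is a generic encoder, not the authors' implementation: the A/C/G/T channel order, the handling of U as T, and the all-zero row for ambiguous bases are assumptions made for this example.

```python
from typing import List

NUCLEOTIDES = "ACGT"  # assumed channel order; the paper's order is not stated here


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode a nucleotide sequence as a len(sequence) x 4 matrix.

    RNA input is accepted by mapping U to T. Any other symbol (for example N)
    becomes an all-zero row.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = []
    for base in sequence.upper().replace("U", "T"):
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

A matrix of this shape is what a stack of 1D convolutions would typically consume, with the four nucleotide channels acting as input channels.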

**Figure 2.**
Comparison of Spearman correlation coefficients between variant-specific training and cross-variant training strategy for each dataset.
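
For readers unfamiliar with the metric used in Figures 2 to 4, the sketch below computes a Spearman correlation coefficient as the Pearson correlation of average ranks, using only the Python standard library. It is an illustration of the metric itself; the function names are made up for this example, and the paper's evaluation code may rely on a library routine instead.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Spearman correlation between measured and predicted activity scores."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(measured), average_ranks(predicted)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values, not data from the paper.
    print(spearman([0.1, 0.4, 0.35, 0.8], [0.2, 0.5, 0.3, 0.9]))
```

Because only ranks enter the calculation, the coefficient rewards a model for ordering sgRNAs correctly even when its predicted scores are on a different scale from the measured ones.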

**Figure 3.**
Heatmap of Spearman correlation coefficients for PLM-CRISPR compared with traditional machine learning baselines (top) and classical deep learning baselines (bottom).

**Figure 4.**
Schematic illustration and performance comparison of PLM-CRISPR with existing sgRNA activity prediction methods across three application scenarios. (a) Schematic of the simulated well-established variant scenario, in which models are built on well-established variants with sufficient training data. (b) Spearman correlation coefficient comparisons in the well-established variant scenario. (c) Schematic of the simulated newly identified variant scenario, involving variants with limited training data. (d) Spearman correlation coefficient comparisons in the newly identified variant scenario. (e) Schematic of the simulated newly discovered variant scenario, where testing is conducted on variants with no training data. (f) Spearman correlation coefficient comparisons in the newly discovered variant scenario.

**Figure 5.**
Motif enrichment analysis of high-activity and low-activity sgRNAs based on experimentally measured sgRNA activity scores (top) and PLM-CRISPR-predicted sgRNA activity scores (bottom) across different Cas9 protein (variants). (a) evoCas9. (b) HypaCas9. (c) xCas9. (d) SniperCas9. (e) eSpCas9(1.1). (f) SpCas9-HF1. (g) WT-SpCas9.

Cited by

2OMe-LM: predicting 2'-O-methylation sites in human RNA using a pre-trained RNA language model.
Liu Q, Zeng M, Li Y, Lu C, Kan S, Guo F, Li M. Bioinformatics. 2025 Aug 2;41(8):btaf417. doi: 10.1093/bioinformatics/btaf417. PMID: 40728934. Free PMC article.
