Biomolecules. 2022 Mar 7;12(3):409.
doi: 10.3390/biom12030409.

CNN-XG: A Hybrid Framework for sgRNA On-Target Prediction

Bohao Li et al. Biomolecules. 2022.

Abstract

As a third-generation gene-editing technology, CRISPR/Cas9 has a wide range of applications. Successful editing depends on a functional complex of sgRNA and Cas9 protein acting on the target gene, so sgRNAs with high specificity and high on-target cleavage efficiency make the process more accurate and efficient. Many machine learning and deep learning models already predict sgRNA on-target cleavage efficiency, but their prediction accuracy leaves room for improvement. XGBoost performs well at classification because, as an ensemble model, it can compensate for the weaknesses of any single classifier; we therefore introduce XGBoost to improve the prediction of sgRNA on-target activity. We present a novel machine learning framework that combines a convolutional neural network (CNN) with XGBoost to predict sgRNA on-target knockout efficacy. The framework, called CNN-XG, has two main parts: a CNN feature extractor that automatically extracts features from sequences, and an XGBoost predictor that makes predictions from the features extracted by convolution. Experiments on commonly used datasets show that CNN-XG performs significantly better than other existing frameworks in classification mode.

Keywords: CRISPR/Cas9; XGBoost; deep learning; on-target; sgRNA.

Conflict of interest statement

The authors have no conflict of interest or ethics statement to declare.

Figures

Figure 1
Implementation details of CNN-XG. (a) Encoding schema for the sgRNA sequence and the epigenetic information. Each of the four nucleotide bases, A, G, C and T, is treated as a channel, and each piece of epigenetic information is also treated as a channel. (b) Training and feature extraction in the CNN. (c) The features extracted by the CNN are further selected using random forest models. (d) The selected features are passed to the XGBoost classifier for the final prediction. (e) The structure of the convolutional part. The network contains two structurally identical branches, one extracting sgRNA features and the other epigenetic features. The final fully connected layer produces the final output.
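The channel encoding described in panel (a) can be illustrated with a minimal sketch. The function name `one_hot_encode`, the channel order A, C, G, T, and the representation of each epigenetic track as one 0/1 value per position are assumptions made for illustration; the caption only states that each base and each piece of epigenetic information is treated as a channel.

```python
# Hypothetical channel order; the caption does not specify one.
BASES = "ACGT"


def one_hot_encode(seq, epigenetic_tracks=()):
    """Encode a sequence as a list of channels, one list of 0/1 values per channel.

    The first four channels mark the positions of A, C, G and T.
    Each epigenetic track (assumed here to hold one value per position)
    is appended as an extra channel.
    """
    seq = seq.upper()
    if any(base not in BASES for base in seq):
        raise ValueError("sequence may contain only A, C, G and T")

    # One channel per base: 1 where the base occurs, 0 elsewhere.
    channels = [[1 if base == b else 0 for base in seq] for b in BASES]

    # One extra channel per epigenetic track.
    for track in epigenetic_tracks:
        if len(track) != len(seq):
            raise ValueError("each epigenetic track must match the sequence length")
        channels.append(list(track))
    return channels


# Toy 4-nt sequence with one made-up epigenetic track (illustrative values only).
encoded = one_hot_encode("ACGT", epigenetic_tracks=[[1, 0, 0, 1]])
```

Here `encoded` holds five channels of length four: four for the bases and one for the toy epigenetic track.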
Figure 2
Importance scores, computed with random forest, for the features obtained by the CNN (the last 32 features are not shown because of their low importance scores). The x-axis represents the analyzed features, and the y-axis denotes the importance scores.
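The selection step implied by this figure, keeping only the features with the highest importance scores, might look roughly like the sketch below. It assumes the importance scores have already been computed (the source uses random forest for that; this sketch does not). The function name `select_top_features`, the cutoff `k`, and the toy numbers are hypothetical.

```python
def select_top_features(feature_rows, importances, k):
    """Keep the k feature columns with the highest importance scores.

    feature_rows: list of samples, each a list with one value per feature.
    importances:  one precomputed score per feature.
    Returns the kept column indices (in original order) and the reduced rows.
    """
    if not 0 <= k <= len(importances):
        raise ValueError("k must be between 0 and the number of features")

    # Rank feature indices from most to least important.
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)

    # Keep the top k, restoring the original column order.
    kept = sorted(ranked[:k])
    reduced = [[row[i] for i in kept] for row in feature_rows]
    return kept, reduced


# Toy example: four features, two samples, made-up scores.
kept, reduced = select_top_features(
    feature_rows=[[1, 2, 3, 4], [5, 6, 7, 8]],
    importances=[0.10, 0.50, 0.05, 0.35],
    k=2,
)
```

In this toy case the second and fourth columns carry the highest scores, so only those two columns remain in `reduced`.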
Figure 3
Evaluation of CNN-XG for on-target profile prediction. (a–c) Comparison of sgRNA on-target efficacy predictions for HCT116, HELA, HEK293T and HCT116 in 10-fold cross-validation in classification mode and regression mode. (d,e) Comparison of sgRNA on-target efficacy predictions in regression and classification schema for various datasets, i.e., HCT116 cell line, HELA cell line, HEK293T cell line, HCT116 cell line. (f) The result of a generalization ability test in new cell lines.
Figure 4
Heatmap of Spearman coefficients between CNN-XG and other recent algorithms on various SpCas9 variant datasets under 10-fold cross-validation. The prediction models are placed vertically, whereas the test sets are arranged horizontally. (The tested Spearman coefficients of C-RNNCrispr, DeepCas9 and DeepSpCas9 were derived from CRISPR-ONT [34].)
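For readers unfamiliar with the metric in this heatmap, the Spearman coefficient is the Pearson correlation of the ranks of two score lists, with tied values sharing their average rank. The sketch below is a plain standard-library illustration of that definition, not the evaluation code used in the paper; the function names and the toy scores are assumptions.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next value ties with the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman coefficient: Pearson correlation computed on ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(predicted), average_ranks(measured))


# Made-up predicted and measured efficacies with identical ordering.
rho = spearman([0.1, 0.4, 0.35, 0.8], [0.2, 0.5, 0.3, 0.9])
```

Because the two toy lists order the four guides identically, `rho` comes out at the maximum value of 1; a fully reversed ordering would give -1.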

References

    1. Jansen R., Van Embden J.D.A., Gaastra W., Schouls L.M. Identification of genes that are associated with DNA repeats in prokaryotes. Mol. Microbiol. 2002;43:1565–1575. doi: 10.1046/j.1365-2958.2002.02839.x.
    2. Deltcheva E., Chylinski K., Sharma C.M., Gonzales K., Chao Y., Pirzada Z.A., Eckert M.R., Vogel J., Charpentier E. CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature. 2011;471:602–607. doi: 10.1038/nature09886.
    3. Mojica F.J.M., Díez-Villaseñor C., García-Martínez J., Almendros C. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology. 2009;155:733–740. doi: 10.1099/mic.0.023960-0.
    4. Hsu P.D., Scott D.A., Weinstein J.A., Ran F.A., Konermann S., Agarwala V., Li Y., Fine E.J., Wu X., Shalem O., et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 2013;31:827–832. doi: 10.1038/nbt.2647.
    5. Guilinger J.P., Thompson D.B., Liu D.R. Fusion of catalytically inactive Cas9 to FokI nuclease improves the specificity of genome modification. Nat. Biotechnol. 2014;32:577–582. doi: 10.1038/nbt.2909.