PLoS Comput Biol. 2017 Oct 16;13(10):e1005807.
doi: 10.1371/journal.pcbi.1005807. eCollection 2017 Oct.

A machine learning approach for predicting CRISPR-Cas9 cleavage efficiencies and patterns underlying its mechanism of action

Shiran Abadi et al. PLoS Comput Biol.

Abstract

The adaptation of the CRISPR-Cas9 system as a genome editing technique has generated much excitement in recent years owing to its ability to manipulate targeted genes and genomic regions that are complementary to a programmed single guide RNA (sgRNA). However, the efficacy of a specific sgRNA is not uniquely defined by exact sequence homology to the target site, thus unintended off-targets might additionally be cleaved. Current methods for sgRNA design are mainly concerned with predicting off-targets for a given sgRNA using basic sequence features and employ elementary rules for ranking possible sgRNAs. Here, we introduce CRISTA (CRISPR Target Assessment), a novel algorithm within the machine learning framework that determines the propensity of a genomic site to be cleaved by a given sgRNA. We show that the predictions made with CRISTA are more accurate than other available methodologies. We further demonstrate that the occurrence of bulges is not a rare phenomenon and should be accounted for in the prediction process. Beyond predicting cleavage efficiencies, the learning process provides inferences regarding patterns that underlie the mechanism of action of the CRISPR-Cas9 system. We discover that attributes that describe the spatial structure and rigidity of the entire genomic site as well as those surrounding the PAM region are a major component of the prediction capabilities.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Schematic flow of the cross-validation procedures.
The main components of the learning pipeline for the leave-one-sgRNA-out and leave-study-out cross-validation procedures are presented.
(1) This step was applied to the leave-one-sgRNA-out procedure only.
(2) In each iteration, the samples of a single sgRNA (in the leave-one-sgRNA-out procedure) or all samples from a single study (in the leave-study-out procedure) were excluded from the training data and used as a test set. The algorithm was trained on the rest of the data.
(3) Each set of cleaved samples (targets that correspond to a single sgRNA) was oversampled using bootstrapping, yielding a set twice the size of the original one, and an equal-sized set of uncleaved samples was randomly chosen.
(4) For each original set of cleaved samples in the test set (targets that correspond to a single sgRNA), an equal-sized set of uncleaved samples was randomly chosen.
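The splitting and balancing steps in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the sample dictionaries, the field names `sgrna` and `cleaved`, and the function names are hypothetical. It also assumes that uncleaved samples are drawn from the same sgRNA as the cleaved set and without replacement, neither of which the caption states.

```python
import random
from collections import defaultdict


def leave_one_group_out(samples, key):
    """Yield (held_out, train, test), holding out one group per iteration.

    key="sgrna" mimics leave-one-sgRNA-out; a study field would mimic
    leave-study-out. Field names are hypothetical.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    for held_out, test in groups.items():
        train = [s for s in samples if s[key] != held_out]
        yield held_out, train, test


def balance_training_set(train, rng, factor=2):
    """Oversample cleaved samples per sgRNA and pair them with uncleaved ones.

    Cleaved samples are bootstrapped (drawn with replacement) to `factor`
    times their original count. An equal number of uncleaved samples is then
    drawn at random; drawing them from the same sgRNA and without replacement
    is an assumption of this sketch.
    """
    by_sgrna = defaultdict(lambda: {True: [], False: []})
    for sample in train:
        by_sgrna[sample["sgrna"]][bool(sample["cleaved"])].append(sample)

    balanced = []
    for sets in by_sgrna.values():
        cleaved, uncleaved = sets[True], sets[False]
        if not cleaved:
            continue
        target = factor * len(cleaved)
        balanced.extend(rng.choices(cleaved, k=target))
        balanced.extend(rng.sample(uncleaved, k=min(target, len(uncleaved))))
    return balanced


# Toy data: 3 sgRNAs, each with 2 cleaved and 10 uncleaved candidate sites.
toy_samples = [
    {"sgrna": f"g{g}", "site": i, "cleaved": i < 2}
    for g in range(3)
    for i in range(12)
]

rng = random.Random(0)
for held_out, train, test in leave_one_group_out(toy_samples, key="sgrna"):
    balanced = balance_training_set(train, rng)
    print(held_out, len(train), len(balanced), len(test))
```

Holding out whole groups, rather than random rows, keeps targets of the same sgRNA from appearing in both the training and test sets.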
Fig 2. Comparison of four prediction algorithms on the assembled dataset.
(a-d) Pearson correlation computed over all the samples in the dataset. The horizontal axis shows the scaled observed values published in the experimental studies, and the vertical axis shows the scores predicted by: (a) CRISTA applied using cross-validation (r² = 0.65), (b) CCTop (r² = 0.23), (c) OptCD (r² = 0.13), (d) CFD score (r² = 0.52).
(e) Receiver operating characteristic (ROC) curves computed over all the samples in the test dataset: CRISTA (AUC = 0.96), CCTop (AUC = 0.85), OptCD (AUC = 0.85), CFD score (AUC = 0.91). Positives and negatives represent cleaved and uncleaved sites, respectively. The true-positive rate is the number of true positives divided by the number of positives; the false-positive rate is the number of false positives divided by the number of negatives.
(f) Precision-recall (PRC) curves computed over all the samples in the dataset: CRISTA (AUC = 0.96), CCTop (AUC = 0.87), OptCD (AUC = 0.88), CFD score (AUC = 0.93). Precision is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the number of positives.
(g) Pearson correlation computed for each sgRNA: CRISTA (averaged r² = 0.80, sd = 0.13), CCTop (averaged r² = 0.46, sd = 0.22), OptCD (averaged r² = 0.32, sd = 0.28), CFD score (averaged r² = 0.65, sd = 0.28).
(h) ROC curves computed for each sgRNA: CRISTA (averaged AUC = 0.99, sd = 0.02), CCTop (averaged AUC = 0.86, sd = 0.13), OptCD (averaged AUC = 0.9, sd = 0.12), CFD score (averaged AUC = 0.9, sd = 0.11).
(i) PRC curves computed for each sgRNA: CRISTA (averaged AUC = 0.99, sd = 0.02), CCTop (averaged AUC = 0.92, sd = 0.09), OptCD (averaged AUC = 0.93, sd = 0.07), CFD score (averaged AUC = 0.94, sd = 0.06).
Mean values are marked with horizontal lines. The whiskers extend 1.5 times the interquartile range past the first and third quartiles.
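The rate definitions in panels (e) and (f) and the squared Pearson correlation in panels (a-d) and (g) can be written out as a short sketch. The function names and the toy labels and scores are illustrative only. The score threshold is an assumption added here to turn continuous scores into calls; ROC and PRC curves are obtained by sweeping such a threshold.

```python
import math


def classification_rates(labels, scores, threshold):
    """Compute TPR, FPR, precision and recall at one score threshold.

    labels: 1 for cleaved (positive), 0 for uncleaved (negative).
    A site is called positive when its score is >= threshold.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if not y and s >= threshold)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return {
        "tpr": tp / positives,      # true positives / positives
        "fpr": fp / negatives,      # false positives / negatives
        # Precision is undefined when nothing is called positive.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / positives,   # same quantity as the TPR
    }


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    r = cov / math.sqrt(var_x * var_y)
    return r * r


# Toy example: two cleaved and two uncleaved sites.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(classification_rates(toy_labels, toy_scores, threshold=0.5))
print(pearson_r2([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Recall and the true-positive rate are the same quantity, which is why the ROC and PRC curves share one axis definition and differ in the other.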
Fig 3. Accuracy across different studies in a leave-study-out cross-validation.
(a) Observed cleavage intensities versus predicted intensities. The top and bottom rows represent the nuclear targets of the ‘unique guides’ and ‘common guides’, respectively. Pearson r² values are shown; "overall" is the correlation calculated over all points, and "mean" is the average of the correlations calculated for each sgRNA individually. Different colors represent nuclear targets of different sgRNAs. (b, c) ROC and PRC curves. The ‘unique guides’ and ‘common guides’ of each study are represented by different curves. AUC values are denoted in the legend. Each column corresponds to a single experimental platform.
Fig 4. Feature importance.
Clustering of top-ranked features and their relative importance. Node sizes represent feature importance as calculated by CRISTA. Edge transparency represents correlation strength, such that strongly correlated features are connected by darker edges. Yellow and blue edges represent positively and negatively correlated features, respectively. Abbreviations: YY, mismatches of type pyrimidine-pyrimidine; RR, mismatches of type purine-purine; MGW, minor groove width; ‘#’ represents counts (for further explanations of the features, see S3 Table). The graph was produced with Cytoscape [55] using the pairwise correlation for every pair of features and their importance scores.
Fig 5. Forward selection results.
The top plot represents the ROC-AUC, PRC-AUC, r², and root mean square error (RMSE) following the addition of every feature from left to right. The bars represent feature importance, i.e., the contribution of every feature to the prediction accuracy as computed by the Random Forest algorithm. The RMSE is divided by two for visualization.
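As a rough illustration of forward selection in general, the sketch below greedily adds, at each step, the feature that most lowers an error score. The caption does not say which variant or ordering the authors used, so the greedy rule is an assumption. The `evaluate` callback, the feature names, and the toy error function are all hypothetical; in practice `evaluate` would train a model on the given feature subset and return a cross-validated error such as the RMSE.

```python
def forward_selection(features, evaluate):
    """Greedy forward selection.

    features: candidate feature names.
    evaluate: callable taking a list of feature names and returning an
              error score (lower is better).
    Returns a list of (added_feature, error_after_adding) in selection order.
    """
    selected = []
    remaining = list(features)
    history = []
    while remaining:
        # Try each remaining feature and keep the one with the lowest error.
        best = min(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, evaluate(selected)))
    return history


# Toy error function: each hypothetical feature removes a fixed amount of error.
toy_gain = {"a": 0.1, "b": 0.5, "c": 0.3}


def toy_error(subset):
    return 1.0 - sum(toy_gain[f] for f in subset)


for feature, error in forward_selection(["a", "b", "c"], toy_error):
    print(feature, round(error, 3))
```

Plotting the recorded error against the number of selected features gives a curve of the kind shown in the top plot, one point per added feature.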
