BMC Bioinformatics. 2014 Jun 17:15:191.
doi: 10.1186/1471-2105-15-191.

Efficient design of meganucleases using a machine learning approach

Mikhail Zaslavskiy et al. BMC Bioinformatics. 2014.

Abstract

Background: Meganucleases are important tools for genome engineering, providing an efficient way to generate DNA double-strand breaks at specific loci of interest. Numerous experimental efforts, ranging from in vivo selection to in silico modeling, have been made to re-engineer meganucleases to target relevant DNA sequences.

Results: Here we present a novel in silico method for designing custom meganucleases that is based on the use of a machine learning approach. We compared it with existing in silico physical models and high-throughput experimental screening. The machine learning model was used to successfully predict active meganucleases for 53 new DNA targets.

Conclusions: This new method shows competitive performance compared with state-of-the-art in silico physical models, with up to a fourfold increase in terms of the design success rate. Compared to experimental high-throughput screening methods, it reduces the number of screening experiments needed by a factor of more than 100 without affecting final performance.

Figures

Figure 1
I-CreI/DNA binding interface. (A) Natural I-CreI target site with all positions indexed with respect to the center of the site from -11 to 11. -11NNNN and -5NNN are the reverse-complements of 11N4 and 5N3. (B) 3D structure of the I-CreI/DNA complex (PDB code: 1g9y). (C) I-CreI/DNA interaction map. Columns correspond to position on the DNA, rows correspond to positions of protein residues. Colors in the table are used to describe the nature of interaction between residues and nucleotides: dark green – backbone interactions, blue – water mediated, red – base specific. Residues N30-S40 and Q44-D75 are clustered together to indicate that they contact separate regions 11N4 and 5N3 on the DNA target.
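The reverse-complement relation mentioned in panel (A) can be illustrated with a minimal sketch. The function name and the example sequences are hypothetical and are not taken from the paper.

```python
# Watson-Crick pairing; N is kept as N so that "any base" placeholders survive.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    # Complement every base, then read the strand in the opposite direction.
    return seq.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAGT"))  # illustrative input only
    print(reverse_complement("GTAC"))  # a palindromic site maps to itself
```

A palindromic sequence such as GTAC is its own reverse complement, which is why the two halves of a target site can be described with the same indexing on either strand.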
Figure 2
Cross-validation performance of various in silico methods. (Left) %Top10: percentage of targets with at least one positive molecule among the top 10 ranked. (Right) AUC: AUC score (see Material and Methods). Mact: predictions made on the basis of module cleavage activities; Fx: FoldX score; Rt: Rosetta score; SeqMact: protein/target sequences + module cleavage activities; SeqMactFxStr: all features combined (sequences + module cleavage activities + FoldX scores and interactions). Error bars are estimated from 30 independent cross-validation experiments.
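The two metrics in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the data layout (a dictionary from target to scored candidates) and the function names are hypothetical, and AUC is assumed here to be the standard area under the ROC curve, computed per list of candidates by pairwise comparison.

```python
from typing import Dict, List, Tuple

# Each candidate is (predicted score, experimentally active?). Hypothetical layout.
Scored = List[Tuple[float, bool]]


def pct_top_k(per_target: Dict[str, Scored], k: int = 10) -> float:
    """Percentage of targets with at least one active molecule in the top k."""
    hits = 0
    for candidates in per_target.values():
        # Rank candidates for this target by decreasing predicted score.
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        if any(active for _, active in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(per_target)


def auc(candidates: Scored) -> float:
    """ROC AUC: probability that a random active outranks a random inactive."""
    pos = [s for s, active in candidates if active]
    neg = [s for s, active in candidates if not active]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive candidate")
    # Ties between an active and an inactive candidate count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of candidates, which keeps the sketch short; a rank-based formula would be preferable for large libraries.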
Figure 3
Performance of ML model as a function of training set composition. (Left) Performance of ML model as a function of the training set size (i.e. number of combinatorial libraries). Experimental settings are similar to those presented in Figure 2; each point corresponds to the cross-validation performance when only a portion of the training data is used. (Right) Success rate as a function of the minimal distance between test and training targets. (1, 2, 3): distance in number of bases; (100%, 80%, 20%): proportion of the training set that is kept after removal of targets that are too similar to targets in the test set. Distance subsampling: distance-based selection of targets; Uniform subsampling: random selection of a training set of equivalent size; r gives the drop (ratio) in performance score due to the distance-based selection of training targets.
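The two subsampling schemes compared in the right panel can be sketched as below. This is a minimal illustration, assuming that "distance in number of bases" means Hamming distance between equal-length target sequences; the function names and the fixed random seed are hypothetical.

```python
import random
from typing import List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_subsample(train: Sequence[str], test: Sequence[str],
                       min_dist: int) -> List[str]:
    """Keep training targets at least min_dist bases away from every test target."""
    return [t for t in train if all(hamming(t, s) >= min_dist for s in test)]


def uniform_subsample(train: Sequence[str], size: int, seed: int = 0) -> List[str]:
    """Random training subset of a given size, used as the size-matched control."""
    return random.Random(seed).sample(list(train), size)
```

Comparing a model trained on `distance_subsample(...)` with one trained on a `uniform_subsample(...)` of the same size separates the effect of target similarity from the effect of simply having less training data.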
Figure 4
Cross-validation performance of ML model as a function of interaction features. (Left) %Top10: percentage of targets with at least one positive molecule among the top 10 ranked. Descriptions of the various groups of features (SM-5, SM-11, SM-5_11, SM-M2M, SM-M2T, SM-Cross, SM-Intra and SeqMact) are given in the text. Error bars are estimated from 30 independent cross-validation experiments. (Right) Prediction of active mutants at least as specific as the wild-type I-CreI. Top10: average number of active proteins at least as specific as I-CreI among the top 10 ranked molecules; α: trade-off parameter between predicted specificity and activity of candidate proteins; Seq: machine learning model trained on protein/target sequences; Fx: FoldX score.
Figure 5
Success rate of meganuclease design methods. (Left) Experimental results on targets sampled from ETS (extended target space). (Right) Experimental results on targets sampled from RTS (restricted target space). SeqMact: machine learning predictions; SeqMact+: machine learning predictions with additional I132V mutation; Comb: combinatorial libraries. GTAC: proportion of GTAC target variants with at least one positive mutant; ORIG: proportion of original (sampled) targets with at least one positive mutant; ORIGstrong: proportion of original (sampled) targets with at least one highly active mutant (normalized cleavage activity score above 0.8).
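The ORIG and ORIGstrong proportions can be sketched as follows. Only the 0.8 threshold for "highly active" comes from the caption; the cutoff that defines a "positive" mutant is not stated here, so `positive_cutoff` is a hypothetical parameter, as are the function name and the data layout.

```python
from typing import Dict, List, Tuple


def success_rates(activities: Dict[str, List[float]],
                  positive_cutoff: float,
                  strong_cutoff: float = 0.8) -> Tuple[float, float]:
    """Return (ORIG, ORIGstrong) proportions over the tested targets.

    activities maps each target to the normalized cleavage activity scores
    of the mutants tested against it. positive_cutoff is an assumption.
    """
    n = len(activities)
    # A target counts as a success if its best mutant exceeds the cutoff.
    best = [max(scores, default=0.0) for scores in activities.values()]
    orig = sum(b > positive_cutoff for b in best) / n
    orig_strong = sum(b > strong_cutoff for b in best) / n
    return orig, orig_strong
```

Because both proportions depend only on the best mutant per target, testing more candidates per target can raise them but never lower them.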
