Bioinformatics. 2021 Jun 16;37(10):1352-1359.
doi: 10.1093/bioinformatics/btaa984.

Casboundary: automated definition of integral Cas cassettes

Victor A Padilha et al. Bioinformatics.

Abstract

Motivation: CRISPR-Cas systems are found in most archaeal and many bacterial genomes, where they provide adaptive immunity against mobile genetic elements. Each CRISPR-Cas system is encoded by a set of consecutive cas genes, here termed a cassette. Identifying cassette boundaries is key to finding cassettes in CRISPR research, and it is often carried out using Hidden Markov Models and manual annotation. In this article, we propose the first method able to define cassette boundaries automatically. In addition, we present a Cas-type predictive model that the method uses to assign each gene located within a cassette's boundaries a Cas label from a set of pre-defined Cas types. Furthermore, the proposed method can detect potentially new cas genes and decompose a cassette into its modules.

Results: We evaluate the predictive performance of the proposed method on data collected from the two most recent CRISPR classification studies. In our experiments, we obtain an average similarity of 0.86 between the predicted and expected cassettes. We also achieve F-scores above 0.9 for the classification of cas genes of known types and 0.73 for unknown ones. Finally, we conduct two additional case studies, investigating the occurrence of potentially new cas genes and of module exchange between different genomes.
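
To make the similarity figure concrete, the following is a minimal sketch of how a predicted cassette could be compared with an expected one. It assumes the similarity is the Jaccard index computed over sets of gene identifiers, which the caption of Fig. 3 suggests but the abstract does not spell out. The function name `jaccard_similarity` and the gene lists are hypothetical and are not taken from the Casboundary code base.

```python
def jaccard_similarity(predicted, expected):
    """Jaccard index between two collections of gene identifiers.

    Assumption: a cassette is treated as an unordered set of genes.
    """
    predicted, expected = set(predicted), set(expected)
    union = predicted | expected
    if not union:
        # Two empty cassettes are treated as identical by convention.
        return 1.0
    return len(predicted & expected) / len(union)


# Hypothetical example: the prediction shares three genes with the
# expected cassette, and six distinct genes appear across both.
predicted = ["cas3", "cas5", "cas7", "cas8"]
expected = ["cas5", "cas7", "cas8", "cas1", "cas2"]
print(jaccard_similarity(predicted, expected))  # 3 / 6 = 0.5
```

A value of 1.0 means the predicted and expected cassettes contain exactly the same genes, while 0.0 means they share none.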

Availability and implementation: https://github.com/BackofenLab/Casboundary.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1. Examples of the structure of CRISPR cassettes: (a) single CRISPR cassette; and (b) single CRISPR cassette with a gap. The signature genes are in bold. Blue arrows are interference genes while purple arrows are adaptation genes.
Fig. 2. Examples of the structure of multi-module CRISPR cassettes: (a) multi-module cassette without overlap; and (b) multi-module cassette with overlap. The signature genes are in bold. The blue and red arrows are interference genes, yellow arrows are processing genes and purple arrows are adaptation genes.
Fig. 3. Histograms with 100 equally sized bins of the Jaccard Similarity and Loss for single cassette prediction using ERT (a, b) and DNN (c, d). The inner figures are zoomed views of the corresponding outer ones, excluding the most dominant bin.
Fig. 4. Examples of our method's cassette prediction for the organism Thermotoga sp. RQ2. Specifically, it found two cassettes composed of single interference modules, represented by the orange and green arrows, and a multi-module cassette with two interference modules (blue and red arrows) and an adaptation module (purple arrows). See Figure S3 for more details.
Fig. 5. Comparison of Cas type prediction F-scores between our models (using a combination of the specific HMM and protein properties features) and CRISPRCasFinder. For a comparison between the runtime of Casboundary and CRISPRCasFinder, see Supplementary Table S3.
Fig. 6. Examples of the application of our method for the identification of potentially new Cas proteins, which are marked in bold. In (a), our method predicted two proteins as 'new', where one of them has some similarity with Cas8 proteins and may be a new subfamily of Cas8. In (b), our method predicted two proteins as 'new', which do not have any similarity to other known Cas proteins and may indicate two new genes.

