BMC Bioinformatics. 2007 Jan 20:8:18.

doi: 10.1186/1471-2105-8-18.

PILER-CR: fast and accurate identification of CRISPR repeats

Robert C Edgar¹

PMID: 17239253
PMCID: PMC1790904
DOI: 10.1186/1471-2105-8-18

Robert C Edgar. BMC Bioinformatics. 2007.

Affiliation

¹ bob@drive5.com

Abstract

Background: Sequencing of prokaryotic genomes has recently revealed the presence of CRISPR elements: short, highly conserved repeats separated by unique sequences of similar length. The distinctive sequence signature of CRISPR repeats can be found using general-purpose repeat- or pattern-finding software tools. However, the output of such tools is not always ideal for studying these repeats, and significant effort is sometimes needed to build additional tools and perform manual analysis of the output.

Results: We present PILER-CR, a program specifically designed for the identification and analysis of CRISPR repeats. The program executes rapidly, completing a 5 Mb genome in around 5 seconds on a current desktop computer. We validate the algorithm by manual curation and by comparison with published surveys of these repeats, finding that PILER-CR has both high sensitivity and high specificity. We also present a catalogue of putative CRISPR repeats identified in a comprehensive analysis of 346 prokaryotic genomes.

Conclusion: PILER-CR is a useful tool for rapid identification and classification of CRISPR repeats. The software is donated to the public domain. Source code and a Linux binary are freely available at http://www.drive5.com/pilercr.

Figures

**Figure 1**
**Structure of a CRISPR array**. CRISPR repeats are perfectly (or almost perfectly) conserved short sequences, typically of length 20 to 40 bases, separated by unique sequences known as spacers. The spacer length in a given array is sometimes approximately conserved, varying by a few bases, and sometimes exactly conserved. The spacer length is typically similar to the repeat length.

**Figure 2**
**Dot-plot of a CRISPR array against itself**. Self-similarity plot ("dot-plot") of a genome against itself in a CRISPR array region. The main diagonal is shown as a dashed line. As the two axes represent the same sequence, local alignments (diagonal lines) are symmetrical about the main diagonal.
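The self-similarity idea can be illustrated with a small sketch. This is not PILER-CR's alignment method; it assumes exact k-mer matches as a stand-in for local alignments, and the function name `self_dot_plot`, the parameter `k`, and the toy sequence are all hypothetical. Only the upper triangle (i < j) is reported, since the plot is symmetrical about the main diagonal.

```python
from collections import defaultdict


def self_dot_plot(seq: str, k: int = 8) -> list[tuple[int, int]]:
    """Return sorted (i, j) pairs, i < j, where the k-mers at i and j are identical.

    Exact k-mer matching is an assumption made for illustration; a real
    repeat finder would use local alignments that tolerate mismatches.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    dots = []
    for starts in positions.values():
        # Every pair of occurrences of the same k-mer is one off-diagonal dot.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                dots.append((starts[a], starts[b]))
    return sorted(dots)


# Toy array: three copies of a 10-base repeat separated by 10-base spacers.
repeat = "GTTTCAGACC"
toy = repeat + "AAACGTTGCA" + repeat + "TCGGATACTG" + repeat
dots = self_dot_plot(toy, k=8)

# Dots line up on diagonals whose offsets are multiples of the repeat period.
offsets = {j - i for i, j in dots}
print(offsets)
```

In this toy case the dots fall on two diagonals, at offsets equal to one and two repeat periods, which is the pattern of short parallel lines the figure describes.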

**Figure 3**
**Pile construction**. When local alignments are projected onto the genome, "piles" are produced. A pile is a contiguous sequence of bases, each one of which has a hit to at least one other region in the genome. Bases that are not in a pile are unique sequence. Each local alignment connects two piles. In this figure, each hit has a different color so, for example, the purple hit connects the first and second pile.
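Pile construction as described in this caption amounts to merging overlapping intervals. The following is a minimal sketch under stated assumptions: hits are given as pairs of half-open `(start, end)` intervals, and the name `build_piles` and the example coordinates are hypothetical rather than taken from the PILER-CR source.

```python
def build_piles(hits):
    """Merge the two intervals of every hit into sorted, non-overlapping piles.

    hits: list of ((start1, end1), (start2, end2)) half-open intervals,
    one pair per local alignment (an assumed input format).
    """
    intervals = sorted(iv for hit in hits for iv in hit)
    piles = []
    for start, end in intervals:
        if piles and start <= piles[-1][1]:
            # Overlaps or touches the current pile: extend it.
            piles[-1][1] = max(piles[-1][1], end)
        else:
            # Gap of unique sequence: start a new pile.
            piles.append([start, end])
    return [tuple(p) for p in piles]


# Three hits with slightly ragged ends, as local alignments often have.
hits = [((0, 10), (20, 30)), ((22, 32), (42, 52)), ((1, 9), (41, 49))]
piles = build_piles(hits)
print(piles)  # [(0, 10), (20, 32), (41, 52)]
```

Bases outside the returned intervals correspond to the unique sequence mentioned in the caption.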

**Figure 4**
**A chain of hits meeting CRISPR criteria**. CRISPR arrays are identified by following chains of hits. Starting with a given pile, each hit that connects this pile to another pile later in the genome is a potential link in the chain. All possible chains are explored, abandoning the search each time the chain violates the criteria used for CRISPR array recognition (see Table 2). These criteria include maximum and minimum repeat length, maximum and minimum spacer length, and measures of the variance in repeat and spacer lengths. Shorter links are explored before longer links, because a regularly spaced array can also be obtained by skipping every second, third, ... repeat. The figure shows the correct chain (arrows) for the example array from Fig. 3.
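The chain search in this caption can be sketched as a depth-first search with pruning. Everything specific below is an assumption for illustration: the threshold values are placeholders (the actual values are in the paper's Table 2, which is not reproduced here), the variance measures are replaced by a simple max-minus-min spread, and the names `chain_ok` and `longest_chain` are hypothetical.

```python
def chain_ok(chain, piles, min_rep=16, max_rep=64, min_sp=8, max_sp=64, max_spread=4):
    """Check a chain of pile indices against placeholder CRISPR criteria."""
    reps = [piles[i][1] - piles[i][0] for i in chain]
    spacers = [piles[b][0] - piles[a][1] for a, b in zip(chain, chain[1:])]
    if any(not (min_rep <= r <= max_rep) for r in reps):
        return False
    if any(not (min_sp <= s <= max_sp) for s in spacers):
        return False
    # Spread (max - min) stands in for the variance measures in the paper.
    if max(reps) - min(reps) > max_spread:
        return False
    if spacers and max(spacers) - min(spacers) > max_spread:
        return False
    return True


def longest_chain(start, piles, links, **criteria):
    """Depth-first search from `start`, abandoning chains that violate the criteria."""
    best = []

    def extend(chain):
        nonlocal best
        if len(chain) > len(best):
            best = list(chain)
        last = chain[-1]
        # Explore shorter links first, so a chain that visits every repeat
        # is found before one that skips repeats.
        for nxt in sorted(links.get(last, []), key=lambda j: piles[j][0] - piles[last][1]):
            chain.append(nxt)
            if chain_ok(chain, piles, **criteria):
                extend(chain)
            chain.pop()

    if chain_ok([start], piles, **criteria):
        extend([start])
    return best


# Four regularly spaced 30-base piles plus one distant, unrelated pile.
piles = [(0, 30), (60, 90), (121, 151), (180, 210), (5000, 5030)]
links = {0: [1, 2, 3, 4], 1: [2, 3], 2: [3]}
chain = longest_chain(0, piles, links)
print(chain)  # [0, 1, 2, 3]
```

With these placeholder thresholds, links that skip a repeat or jump to the distant pile produce spacers above `max_sp` and are pruned, leaving the chain through all four adjacent piles.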
