Science. 2023 Nov 24;382(6673):eadi1910.
doi: 10.1126/science.adi1910. Epub 2023 Nov 23.

Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering

Han Altae-Tran et al. Science. 2023.

Abstract

Microbial systems underpin many biotechnologies, including CRISPR, but the exponential growth of sequence databases makes it difficult to find previously unidentified systems. In this work, we develop the fast locality-sensitive hashing-based clustering (FLSHclust) algorithm, which performs deep clustering on massive datasets in linearithmic time. We incorporated FLSHclust into a CRISPR discovery pipeline and identified 188 previously unreported CRISPR-linked gene modules, revealing many additional biochemical functions coupled to adaptive immunity. We experimentally characterized three HNH nuclease-containing CRISPR systems, including the first type IV system with a specified interference mechanism, and engineered them for genome editing. We also identified and characterized a candidate type VII system, which we show acts on RNA. This work opens new avenues for harnessing CRISPR and for the broader exploration of the vast functional diversity of microbial proteins.

Conflict of interest statement

Competing interests: H.A-T., S.K. and F.Z. are co-inventors on U.S. provisional patent applications filed by the Broad Institute related to this work. F.Z. is a scientific advisor and cofounder of Editas Medicine, Beam Therapeutics, Pairwise Plants, Arbor Biotechnologies, and Aera Therapeutics. F.Z. is a scientific advisor for Octant.

Figures

Fig. 1. Design and implementation of FLSHclust
(A) Schematic of applications of protein clustering in biology and bioinformatics. Archetypal examples of biological systems that could be found with genome mining approaches for CRISPR are shown, including CRISPR-Associated Rossmann Fold (CARF) proteins and transposon-linked genes. (B) Conceptual schematic of locality-sensitive hashing. In contrast to standard hash-based bucketing, locality-sensitive hashing allows similar, non-identical objects to be bucketed together. The specific family of hash functions shown in the example is randomized positional masking (bit masking) on sequences. This family functions by dropping specific positions in each kmer, where the positions are randomly selected per hash function. (C) Schematic of the steps of FLSHclust involving locality-sensitive hashing. First, all kmers are extracted from each protein. Then, each hash function is applied to all kmers, and kmers with the same hash value are grouped and processed independently to determine which sequences will be aligned in the next step. (D) Optimized hash functions with no false negatives, as calculated using Markov Chain Monte Carlo, compared to standard randomized hash functions from the same family. The probability of bucketing two kmers together in one of the L hash tables is shown as a function of the number of mismatches between the kmers. The parameters used for the LSH family functions are L=24 hash functions and kmer length k=12, with 3 positions dropped per hash function. For the optimized hash functions, the target number of tolerated mismatches is 2, such that the family has no false negatives in identifying matches between kmers with up to 2 mismatch positions. (E) Clustering performance across different algorithms for clustering a 1M protein subset of the UniRef50 database. Linclust/F refers to linclust using 8001 kmers per protein, as opposed to the default of 20. FLSH refers to FLSHclust, with r=2 indicating two tolerated mismatches. Clustering performance shows the fraction of proteins that are grouped into a cluster of size 2 or more as a function of similarity to their nearest neighbors. (F) Scaling comparison of various clustering algorithms and FLSHclust against subsets of UniRef50. Above: compute time on 2 nodes, each with 64 CPUs. Below: average cluster size as a function of the number of input sequences. *MMseqs2 on the full UniRef50 dataset required substantially more compute resources to complete within a week and thus was not included in the timing analysis. Theoretical scaling is shown with big O notation. (G) Comparison of clustering algorithms as in (E), except on the full UniRef50 dataset. Additionally, a cumulative distribution across all input proteins is shown. Asterisk refers to the clustering threshold of 30%.
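The positional-masking scheme described in (B) to (D) can be illustrated with a minimal Python sketch. It is not the FLSHclust implementation: the function names are hypothetical, the masked kmer string itself stands in for the hash value, and every pair of sequences in a bucket is enumerated, which a scalable method would avoid. The last function gives the chance that two kmers share a bucket in at least one table when the masks are drawn independently and uniformly at random, which is the standard randomized case rather than the optimized family.

```python
import random
from collections import defaultdict
from math import comb


def make_masks(k, n_dropped, n_tables, seed=0):
    """Draw one random set of dropped positions per hash function (hypothetical helper)."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(k), n_dropped))) for _ in range(n_tables)]


def mask_kmer(kmer, dropped):
    """Hash a kmer by deleting the dropped positions; the remaining string is the key."""
    return "".join(c for i, c in enumerate(kmer) if i not in dropped)


def candidate_pairs(seqs, k, masks):
    """Return pairs of sequence ids that share a masked kmer in at least one table.

    Simplification: all pairs in a bucket are emitted, which is quadratic in bucket size.
    """
    pairs = set()
    for dropped in masks:
        buckets = defaultdict(set)
        for seq_id, seq in seqs.items():
            for i in range(len(seq) - k + 1):
                buckets[mask_kmer(seq[i:i + k], dropped)].add(seq_id)
        for ids in buckets.values():
            ordered = sorted(ids)
            for a in range(len(ordered)):
                for b in range(a + 1, len(ordered)):
                    pairs.add((ordered[a], ordered[b]))
    return pairs


def collision_probability(k, n_dropped, n_tables, mismatches):
    """Chance that two kmers with the given mismatch count collide in some table.

    Assumes independent, uniformly random masks. A collision in one table
    requires every mismatch position to be among the dropped positions.
    """
    if mismatches > n_dropped:
        return 0.0
    p_one = comb(k - mismatches, n_dropped - mismatches) / comb(k, n_dropped)
    return 1.0 - (1.0 - p_one) ** n_tables


if __name__ == "__main__":
    # Parameters quoted in panel (D): L=24, k=12, 3 positions dropped.
    for m in range(5):
        print(m, collision_probability(12, 3, 24, m))
```

Under these assumptions, the collision probability for kmers with 2 mismatches is well below 1 for random masks, which is consistent with the caption's point that the optimized family is needed to rule out false negatives at that mismatch level.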
Fig. 2. Discovery of hundreds of rare novel CRISPR systems with a sensitive, scalable CRISPR association pipeline.
(A) Schematic of CRISPR discovery pipeline using no all-to-all comparisons. (B) Comparison of naive and enhanced CRISPR association scores for identifying CRISPR-associated clusters. Left: known Cas genes; right: all clusters. (C) Selection of CRISPR-associated clusters. Left: relative count of Cas (blue) vs non-Cas (gray) clusters as a function of enhanced CRISPR association score. An empirical threshold of 0.35 enhanced score was selected for identifying CRISPR-associated clusters. Right: relative count of all clusters with Neff ≥ 3. Dotted line demarcates the 0.35 enhanced score cutoff. ~130,000 clusters with an enhanced score ≥ 0.35 passed for further analysis. N CRs: number of non-redundant loci with CRISPR arrays. (D) Line graph: Number of proteins over time in the complete dataset including all projects from public data (JGI, NCBI, WGS, and EMBL, excluding MG-RAST). Bottom: Back-calculated times at which CRISPR-associated, non-singleton protein clusters appeared in the public dataset for selected systems. Cluster assignments are fixed across time using the 30% sequence identity clustering from FLSHclust. The appearance time of a cluster is the earliest time at which a minimum of 2 non-redundant, CRISPR-associated proteins from the cluster are present in the public dataset. The appearance time of a system (e.g., Cas9) is the earliest appearance time across all related clusters. For multi-gene systems, a signature gene was used to represent the entire system (Type I: Cas7, Type III: Csm3, Type IV: Csf2). The inferred appearance time is an upper bound for the true CRISPR-associated cluster appearance time in the dataset.
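The appearance-time rule in (D) can be written as a short Python sketch. The function names and input layout are hypothetical, and the deposit dates are assumed to have been filtered already to non-redundant, CRISPR-associated proteins; how that filtering is done is not specified here.

```python
from datetime import date


def cluster_appearance(deposit_dates, min_members=2):
    """Earliest date at which at least min_members proteins of a cluster are present.

    Returns None if the cluster never reaches min_members proteins.
    """
    ordered = sorted(deposit_dates)
    if len(ordered) < min_members:
        return None
    # The cluster "appears" when its min_members-th protein is deposited.
    return ordered[min_members - 1]


def system_appearance(clusters, min_members=2):
    """Earliest appearance time across all clusters related to one system."""
    times = [
        t
        for t in (cluster_appearance(d, min_members) for d in clusters.values())
        if t is not None
    ]
    return min(times) if times else None


if __name__ == "__main__":
    # Made-up dates for illustration only.
    example = {
        "cluster_1": [date(2015, 3, 1), date(2012, 6, 1), date(2018, 1, 1)],
        "cluster_2": [date(2010, 1, 1)],
    }
    print(system_appearance(example))
```

In this sketch a singleton cluster contributes no appearance time, matching the caption's restriction to non-singleton clusters.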
Fig. 3. Type IV-A CRISPR systems perform directional dsDNA unwinding and strand-specific cleavage.
(A) Locus diagram of the experimentally studied DinG-HNH system from Sulfitobacter sp. JL08. (B) Sequence logo for the PAM of DinG-HNH as determined by a plasmid depletion assay in E. coli. (C) Small RNA-seq of DinG-HNH effector complex RNP pulldown. (D) E. coli transformation assays with DinG-HNH and associated effector complex genes and cognate targets with or without the PAM identified in (B). (E) In vitro reconstituted DinG-HNH and associated effector complex RNP cleavage of linear dsDNA targets. Targets either contain the cognate target site at the 5′ or 3′ end of the target strand (TS) as indicated. Only targets on the 3′ end of the TS are cleaved. NTS: Non-target strand.
Fig. 4. HNH-functionalized Cascade subunits perform precise, RNA-guided dsDNA cleavage.
(A) Locus diagram of the experimentally studied Cas8-HNH system from Selenomonas sp. isolate RGIG9219. (B) Locus diagram of the experimentally studied Cas5-HNH system from Candidatus Cloacimonetes bacterium. (C) Sequence logo for the PAM of Cas8-HNH as determined by a plasmid depletion assay in E. coli. (D) Sequence logo for the PAM of Cas5-HNH as determined by a plasmid depletion assay in E. coli. (E) Small RNA-seq of Cas8-HNH Cascade RNP pulldown. (F) Small RNA-seq of Cas5-HNH Cascade RNP pulldown. (G) In vitro reconstituted Cas8-HNH Cascade RNP cleavage of linear dsDNA targets, in the presence or absence of a cognate target and/or PAM. (H) In vitro reconstituted Cas5-HNH Cascade RNP cleavage of linear dsDNA targets, in the presence or absence of a cognate target and/or PAM. (I) Sanger sequencing of cleavage products generated by Cas8-HNH. (J) Sanger sequencing of cleavage products generated by Cas5-HNH. In both (I) and (J), the polymerase used exhibits non-templated incorporation of a terminal adenine, which results in a thymidine appearing at the end of the trace. (M) HEK293FT genome editing at 4 genomic loci by Cas8-HNH in the presence or absence of each Cascade subunit or cognate guide RNA, or with alanine mutation of HNH domain catalytic residues. Error bars denote SD. *P < 0.05 relative to non-targeting (NT) guide condition. T: Targeting guide. (N) HEK293FT genome editing at 4 genomic loci by Cas5-HNH in the presence or absence of each Cascade subunit or cognate guide RNA, or with alanine mutation of HNH domain catalytic residues. Error bars denote SD. *P < 0.05 relative to non-targeting (NT) guide condition. T: Targeting guide.
Fig. 5. Candidate Type VII CRISPR system
(A) Locus diagram of the experimentally studied candidate type VII system. (B) UPGMA dendrogram from HHPred pairwise alignment scores of related Cas7s. (C) Phylogenetic tree (FastTree) of β-CASP proteins from both bacteria and archaea, including the β-CASP proteins linked to the candidate type VII system, which form a distinct clade. (D) Top: diagram of the domain architecture of Cas14. Bottom: superposition of the C-terminal domain of Cas14 with the C-terminal domain of Cas10 from PDB: 6NUD, showing the Cas10 interface with the target RNA. Both share the four-helix bundle found in Cas10 and Cas11, which are known to interact with the target strand. (E) CDS target strand preferences of the protospacer matches for the CRISPR array of the experimentally studied type VII locus. (F) Targets of the protospacer matches for the CRISPR array of the experimentally studied type VII locus. (G) Small RNA-seq of type VII Cas7-Cas5 RNP pulldown along with the DR sequences. (H) Size exclusion chromatography of the Cas7-Cas5 copurified with an expressed DR + spacer + DR or copurified with an expressed truncated DR + truncated spacer. (I) In vitro reconstituted Cas14 and associated effector complex RNP cleavage of Cy5-labeled RNA targets, in the presence or absence of cognate target sequences. (D66A/H67A) represents mutation of key residues in the predicted catalytic Zn(II) binding pocket of Cas14 to alanine.
Fig. 6. Diverse CRISPR systems identified in this study
Genomic loci of identified systems. See Figs. S12–S14 for the full set of systems. (A) CRISPR-Cas effector modules identified in this study. All enhanced CRISPR association scores are shown below the system name as determined by the pipeline, with the numerator indicating the number of CRISPR / divergent DR associated loci and the denominator indicating the effective sample size of the cluster. HNH: Nuclease domain with HNH or HNN catalytic motifs. DinG: Damage Inducible gene G helicase. VRR: PDDEXK nuclease domain. TPR: Tetratricopeptide repeat. MuA: DDE transposase gene associated with Mu transposons. MuB: ATPase gene associated with Mu transposons. CasMuC: Unique gene associated mainly with the CasMu-I system. β-CASP: Metallo-β-lactamase. (B) Novel associations of CRISPR adaptation modules. Enhanced CRISPR association scores shown as in (A). RVT: Reverse Transcriptase. Tfb2: Transcription factor B subunit 2. WYL: domain named after the 3 conserved amino acids in the domain. AEP: archaeo-eukaryotic primase. PrimPol: Primase Polymerase. HTH: Helix-Turn-Helix domain. CHAT: Caspase HetF Associated with TPRs domain. NACHT: predicted nucleoside-triphosphatase (NTPase) domain. vWA: von Willebrand factor type A. HJR: Holliday Junction Resolvase. RDD: domain named after its conserved amino acids. 23S rRNA IVP: 23S rRNA-Intervening Sequence Protein. ThiF: Sulfur carrier protein ThiS adenylyltransferase. HflK: regulator of FtsH protease. GspH: Type II secretion system protein H. FlhB: Flagellar biosynthetic protein. SWIM: Zinc Finger domain. Toprim: topoisomerase-primase domain. (C) CRISPR-linked CARF/SAVED cyclic oligonucleotide binding domain proteins associated with CRISPR arrays. CARF: CRISPR-Associated Rossmann Fold. TIR: Toll/interleukin-1 receptor/resistance protein. RelA: (p)ppGpp synthetase. CYTH: adenylyl cyclase/thiamine triphosphatase. HD: phosphohydrolase. FleQ: transcriptional regulator. SIR2: sirtuin-like domain. vWA-MoxR-VMAP: classical NTP-dependent ternary system involved in conflict systems. TCAD9: Ternary Complex-Associated Domain 9 associated with vWA-MoxR-VMAP. EAD7: Effector-associated domain 7 associated with vWA-MoxR-VMAP. (D) Putative CRISPR auxiliary genes. Enhanced CRISPR association scores shown as in (A). bZIP: Basic Leucine Zipper Domain. CorA: Magnesium transporter. OmpH: outer membrane protein. NurA 5′−3′ exo: DNA double stranded break-repair associated exonuclease. HerA: DNA-repair associated helicase. Y1 Tpase: Y1 tyrosine recombinase. UvrD: helicase. NERD: Nuclease-related Domain. GreB: Transcription elongation factor. NYN: Novel Predicted RNAses with a PIN Domain-Like Fold. ThiS: Sulfur Carrier Protein. Prok-E2: Prokaryotic E2 family A. DarT: thymidine ADP-ribosylation enzyme. DarG: ADP-ribosylation reversal enzyme. ParD: Antitoxin component of the ParDE toxin-antitoxin system. LPD39: Large polyvalent protein-associated domain 39. PLxRFG: domain characteristic of some very large proteins in bacteria. (E) General evolutionary mechanisms that likely gave rise to the diverse CRISPR-Cas effector modules identified previously and in this study.
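As a reading aid for the score notation used in this caption, the following Python sketch treats a score as the displayed ratio of CRISPR-associated loci to effective sample size and applies the 0.35 cutoff quoted in Fig. 2. It is a simplification under stated assumptions: the function names are hypothetical, the enhanced score computed by the pipeline may involve corrections not described here, and using Neff ≥ 3 as a filter is an assumption rather than a documented step.

```python
def association_score(n_crispr_loci, n_eff):
    """Ratio of CRISPR-associated loci to effective sample size (simplified)."""
    if n_eff <= 0:
        raise ValueError("effective sample size must be positive")
    return n_crispr_loci / n_eff


def passes_cutoff(n_crispr_loci, n_eff, score_cutoff=0.35, min_n_eff=3):
    """Apply the 0.35 score cutoff; the Neff >= 3 filter is an assumption."""
    return n_eff >= min_n_eff and association_score(n_crispr_loci, n_eff) >= score_cutoff


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(passes_cutoff(4, 10))
    print(passes_cutoff(2, 10))
```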

