Nat Commun. 2024 Nov 19;15(1):10024.
doi: 10.1038/s41467-024-54365-0.

Discovering CRISPR-Cas system with self-processing pre-crRNA capability by foundation models

Wenhui Li et al. Nat Commun. 2024.

Abstract

The discovery of CRISPR-Cas systems has paved the way for advanced gene editing tools. However, traditional Cas discovery methods relying on sequence similarity may miss distant homologs and aren't suitable for functional recognition. With protein large language models (LLMs) evolving, there is potential for Cas system modeling without extensive training data. Here, we introduce CHOOSER (Cas HOmolog Observing and SElf-processing scReening), an AI framework for alignment-free discovery of CRISPR-Cas systems with self-processing pre-crRNA capability using protein foundation models. By using CHOOSER, we identify 11 Casλ homologs, nearly doubling the known catalog. Notably, one homolog, EphcCasλ, is experimentally validated for self-processing pre-crRNA, DNA cleavage, and trans-cleavage, showing promise for CRISPR-based pathogen detection. This study highlights an innovative approach for discovering CRISPR-Cas systems with specific functions, emphasizing their potential in gene editing.

Conflict of interest statement

Competing interests The authors declare no competing interests.

Figures

Fig. 1. Schematic diagram of the CHOOSER framework for identifying and functional screening of CRISPR-Cas systems with self-processing pre-crRNA capability.
Spring-like symbols indicate CRISPR arrays, while arrowed rectangles indicate ORFs (Cas9 proteins are colored orange, Cas12 proteins are colored blue, Cas13 proteins are colored green, other proteins are colored gray, suspected Cas proteins are colored salmon, and untyped proteins are uncolored). In the mini-CRISPR array, diamonds indicate direct repeats (DR), and colored oval squares indicate different spacers. In DNA cleavage testing, colored rectangles denote various PAM motifs. Source data are provided as a Source Data file.
Fig. 2. Model trained for Cas single-effector discovery.
a Prokaryotic-origin Cas single effectors and other background proteins used as training and validation datasets. b Viral-origin Cas single effectors and other viral proteins used as a testing dataset. Source data are provided as a Source Data file. c Schematic of the fine-tuned ESM-2 model for discovering Cas single effectors. d Adjustment of hyperparameters for the focal loss to enhance the performance of the classification model. e, f Data distributions for prokaryotic-origin (e) and viral-origin (f) datasets visualized using the representations extracted by the ESM-2 models before and after fine-tuning. g Suspected Cas homologs identified by our fine-tuned model. Orange symbols with crosses represent Cas9 proteins, blue symbols with X’s represent Cas12 proteins, green symbols with squares represent Cas13 proteins and gray symbols with circles represent other proteins. Pie charts with blue outlines indicate proteins of prokaryotic origin, while those with yellow outlines indicate proteins of viral origin.
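Panel d refers to tuning the hyperparameters of a focal loss. As a rough illustration only, the sketch below implements the standard focal loss for a single example in plain Python. The caption does not give the authors' formulation or values, so the parameter names `gamma` and `alpha`, their defaults, and the function name `focal_loss` are assumptions made for this example.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=None, eps=1e-12):
    """Focal loss for one example (standard form, not the authors' code).

    probs  : predicted class probabilities, e.g. [p_other, p_cas9, p_cas12, p_cas13]
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy (assumed default)
    alpha  : optional per-class weights; None means no class weighting (assumption)
    """
    # Probability assigned to the true class, clamped to avoid log(0).
    p_t = min(max(probs[target], eps), 1.0)
    weight = 1.0 if alpha is None else alpha[target]
    # (1 - p_t)^gamma down-weights examples the model already classifies well.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes far less than a wrong one.
easy = focal_loss([0.1, 0.9], target=1)
hard = focal_loss([0.9, 0.1], target=1)
```

With `gamma` above zero, abundant and easily classified background proteins contribute little to the total loss, which is the usual motivation for a focal loss when positives such as Cas effectors are rare.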
Fig. 3. Model trained for predicting Cas12 enzymes capable of self-processing pre-crRNA.
a–c Curves of identity between Cα distance maps from cryo-EM structures and their corresponding predicted protein structures, along a range of thresholds θ: (a) FnCas12a (PDB ID: 6I1K); (b) Casλ (PDB ID: 8DC2); and (c) Casπ (PDB ID: 7YOJ). d Performance of the models in predicting the ability of Cas12 proteins to self-process their pre-crRNA, denoted by F1 scores in the validation (in gray) and testing (in red) datasets. Source data are provided as a Source Data file. e Assorted representations from protein sequences to structures throughout the ESMFold protein structure prediction process. f F1 scores on the testing dataset, illustrating the performance of the models in predicting the ability of Cas12 proteins to self-process pre-crRNA, using varied representations. The heatmap, using a color scale from blue to red, displays F1 scores ranging from low to high. Source data are provided as a Source Data file.
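One plausible reading of the comparison in panels a–c is sketched below: each Cα distance map is binarized at a threshold θ, and identity is the fraction of residue pairs on which the experimental and predicted maps agree. The caption does not define the identity measure, so this definition, the function names, and the toy coordinates are all assumptions made for illustration.

```python
import math

def distance_map(ca_coords):
    """Pairwise Euclidean distances between C-alpha coordinates (x, y, z)."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

def contact_identity(map_a, map_b, theta):
    """Fraction of residue pairs whose contact state (distance <= theta) agrees.

    This is an assumed definition of 'identity', not taken from the paper.
    """
    if len(map_a) != len(map_b):
        raise ValueError("distance maps must cover the same residues")
    n = len(map_a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the map is symmetric
            total += 1
            if (map_a[i][j] <= theta) == (map_b[i][j] <= theta):
                agree += 1
    return agree / total if total else 1.0

# Toy three-residue example (hypothetical coordinates, in angstroms).
experimental = distance_map([(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0)])
predicted = distance_map([(0, 0, 0), (3.8, 0, 0), (12.0, 0, 0)])
curve = {theta: contact_identity(experimental, predicted, theta)
         for theta in (4.0, 8.0, 20.0)}
```

Sweeping θ over a range, as in the `curve` dictionary, gives one identity value per threshold, which is the shape of the curves the caption describes.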
Fig. 4. Analysis of 39 Cas12 candidates.
a Phylogenetic tree of all the suspected Cas12 enzymes against the background of the CasPEDIA type V Cas12 dataset. Source data are provided as Source Data files. b Genomic arrangements of representatives of the five clade candidates. Spring-like symbols indicate CRISPR arrays, while arrowed rectangles represent ORFs (suspected Cas12 proteins are colored salmon, and other proteins are colored gray). c Clade 1 candidates shown phylogenetically against the background of Cas12a homologs. Source data are provided as Source Data files. d Clade 2 candidates shown phylogenetically against the background of Cas12m homologs. e MSAs for the direct repeats of Clade 3 suspected CRISPR systems. MSAs are visualized using SnapGene, with a color scale from blue to red representing MSA identity, ranging from low to high. f Structural alignment showing the RuvC-like domain and canonical D-E-D active site residues within the domain of a putative Casλ homolog in Clade 3. A blue rectangle represents a Casλ candidate protein, while the yellow blocks denote the RuvC-like regions of the protein that align with known RuvC proteins.
Fig. 5. Biochemical characterization of Casλ homologs.
a Structure of pre-crRNA substrates, consisting of a hairpin formed by a direct repeat (DR) sequence followed by a 20 bp spacer. Diamond symbols indicate direct repeats (DR), and colored oval squares indicate different spacers. b Representative gel of Casλ-mediated pre-crRNA cleavage by six Casλ homologs after 60 min incubation with 5′-FAM labeled pre-crRNA substrates. c Pipeline used to detect dsDNA cleavage and associated PAM recognition by in vitro DNA cleavage assay. Casλ RNP complexes cleave a 5′ PAM library PCR product in vitro, and the uncut part was captured via PCR and subjected to Illumina deep sequencing. Gray rectangles indicate 5′ PAM library PCR products, with blue blocks representing spacers and colored short blocks denoting various PAM motifs. Dark green and dark blue squares represent different barcodes on the sequencing adapters. d Six Casλ homologs cleaved dsDNA in vitro at 37 °C for 1 h. A 500 bp PCR product was cleaved into two 250 bp products. e Analysis of Illumina deep sequencing data showing that the presumed PAM of EphcCasλ was TTR. The weblogo of the presumed PAM that supported target recognition and cleavage was generated using WebLogo (Thymine is colored red, Adenine green, Cytosine blue, and Guanine yellow). Source data are provided as a Source Data file. f EphcCasλ and Casλ1 cleaved TTA/TTG dsDNA in vitro at 37 °C for 3 h. PAM was confirmed to be TTA/TTG. g Trans-cleavage assay conducted with the ssDNA-FAM/BHQ reporter. This reporter is a ssDNA oligonucleotide labeled with a fluorophore (FAM) at one end and a quencher (BHQ) at the other. Initially, fluorescence from FAM is quenched by BHQ due to their proximity (top left). Upon recognizing and binding to a target DNA sequence, the Casλ nuclease becomes activated and can non-specifically cleave nearby ssDNA, including the reporter. As a result, the fluorophore (FAM) is separated from the quencher (BHQ), leading to the emission of fluorescence (bottom right). The activated Casλ is shown in pink, the target DNA sequence in black, the PAM motif in yellow, and the guide RNA in blue. h EphcCasλ exhibited trans-cleavage of DNA at 30 °C, 37 °C, and 44 °C when the PAM was TTA/G. The experiments shown in (b, d, f) are representative of three independent experiments with similar results.
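The PAM screen in panels c and e relies on depletion: PAMs that support cleavage should be under-represented among the uncut molecules that are sequenced. The sketch below shows one simple way such counts could be turned into a PAM call. It assumes an uncleaved control library for comparison, a pseudocount of 1, and a depletion cutoff of 0.5, none of which are stated in the caption, and the read lists are invented toy data. The IUPAC code R (A or G) is standard, so TTR covers TTA and TTG.

```python
from collections import Counter

# Subset of IUPAC nucleotide codes needed here (R = purine, A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

def matches_pam(seq, pattern):
    """True if seq is consistent with an IUPAC pattern such as 'TTR'."""
    return len(seq) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(seq, pattern))

def depletion_ratios(control_pams, treated_pams, pseudocount=1.0):
    """Ratio of each PAM's frequency in the uncut (treated) pool to the control.

    Ratios well below 1 suggest the PAM supported cleavage. The control
    library and the pseudocount are assumptions made for this sketch.
    """
    control, treated = Counter(control_pams), Counter(treated_pams)
    c_total, t_total = sum(control.values()), sum(treated.values())
    return {
        pam: ((treated[pam] + pseudocount) / t_total)
             / ((control[pam] + pseudocount) / c_total)
        for pam in control
    }

def depleted_pams(ratios, cutoff=0.5):
    """PAMs whose depletion ratio falls below a hypothetical cutoff."""
    return sorted(pam for pam, ratio in ratios.items() if ratio < cutoff)

# Toy read counts (invented): TTA and TTG are mostly lost after cleavage.
control_reads = ["TTA"] * 10 + ["TTG"] * 10 + ["CCC"] * 10 + ["GGA"] * 10
treated_reads = ["TTA"] * 1 + ["TTG"] * 1 + ["CCC"] * 10 + ["GGA"] * 10
hits = depleted_pams(depletion_ratios(control_reads, treated_reads))
```

In this toy data the depleted PAMs are TTA and TTG, both of which match the pattern TTR reported in panel e.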
