[Preprint]. 2024 Jan 23:2024.01.22.576286.
doi: 10.1101/2024.01.22.576286.

Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum

Sita Sirisha Madugula et al. bioRxiv.

Abstract

The recent development of CRISPR-Cas technology holds promise for correcting gene-level defects in genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that edits the gene of interest with the assistance of a guide RNA. However, Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, which hinder their widespread application as gene-editing tools. There is therefore a need to identify novel Cas proteins with improved editing properties, which requires understanding the underlying features that govern the Cas families. In the current study, we aim to elucidate the unique protein attributes associated with the Cas9 and Cas12 families and to identify the features that distinguish each family from the other. We built Random Forest (RF) binary classifiers that separately distinguish Cas12 proteins and Cas9 proteins from non-Cas proteins, using the complete protein feature spectrum (13,495 features) encoding physicochemical, topological, constitutional, and coevolutionary information of Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All models were evaluated rigorously on the test and independent datasets. The Cas12 and Cas9 binary models achieved overall accuracies of 95% and 97%, respectively, on their independent datasets, while the multiclass classifier achieved an F1 score of 0.97. We observed that Quasi-sequence-order descriptors such as Schneider-lag descriptors and Composition descriptors such as charge, volume, and polarizability are essential for the Cas12 family. More interestingly, we discovered that Amino Acid Composition descriptors, especially the Tripeptide Composition (TPC) descriptors, are important for the Cas9 family. Four of the important descriptors identified for Cas9 classification are the tripeptides PWN, PYY, HHA, and DHI, which are conserved across all the Cas9 proteins and are located within different catalytically important domains of the Cas9 protein structure. Of these four, DHI and HHA are well known to be involved in the DNA cleavage activity of the Cas9 protein. We therefore propose that the other two tripeptides, PWN and PYY, may also be essential for the Cas9 family. The important descriptors we identified enhance the understanding of the catalytic mechanisms of Cas9 and Cas12 proteins and provide valuable insights into the design of novel Cas systems with enhanced gene-editing properties.
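
To make the Tripeptide Composition (TPC) descriptors mentioned above concrete, the following is a minimal sketch of how such a descriptor vector could be computed for one sequence. It assumes the common definition in which each of the 8,000 possible tripeptides is scored as its count divided by the number of overlapping three-residue windows (L - 2); the preprint's exact implementation is not given here, and the function name and toy sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def tripeptide_composition(sequence):
    """Map each of the 8,000 tripeptides to its frequency in `sequence`.

    Assumed definition: count / (L - 2), where L is the sequence length.
    """
    sequence = sequence.upper()
    windows = len(sequence) - 2
    if windows < 1:
        raise ValueError("sequence must contain at least three residues")
    counts = {"".join(t): 0 for t in product(AMINO_ACIDS, repeat=3)}
    for i in range(windows):
        tri = sequence[i:i + 3]
        if tri in counts:  # windows with non-standard residues are skipped
            counts[tri] += 1
    return {tri: n / windows for tri, n in counts.items()}


# Toy sequence for illustration only (not a real Cas9 fragment).
tpc = tripeptide_composition("MDHIPWNAPYYHHA")
```

Each sequence thus yields a fixed-length vector of 8,000 values, so a tripeptide such as DHI becomes one numeric feature that a classifier can rank against the others.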

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1
Workflow of the study. (a) Overview of the study depicting the main steps: feature generation and feature selection, followed by classification. The important features characteristic of the Cas9 and Cas12 families are analyzed further. (b) The detailed steps of our ML pipeline. Protein features calculated in the feature generation step are subjected to a 5-fold BorutaSHAP feature selection method to identify the important features. The intersection of the accepted features across all 5 folds gives the intersection features. Similarly, the union of the accepted features across all 5 folds gives the union features. The important intersection features are then used to develop RF classifiers, which are used to make predictions on the independent set.
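
The intersection and union step in panel (b) can be sketched as plain set operations. This is only an illustration: it does not run BorutaSHAP itself, and the function name and the per-fold feature names below are hypothetical placeholders, not results from the study.

```python
def combine_fold_selections(accepted_per_fold):
    """Return (intersection, union) of the feature sets accepted in each fold."""
    if not accepted_per_fold:
        raise ValueError("need at least one fold")
    sets = [set(features) for features in accepted_per_fold]
    intersection = set.intersection(*sets)  # accepted in every fold
    union = set.union(*sets)                # accepted in at least one fold
    return intersection, union


# Hypothetical accepted-feature lists from 5 folds (placeholder names).
folds = [
    ["f1", "f2", "f3", "f4", "f5"],
    ["f1", "f2", "f3", "f4"],
    ["f1", "f2", "f3", "f4", "f6"],
    ["f1", "f2", "f3", "f4", "f5"],
    ["f1", "f2", "f3", "f4"],
]
intersection_features, union_features = combine_fold_selections(folds)
```

The intersection is the stricter of the two sets, since a feature must be accepted in every fold to survive.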
Figure 2
Summary of the important features in Cas12 vs. Non-Cas classification. (a) Descriptor set-wise distribution of important features identified in the Cas12 vs. Non-Cas RF model. The X-axis represents the descriptor groups to which the important features belong, and the y-axis represents the count of important descriptors within each group. (b) Top 10 descriptors of the Cas12 models.
Figure 3
Summary of the important features in Cas9 vs. Non-Cas classification. (a) Descriptor set-wise distribution of important features identified in the Cas9 RF model. The X-axis represents the descriptor groups to which the important features belong, and the y-axis represents the count of important descriptors within each group. (b) Top 10 descriptors of the Cas9 classification.
Figure 4
Plot of descriptor values of the four important TPC descriptors (a) PWN (b) HHA (c) DHI (d) PYY within the Cas and Non-Cas fractions of our Cas9 training dataset.
Figure 5
Comparison of prevalence of the four tripeptides identified in Cas9 important features across the two Cas families.
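
One way to read "prevalence" here is the fraction of sequences in a family that contain a given tripeptide at least once. The sketch below assumes that definition, which the caption does not state, and uses hypothetical toy sequences rather than real Cas9 or Cas12 proteins.

```python
def tripeptide_prevalence(sequences, tripeptide):
    """Fraction of sequences containing `tripeptide` at least once."""
    if not sequences:
        raise ValueError("need at least one sequence")
    target = tripeptide.upper()
    hits = sum(1 for seq in sequences if target in seq.upper())
    return hits / len(sequences)


# Hypothetical toy sequences for two families (illustration only).
family_a = ["MKDHIAA", "GGDHIPWN", "AAPWNKK"]
family_b = ["MKKLLAA", "GGDHIKK"]

prevalence_a = tripeptide_prevalence(family_a, "DHI")
prevalence_b = tripeptide_prevalence(family_b, "DHI")
```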
Figure 6
Locations of the four identified tripeptides within the different domains of the Cas9 crystal structure [PDB ID: 5F9R]. The tripeptides PWN (475–477) and PYY (449–451) are located in the RECIII domain (beige), while HHA (982–984) and DHI (839–841) are located in the RuvC (blue) and HNH (pink) domains, respectively.
Figure 7
Summary of the important features identified in Cas9 vs. Cas12 vs. Non-Cas multiclass classification. (a) Descriptor group-wise distribution of important features identified in the multiclass RF model. The X-axis represents the descriptor groups to which the important features belong, and the y-axis represents the count of important descriptors within each group. (b) Top 10 descriptors of the Cas9 classification.
Figure 8
Distribution of the g3.g6.g3 descriptor in the multiclass training dataset.
Figure 9
Group-wise distribution of the important features of the three classification models.
