CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences
- PMID: 39695938
- PMCID: PMC11657719
- DOI: 10.1186/s12864-024-11135-y
Abstract
Background: Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.
Results: This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in sequence length, number of genomes, number of clusters, and taxonomic level, demonstrates its robustness, scalability, and versatility.
Conclusion: CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance relative to current approaches. CGRclust demonstrated enhanced reliability by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. It performed especially well on viral DNA datasets, where it consistently outperformed all competing methods.
Keywords: Alignment-free DNA sequence comparison; Chaos Game Representation (CGR); Convolutional neural network; DNA sequence clustering; Data augmentation; Taxonomic classification; Twin contrastive learning; Unsupervised learning.
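To make the idea of a two-dimensional CGR image more concrete, the following is a minimal sketch of a frequency-matrix form of CGR (often called FCGR), in which each k-mer of a sequence is mapped to one cell of a 2^k x 2^k grid and the cell counts act as pixel intensities. It is an illustration only, not the CGRclust implementation: the function name `fcgr`, the corner assignment in `CORNERS`, and the choice of `k` are assumptions made here, and the abstract does not specify them.

```python
# Illustrative sketch only; not taken from the CGRclust code base.
# Assumption: corner layout A=(0,0), C=(0,1), G=(1,1), T=(1,0).
# CGR conventions differ between tools.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out by the chaos game.

    In the chaos game, each nucleotide moves the current point halfway
    towards that nucleotide's corner, so the most recent nucleotide
    selects the top-level quadrant and earlier ones refine it.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous symbols such as N.
        if any(base not in CORNERS for base in kmer):
            continue
        col = row = 0
        # Walk from the last base (coarsest split) to the first (finest).
        for base in reversed(kmer):
            cx, cy = CORNERS[base]
            col = col * 2 + cx
            row = row * 2 + cy
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in fcgr("ACGTACGTTTGA", 2):
        print(line)
```

A grid produced this way can be normalised and passed to a CNN as a single-channel image; larger values of `k` give finer images at the cost of sparser counts for short sequences.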
© 2024. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.
Similar articles
- DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One. 2022 Jan 21;17(1):e0261531. doi: 10.1371/journal.pone.0261531. PMID: 35061715.
- iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences. Bioinformatics. 2023 Sep 2;39(9):btad508. doi: 10.1093/bioinformatics/btad508. PMID: 37589603.
- MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores. BMC Genomics. 2022 Jun 6;23(1):423. doi: 10.1186/s12864-022-08619-0. PMID: 35668366.
- Chaos game representation and its applications in bioinformatics. Comput Struct Biotechnol J. 2021 Nov 10;19:6263-6271. doi: 10.1016/j.csbj.2021.11.008. PMID: 34900136. Review.
- A review of neural networks for metagenomic binning. Brief Bioinform. 2025 Mar 4;26(2):bbaf065. doi: 10.1093/bib/bbaf065. PMID: 40131312. Review.
Cited by
- Positional frequency chaos game representation for machine learning-based classification of crop lncRNAs. bioRxiv [Preprint]. 2025 Jun 7:2025.06.03.657533. doi: 10.1101/2025.06.03.657533. PMID: 40501556. Preprint.