BMC Genomics. 2024 Dec 18;25(1):1214.
doi: 10.1186/s12864-024-11135-y.

CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences

Fatemeh Alipour et al. BMC Genomics. 2024.

Abstract

Background: Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.

Results: This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.

Conclusion: CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving accuracy and performance superior or comparable to those of current approaches. CGRclust demonstrated enhanced reliability by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.

Keywords: Alignment-free DNA sequence comparison; Chaos Game Representation (CGR); Convolutional neural network; DNA sequence clustering; Data augmentation; Taxonomic classification; Twin contrastive learning; Unsupervised learning.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Frequency Chaos Game Representation (FCGR) at resolution k=8 (for visualization purposes) of: (a) human beta globin region on chromosome 11 of length 73,308 bp (Accession ID: U01317.1); (b) complete genome of Homo sapiens isolate LI-T1 mitochondrion of length 16,566 bp (Accession ID: KX228192.1); (c) Escherichia coli plasmid of JE86-ST05 DNA of length 114,953 bp (Accession ID: AP022816.1); and computer-generated “random” DNA sequences of length 100,000 bp avoiding the substrings (d) “CG”, (e) “G”, (f) “CTA”
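The FCGR described in this caption can be illustrated with a minimal sketch that counts k-mers on a 2^k × 2^k grid. The corner assignment (A bottom-left, C top-left, G top-right, T bottom-right) is one common convention and is an assumption here, as are the function name `fcgr` and the choice to skip k-mers containing non-ACGT symbols; none of these details is stated in the abstract.

```python
# Assumed corner convention (not taken from the source): (x_bit, y_bit) per base.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k grid of k-mer counts (rows indexed by y, columns by x)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers with ambiguous symbols such as N (an assumption).
        if any(ch not in CORNERS for ch in kmer):
            continue
        x = y = 0
        # The last base of the k-mer selects the coarsest quadrant,
        # so base j contributes bit j of each coordinate.
        for j, ch in enumerate(kmer):
            bx, by = CORNERS[ch]
            x |= bx << j
            y |= by << j
        grid[y][x] += 1
    return grid
```

Under this convention, a sequence that never contains a given substring leaves the corresponding cells at zero, which is the effect visible in panels d to f.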
Fig. 2
CGRclust pipeline: Left panel: The process begins with data augmentation to create positive pairs (pairs of mimic sequences, pipeline component 1), followed by the generation of FCGR images of these augmented DNA sequences. Middle panel: The FCGR images are fed into the backbone model (CNN) for embedding into a latent feature space (pipeline component 2). The twin contrastive learning scheme employs an instance-level contrastive head (ICH) and a cluster-level contrastive head (CCH) to perform contrastive learning at both the instance and the cluster levels (pipeline components 3 and 4, respectively). Right panel: To counteract the inherent variance in CNN training outcomes, a majority voting strategy is applied, aggregating results from multiple CNN models with distinct initializations to finalize cluster assignments for each input DNA sequence
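The majority voting step in the right panel can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the cluster labels from the different models have already been matched to a common labelling (independent runs generally number their clusters differently, and that matching step is not shown), and the name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(runs):
    """Combine per-sequence cluster labels from several models.

    `runs` is a list of label lists, one per model, all of equal length
    and (by assumption) already using the same cluster numbering.
    """
    if not runs:
        raise ValueError("at least one run is required")
    n = len(runs[0])
    if any(len(r) != n for r in runs):
        raise ValueError("all runs must label the same number of sequences")
    # zip(*runs) groups the votes cast for each sequence across models.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*runs)]
```

For example, three runs that label three sequences as `[0, 1, 2]`, `[0, 1, 1]`, and `[0, 2, 2]` yield the consensus `[0, 1, 2]`.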
Fig. 3
Architecture of the backbone model designed for clustering FCGR images of DNA sequences. The backbone model comprises two convolutional layers, each with a kernel size of 7, stride of 2, and padding of 1. Following each convolutional layer, a Rectified Linear Unit (ReLU) is applied to introduce non-linearity, followed by a batch normalization layer to maintain numerical stability. Next, a max pooling layer with a kernel size of 2 efficiently reduces the spatial dimensions of the feature maps. A flattening layer then transforms the multidimensional feature maps into a one-dimensional vector. This is followed by a linear layer, adjusting the output dimension to the desired configuration
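To make the effect of these layer settings concrete, the sketch below traces the spatial side length of a square input through the backbone using the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. Two points are assumptions rather than facts from the caption: that max pooling is applied once, after both convolutional blocks, with stride equal to its kernel size, and that the input is a 64 × 64 image (as an FCGR at k=6 would be). ReLU and batch normalization leave the shape unchanged, so they are omitted.

```python
def conv_out(n, kernel=7, stride=2, padding=1):
    """Side length after a convolution (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_out(n, kernel=2):
    """Side length after max pooling, assuming stride == kernel and no padding."""
    return n // kernel


def backbone_side(n):
    """Side length of the feature map entering the flattening layer."""
    n = conv_out(n)   # first convolutional layer
    n = conv_out(n)   # second convolutional layer
    return pool_out(n)  # single max pooling layer (assumed placement)
```

With these assumptions, a 64 × 64 input shrinks to 30 × 30, then 13 × 13, then 6 × 6, so the linear layer would receive 36 values per output channel.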
Fig. 4
Evolution of CGRclust’s clustering of 498 Cypriniformes mitochondrial DNA sequences into three distinct clusters in Test 1. Each data point represents a DNA sequence; its colour indicates its suborder label, and its position indicates the likelihood of assignment to the different clusters (corners). A point at the center of the triangle has an equal probability of being assigned to any of the three clusters, while a point at a corner indicates a definitive association, with probability 1, to that specific corner/cluster. Note that any overlap of colours in the last epoch corresponds to instances of misclustering, where sequences have not been correctly assigned to the ground truth cluster
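The placement of points in such a triangle follows from barycentric coordinates: each point is the probability-weighted average of the three corner positions. The sketch below is illustrative only; the corner coordinates (an equilateral triangle with unit side) and the name `ternary_xy` are assumptions, not details from the figure.

```python
import math

# Assumed corner positions for clusters 1, 2, 3 (equilateral triangle, side 1).
TRIANGLE_CORNERS = ((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2))


def ternary_xy(probs):
    """Map three cluster-assignment probabilities to a 2-D plot position."""
    if len(probs) != 3:
        raise ValueError("exactly three probabilities are required")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    weights = [p / total for p in probs]  # renormalize defensively
    x = sum(w * cx for w, (cx, _) in zip(weights, TRIANGLE_CORNERS))
    y = sum(w * cy for w, (_, cy) in zip(weights, TRIANGLE_CORNERS))
    return x, y
```

A probability vector of (1, 0, 0) lands exactly on the first corner, while (1/3, 1/3, 1/3) lands at the centroid of the triangle, matching the reading of the figure given in the caption.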

