BMC Genomics. 2025 Apr 18;26(1):389.
doi: 10.1186/s12864-025-11554-5.

Alignment-free viral sequence classification at scale

Daniel J van Zyl et al. BMC Genomics.

Abstract

Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3,502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
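As an illustration of what a word-based AF feature vector can look like, the following minimal sketch computes normalized k-mer frequencies for a single sequence. It is not the authors' implementation: the function name `kmer_frequencies`, the choice of k, the lexicographic ACGT ordering, and the decision to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 3) -> list[float]:
    """Return normalized k-mer frequencies in lexicographic ACGT order.

    Hypothetical helper for illustration; not taken from the paper.
    """
    alphabet = "ACGT"
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed ordering of all 4**k words so vectors are comparable across genomes.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    # Assumption: windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


# A toy sequence; real genomes would be read from FASTA files.
vector = kmer_frequencies("ACGTACGTNACG", k=2)
```

Fixed-length vectors of this kind could then be passed to a Random Forest implementation (for example, scikit-learn's `RandomForestClassifier`) with lineage labels as targets. The other AF techniques evaluated in the study would yield different feature vectors.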

Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.

Keywords: Alignment-free; Biological sequences; Feature extraction; Machine learning; Virus classification.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
The SARS-CoV-2 testing accuracy results of each AF feature extraction method on a class-wise basis. The findings provide insights into the distribution of model performance across classes. Classes are ordered in descending order of average classification accuracy across all models. The standard deviations of the accuracy for each model are also depicted. For visual clarity, the values depicted have been smoothed using a sliding window of 50 classes. On the right-hand side, we provide isolated views of the top-performing models.
Fig. 2
Comparison of the performance of each AF feature extraction model on nonrecombinant (blue) and recombinant (orange) genomes. The horizontal lines in each violin plot indicate the models' achieved accuracy for different individual classes, while the width of the violin plots represents the density of samples at different accuracy levels.
Fig. 3
Hexbin plots of the interaction between classification accuracy and the number of training samples for each class (left), the depth of each lineage (middle), and the number of direct descendants of each lineage (right) for each SARS-CoV-2 model. We only show depth and number-of-descendants interactions for nonrecombinant sequences.
Fig. 4
A composite figure of the class-wise classification performance of the top three performing models (FCGR, k-mers, and SWF) on the 200 most prominent SARS-CoV-2 lineages. The inner plot consists of a radar chart, where optimal performance corresponds to observations near the perimeter of the chart. The outer figure shows a circular bar plot in which the bars correspond to the depth of the SARS-CoV-2 lineages and are colored according to the respective clades of the lineages.

Cited by

  • Craft: A Machine Learning Approach to Dengue Subtyping.
    van Zyl DJ, Dunaiski M, Tegally H, Baxter C; INFORM Africa research study group; de Oliveira T, Xavier JS. bioRxiv [Preprint]. 2025 Feb 13:2025.02.10.637410. doi: 10.1101/2025.02.10.637410. PMID: 39990353.
