BMC Genomics. 2025 Apr 18;26(1):389.

doi: 10.1186/s12864-025-11554-5.

Alignment-free viral sequence classification at scale

Daniel J van Zyl^{1,2}, Marcel Dunaiski³, Houriiyah Tegally⁴, Cheryl Baxter^{4,5}, Tulio de Oliveira^{4,5,6,7}, Joicymara S Xavier^{4,8,9}; INFORM Africa research study group

Affiliations

¹ Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa. danielvanzyl@sun.ac.za.
² Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa. danielvanzyl@sun.ac.za.
³ Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa.
⁴ Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa.
⁵ Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa.
⁶ Kwazulu-Natal Research Innovation and Sequencing Platform (KRISP), Nelson R Mandela School of Medicine, University of Kwazulu-Natal, Durban, South Africa.
⁷ Department of Global Health, University of Washington, Seattle, USA.
⁸ Institute of Agricultural Sciences, Universidade Federal dos Vales do Jequitinhonha e Mucuri (UFVJM), Unaí, Brazil.
⁹ Institute of Biological Sciences, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil.

PMID: 40251515
PMCID: PMC12007369
DOI: 10.1186/s12864-025-11554-5

Daniel J van Zyl et al. BMC Genomics. 2025.

Abstract

Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
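As an illustration of what a word-based AF feature vector can look like, the following minimal sketch computes normalized k-mer frequencies for a nucleotide sequence. It is not the authors' implementation: the function name `kmer_frequencies`, the default `k=3`, and the choice to skip windows containing ambiguous bases are all assumptions made for this example, since the abstract does not state these details.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=3):
    """Return a fixed-length vector of normalized k-mer frequencies.

    Hypothetical helper for illustration only. Windows containing
    characters outside A/C/G/T (for example N) are ignored.
    """
    # Enumerate all 4**k possible k-mers in a fixed lexicographic order,
    # so every sequence maps to a vector with the same layout.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made entirely of A/C/G/T contribute to the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


if __name__ == "__main__":
    vec = kmer_frequencies("ACGTACGT", k=2)
    print(len(vec))  # one entry per possible 2-mer
```

Vectors of this kind, one per genome, could then be passed as rows of a feature matrix to a Random Forest implementation such as scikit-learn's `RandomForestClassifier`; no alignment step is needed because each vector has the same length regardless of genome length.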

Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing than alignment-based methods and the ability to classify sequences using modest computational resources.

Keywords: Alignment-free; Biological sequences; Feature extraction; Machine learning; Virus classification.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
The SARS-CoV-2 testing accuracy results of each AF feature extraction method on a class-wise basis. The findings provide insights into the distribution of model performance across classes. Classes are ordered in descending order of average classification accuracy across all models. The standard deviations of the accuracy for each model are also depicted. For the purposes of visual clarity, the values depicted have been smoothed using a sliding window of 50 classes. On the right-hand side, we provide isolated views of the top performing models.

**Fig. 2**
Comparison of the performance of each AF feature extraction model on nonrecombinant (blue) and recombinant (orange) genomes. The horizontal lines in each violin plot indicate the models' achieved accuracy for different individual classes, while the width of the violin plots represents the density of samples at different accuracy levels.

**Fig. 3**
Hexbin plots of the interaction between classification accuracy and the number of training samples for each class (left), the depth of each lineage (middle), and the number of direct descendants of each lineage (right) for each SARS-CoV-2 model. Depth and number-of-descendants interactions are shown only for nonrecombinant sequences.

**Fig. 4**
A composite figure of the class-wise classification performance of the top three performing models, FCGR, k-mers, and SWF, on the 200 most prominent SARS-CoV-2 lineages. The inner plot consists of a radar chart, where optimal performance corresponds to observations near the perimeter of the chart. The outer figure shows a circular bar plot in which the bars correspond to the depth of the SARS-CoV-2 lineages and are colored according to the respective clades of the lineages.
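Fig. 4 names FCGR (frequency chaos game representation) among the top-performing feature extraction methods. For readers unfamiliar with it, the sketch below shows one common way to build an FCGR grid: each base moves a point halfway toward a corner of the unit square, and the visited positions are counted on a 2^k by 2^k grid. This is a generic illustration, not the authors' code. The corner assignment in `CORNER_BITS`, the function name `fcgr`, and the handling of ambiguous bases are assumptions; integer bit shifts are used in place of floating-point halving so that cell indices stay exact on long genomes.

```python
# Assumed corner layout (x, y): A=(0,0), C=(0,1), G=(1,1), T=(1,0).
# Other layouts are in use; the paper's choice is not stated in the abstract.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Return a 2**k x 2**k grid of chaos-game point counts.

    Hypothetical helper for illustration. Each cell corresponds to
    exactly one k-mer, so the grid is a 2D arrangement of k-mer counts.
    """
    n = 1 << k
    grid = [[0] * n for _ in range(n)]
    x = y = 0   # k-bit integer coordinates of the current point
    run = 0     # number of consecutive unambiguous bases seen so far
    for base in sequence.upper():
        corner = CORNER_BITS.get(base)
        if corner is None:
            # Ambiguous base (for example N): restart the k-mer window.
            run = 0
            continue
        bx, by = corner
        # Halving toward a corner is a right shift with the corner bit
        # inserted as the new most significant bit.
        x = (x >> 1) | (bx << (k - 1))
        y = (y >> 1) | (by << (k - 1))
        run += 1
        if run >= k:
            # All k bits now come from the last k bases.
            grid[y][x] += 1
    return grid


if __name__ == "__main__":
    for row in fcgr("ACGTACGT", 2):
        print(row)
```

Flattening the grid row by row gives a fixed-length vector that can be used in the same way as a k-mer count vector.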

Update of

Alignment-Free Viral Sequence Classification at Scale.
van Zyl DJ, Dunaiski M, Tegally H, Baxter C; INFORM Africa research study group; de Oliveira T, Xavier JS. bioRxiv [Preprint]. 2024 Dec 11:2024.12.10.627186. doi: 10.1101/2024.12.10.627186. Update in: BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5. PMID: 39713356.
