This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Dec 11:2024.12.10.627186.

doi: 10.1101/2024.12.10.627186.

Alignment-Free Viral Sequence Classification at Scale

Daniel J van Zyl^{1,2}, Marcel Dunaiski², Houriiyah Tegally¹, Cheryl Baxter^{1,3}; INFORM Africa research study group; Tulio de Oliveira^{1,3,4,5}, Joicymara S Xavier^{1,6,7}

Affiliations

¹ Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa.
² Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa.
³ Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa.
⁴ KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), Nelson R Mandela School of Medicine, University of KwaZulu-Natal, Durban, South Africa.
⁵ Department of Global Health, University of Washington, Seattle, USA.
⁶ Institute of Agricultural Sciences, Universidade Federal dos Vales do Jequitinhonha e Mucuri (UFVJM), Unaí, Brazil.
⁷ Institute of Biological Sciences, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil.

PMID: 39713356
PMCID: PMC11661207
DOI: 10.1101/2024.12.10.627186

Daniel J van Zyl et al. bioRxiv. 2024.


Update in

Alignment-free viral sequence classification at scale.
van Zyl DJ, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier JS; INFORM Africa research study group. BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5. PMID: 40251515.

Abstract

Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are increasingly overwhelmed by the scale of contemporary datasets due to their high computational demands for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
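As an illustration of what a word-based AF feature vector looks like, the following is a minimal sketch of normalized k-mer frequency extraction, one of the methods named in the figure captions. The function name `kmer_frequencies`, the default `k=3`, and the decision to skip words containing ambiguous bases are assumptions made for this example; the abstract does not state the parameters or implementation the authors used.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return a normalized k-mer frequency vector for a nucleotide sequence.

    Hypothetical helper for illustration only; not the authors' code.
    """
    sequence = sequence.upper()

    # Fixed ordering of all len(alphabet)**k possible words, so that every
    # sequence maps to a vector with the same layout.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]

    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    # Assumption: words containing ambiguous bases (e.g. N) are ignored,
    # so only words in the vocabulary contribute to the total.
    valid_total = sum(counts[word] for word in vocabulary)
    if valid_total == 0:
        return [0.0] * len(vocabulary)

    return [counts[word] / valid_total for word in vocabulary]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTACGTNACG", k=2)
    print(len(vector), round(sum(vector), 6))
```

Vectors of this kind have a fixed length regardless of genome length, which is what allows them to be passed to a standard classifier such as a Random Forest without aligning the sequences first.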

Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing than alignment-based methods and the ability to classify sequences using modest computational resources.

Keywords: alignment-free; biological sequences; feature extraction; machine learning; virus classification.

Figures

**Fig. 1**
Class-wise SARS-CoV-2 testing accuracy of each AF feature extraction method, showing how model performance is distributed across classes. Classes are ordered by descending average classification accuracy across all models. The standard deviation of the accuracy for each model is also depicted. For visual clarity, the values have been smoothed using a sliding window of 50 classes. On the right-hand side, we provide isolated views of the top-performing models.

**Fig. 2**
Comparison of the performance of each AF feature extraction model on nonrecombinant (blue) and recombinant (orange) genomes. The horizontal lines in each violin plot indicate the accuracy achieved by the model for individual classes, while the width of the violin plot represents the density of samples at different accuracy levels.

**Fig. 3**
Hexbin plots of the interaction between classification accuracy and the number of training samples for each class (left), the depth of each lineage (middle), and the number of direct descendants of each lineage (right) for each SARS-CoV-2 model. Depth and number-of-descendants interactions are shown only for nonrecombinant sequences.

**Fig. 4**
A composite figure of the class-wise classification performance of the top three performing models, FCGR, k-mers, and SWF, on the 200 most prominent SARS-CoV-2 lineages. The inner plot is a radar chart, where optimal performance corresponds to observations near the perimeter of the chart. The outer plot is a circular bar plot in which the bars correspond to the depth of the SARS-CoV-2 lineages and are colored according to the respective clades of the lineages.
