Alignment-free viral sequence classification at scale
- PMID: 40251515
- PMCID: PMC12007369
- DOI: 10.1186/s12864-025-11554-5
Abstract
Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.
Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
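As an illustration of what a word-based AF representation can look like, the following is a minimal sketch of k-mer frequency extraction. It is not one of the six techniques evaluated in the study, which are not named in the abstract; the function names `kmer_index` and `kmer_frequencies`, the default `k=3`, and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_index(k):
    """Map every possible k-mer over ACGT to a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_frequencies(sequence, k=3):
    """Return a normalized k-mer frequency vector of length 4**k.

    Windows containing characters outside ACGT (for example N) are skipped,
    which is one simple way to handle ambiguous bases (hypothetical choice).
    """
    index = kmer_index(k)
    counts = [0] * len(index)
    seq = sequence.upper()
    total = 0
    # Slide a window of width k across the sequence; no alignment is needed.
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTACGTNNACGT", k=2)
    print(len(vector))  # 4**2 positions, one per possible 2-mer
```

Every genome maps to a vector of the same length regardless of sequence length, so such vectors could be passed directly to a standard classifier, for instance scikit-learn's `RandomForestClassifier`, in line with the Random Forest setup the abstract describes.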
Keywords: Alignment-free; Biological sequences; Feature extraction; Machine learning; Virus classification.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.
Update of
- Alignment-Free Viral Sequence Classification at Scale. bioRxiv [Preprint]. 2024 Dec 11:2024.12.10.627186. doi: 10.1101/2024.12.10.627186. Update in: BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5. PMID: 39713356. Preprint.