Alignment-free viral sequence classification at scale
- PMID: 40251515
- PMCID: PMC12007369
- DOI: 10.1186/s12864-025-11554-5
Abstract
Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.
Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3,502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
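To illustrate what a word-based AF representation can look like, the following is a minimal sketch of a normalized k-mer frequency vector. It is a generic example, not the authors' pipeline: the abstract does not name the six AF techniques used, and the function name `kmer_frequencies`, the default `k=3`, and the choice to skip words containing ambiguous bases are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return a normalized k-mer frequency vector for a nucleotide sequence.

    Hypothetical helper for illustration only. The vector has 4**k entries,
    one per word over the alphabet ACGT, in lexicographic order.
    """
    alphabet = "ACGT"
    # Map every possible k-mer to a fixed position in the feature vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    total = 0
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        position = index.get(word)
        if position is not None:  # assumption: skip words with ambiguous bases such as N
            counts[position] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    # Normalize so that sequences of different lengths are comparable.
    return [c / total for c in counts]


vec = kmer_frequencies("ACGTACGTNACGT", k=2)
```

Because every sequence maps to a vector of the same length regardless of genome size, vectors of this kind could in principle be passed to a standard classifier such as a Random Forest without any alignment step.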
Keywords: Alignment-free; Biological sequences; Feature extraction; Machine learning; Virus classification.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.
Update of
- Alignment-Free Viral Sequence Classification at Scale. bioRxiv [Preprint]. 2024 Dec 11:2024.12.10.627186. doi: 10.1101/2024.12.10.627186. Update in: BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5. PMID: 39713356.