[Preprint]. 2024 Dec 11:2024.12.10.627186.
doi: 10.1101/2024.12.10.627186.

Alignment-Free Viral Sequence Classification at Scale

Daniel J van Zyl et al. bioRxiv.

Update in

  • Alignment-free viral sequence classification at scale.
    van Zyl DJ, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier JS; INFORM Africa research study group. BMC Genomics. 2025 Apr 18;26(1):389. doi: 10.1186/s12864-025-11554-5. PMID: 40251515. Free PMC article.

Abstract

Background: The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are computationally demanding and are increasingly overwhelmed by the scale of contemporary datasets when used for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

Results: We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
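As an illustration of the kind of feature extraction described above, the following is a minimal sketch of a k-mer frequency vector, one of the AF representations named in the figure captions. The function name `kmer_vector`, the default k = 3, and the choice to skip windows containing ambiguous bases are assumptions made for this example, not the authors' implementation.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration only. Windows that contain a
    symbol outside `alphabet` (for example N) are ignored.
    """
    seq = sequence.upper()
    # Fixed vocabulary so that every genome maps to a vector of equal length.
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize by the number of valid windows only.
    total = sum(counts[word] for word in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[word] / total for word in vocab]


if __name__ == "__main__":
    vec = kmer_vector("ACGTACGT", k=2)
    print(len(vec), round(sum(vec), 6))
```

With k = 3 this gives a 64-dimensional vector per genome regardless of genome length. Fixed-length vectors of this kind are what a classifier such as the Random Forest mentioned above would take as input.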

Conclusion: Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.

Keywords: alignment-free; biological sequences; feature extraction; machine learning; virus classification.

Figures

Fig. 1
The SARS-CoV-2 testing accuracy results of each AF feature extraction method on a class-wise basis. The findings provide insights into the distribution of model performance across classes. Classes are ordered in descending order of average classification accuracy across all models. The standard deviations of the accuracy for each model are also depicted. For the purposes of visual clarity, the values depicted have been smoothed using a sliding window of 50 classes. On the right-hand side, we provide isolated views of the top performing models.
Fig. 2
Comparison of the performance of each AF feature extraction model on nonrecombinant genomes (blue) and recombinant (orange) genomes. The horizontal lines in each violin plot indicate the models’ achieved accuracy for different individual classes, while the width of the violin plots represents the density of samples at different accuracy levels.
Fig. 3
Hexbin plots of the interaction between classification accuracy and the number of training samples for each class (left), the depth of each lineage (middle), and the number of direct descendants of each lineage (right) for each SARS-CoV-2 model. The depth and number-of-descendants interactions are shown only for nonrecombinant sequences.
Fig. 4
A composite figure of the class-wise classification performance of the top three performing models, FCGR, k-mers, and SWF, on the 200 most prominent SARS-CoV-2 lineages. The inner plot consists of a radar chart, where optimal performance corresponds to observations near the perimeter of the chart. The outer figure shows a circular bar plot in which the bars correspond to the depth of the SARS-CoV-2 lineages and are colored according to the respective clades of the lineages.
