Improving taxonomic classification with feature space balancing
- PMID: 37577265
- PMCID: PMC10415173
- DOI: 10.1093/bioadv/vbad092
Abstract
Summary: Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Current approaches either apply local alignment against existing databases, as in MMseqs2, or use deep neural networks, as in DeepMicrobes and BERTax. Due to the increasing size of datasets and databases, alignment-based approaches are expensive in terms of runtime. Deep learning-based approaches can require specialized hardware and consume large amounts of energy. In this article, we propose to use k-mer profiles of DNA sequences as features for taxonomic classification. Although k-mer profiles have been used before, we were able to significantly increase their predictive power by applying a feature space balancing approach to the training data. This greatly improved the generalization quality of the classifiers. We have implemented different pipelines using our proposed feature extraction and dataset balancing in combination with different simple classifiers, such as bagged decision trees or feature subspace KNNs. By comparing the performance of our pipelines with state-of-the-art algorithms, such as BERTax and MMseqs2, on two different datasets, we show that our pipelines outperform these in almost all classification tasks. In particular, sequences from organisms that were not part of the training data were classified with high precision.
Availability and implementation: The open-source code and the code to reproduce the results are available in Seafile, at https://tinyurl.com/ysk47fmr.
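As an illustration of the k-mer profile features mentioned in the summary, the following is a minimal sketch of how such a profile could be computed. It is not the authors' implementation: the function name `kmer_profile`, the choice of k, the normalization to relative frequencies, and the skipping of windows with ambiguous bases are all assumptions made for this example. The feature space balancing step is not shown because the abstract does not describe it.

```python
from itertools import product


def kmer_profile(sequence, k=3):
    """Return relative frequencies of all 4**k DNA k-mers in `sequence`.

    Hypothetical helper for illustration; normalization and the handling
    of ambiguous bases are assumptions, not taken from the article.
    """
    alphabet = "ACGT"
    # Fixed lexicographic ordering so every sequence maps to the same feature layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    # Divide by the number of valid windows; an empty profile stays all zeros.
    return [counts[m] / total if total else 0.0 for m in kmers]


# Example: a 2-mer profile has 4**2 = 16 entries.
profile = kmer_profile("ACGTACGTNACG", k=2)
```

A vector like `profile` could then be passed as one feature row to a simple classifier of the kind named in the summary, for example bagged decision trees.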
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
© The Author(s) 2023. Published by Oxford University Press.
Conflict of interest statement
The authors have no conflicts of interest to declare. All authors have seen and agree with the contents of the manuscript and there is no financial interest to report.
Similar articles
- Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proc Natl Acad Sci U S A. 2022 Aug 30;119(35):e2122636119. doi: 10.1073/pnas.2122636119. Epub 2022 Aug 26. PMID: 36018838. Free PMC article.
- Higher-order Markov models for metagenomic sequence classification. Bioinformatics. 2020 Aug 15;36(14):4130-4136. doi: 10.1093/bioinformatics/btaa562. PMID: 32516355.
- Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics. 2021 Sep 29;37(18):3029-3031. doi: 10.1093/bioinformatics/btab184. PMID: 33734313. Free PMC article.
- Large-scale machine learning for metagenomics sequence classification. Bioinformatics. 2016 Apr 1;32(7):1023-32. doi: 10.1093/bioinformatics/btv683. Epub 2015 Nov 20. PMID: 26589281. Free PMC article.
- Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics. 2017 Jul 15;33(14):i92-i101. doi: 10.1093/bioinformatics/btx234. PMID: 28881969. Free PMC article.
Cited by
- Taxometer: Improving taxonomic classification of metagenomics contigs. Nat Commun. 2024 Sep 27;15(1):8357. doi: 10.1038/s41467-024-52771-y. PMID: 39333501. Free PMC article.
- PCVR: a pre-trained contextualized visual representation for DNA sequence classification. BMC Bioinformatics. 2025 May 9;26(1):125. doi: 10.1186/s12859-025-06136-x. PMID: 40346458. Free PMC article.