Brief Bioinform. 2024 Nov 22;26(1):bbae680.
doi: 10.1093/bib/bbae680.

kMetaShot: a fast and reliable taxonomy classifier for metagenome-assembled genomes

Giuseppe Defazio et al. Brief Bioinform. 2024.

Abstract

The advent of high-throughput sequencing (HTS) technologies unlocked the complexity of the microbial world through the development of metagenomics, which now provides an unprecedented and comprehensive overview of its taxonomic and functional contribution in a huge variety of macro- and micro-ecosystems. In particular, shotgun metagenomics allows the reconstruction of microbial genomes through the assembly of reads into metagenome-assembled genomes (MAGs). MAGs represent an information-rich proxy for inferring the taxonomic composition and the functional contribution of microbiomes, although the relevant analytical approaches are not trivial and can still be improved. In this regard, tools like CAMITAX and GTDBtk have implemented complex approaches that rely on marker gene identification and sequence alignments and require long processing times. With the aim of deploying an effective tool for fast and reliable MAG taxonomic classification, we present here kMetaShot, a taxonomy classifier based on k-mer/minimizer counting. We benchmarked kMetaShot against CAMITAX and GTDBtk using both in silico and real mock communities and demonstrated that, while implementing a fast and concise algorithm, it outperforms the other tools in terms of classification accuracy. Additionally, kMetaShot is an easy-to-install and easy-to-use bioinformatic tool that is also suitable for researchers with few command-line skills. It is available and documented at https://github.com/gdefazio/kMetaShot.

Keywords: k-mer; minimizer; shotgun metagenomics; taxonomic classification.

Figures

Figure 1
Flowchart representing the kMetaShot workflow. The left (green) and central (yellow) boxes refer to the reference generation and classification modules, respectively. Both modules rely on k-mer/minimizer counting. The reference generation module identifies nonredundant minimizers and stores them in the storage matrix. The classification module compares the minimizers found in MAGs to those in the storage matrix. Finally, the right (light-blue) box refers to a general workflow for MAG generation.
Figure 2
Example of minimizer computation. The nucleotide sequence is first decomposed into k-mers (k-mer counting). Each k-mer is then decomposed into n-mers, and the lexicographically smallest n-mer is selected as the minimizer. The lexicographical order is A < C < G < T. Selected minimizer sequences are coloured in orange. k = 5, n = 3.
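The selection rule in this caption can be illustrated with a minimal Python sketch. It is not kMetaShot's implementation: the function names and the example sequence are hypothetical, and reverse-complement (canonical) handling is deliberately left out because the caption does not describe it. Python's default string comparison already orders uppercase nucleotides as A < C < G < T.

```python
def kmers(seq, k):
    """Return every overlapping k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, n):
    """Return the lexicographically smallest n-mer contained in one k-mer."""
    return min(kmer[i:i + n] for i in range(len(kmer) - n + 1))


def minimizers(seq, k=5, n=3):
    """Pair each k-mer of seq with its minimizer (k = 5, n = 3 as in the caption)."""
    return [(km, minimizer(km, n)) for km in kmers(seq, k)]


if __name__ == "__main__":
    example = "GATTACAGT"  # hypothetical sequence, not taken from the paper
    for km, m in minimizers(example):
        print(km, m)
```

Adjacent k-mers often share the same minimizer, which is why storing minimizers rather than full k-mers tends to reduce the number of distinct keys.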
Figure 3
Graphical representation of the storage matrix. Black and white slots represent redundant minimizers (i.e. those shared among different genera) and empty ones, respectively. Filled slots refer to the relevant minimizer. For instance, the slot identified by row 229-2 and column 3 is associated with taxid 1883 (Streptomyces).
Figure 4
kMetaShot classification performance on HMP genomes according to different ass2ref thresholds and stratified per taxonomic rank (i.e. strain, species, and genus levels). (A) Stacked barplot representing the number of assigned and correctly assigned genomes. (B) Barplot representing the observed specificity, F1-score, and balanced accuracy (BA) according to ass2ref. (C) Number of species and strains not represented in the kMetaShot reference and not correctly classified. (D) Barplot representing the observed false positive rate (FPR) according to increasing ass2ref and stratified per taxonomic rank.
Figure 5
Boxplots representing the sensitivity, precision, FPR, BA, and F1-score distribution measured for the benchmarked tools for both PacBio- and Illumina-simulated reads in the CAMI2 GI data. Data are stratified according to the analysed taxonomic rank (i.e. genus, species, and strain).
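For readers unfamiliar with the metrics named in these captions, the sketch below computes them from confusion-matrix counts using the standard textbook definitions. How the paper derives true and false positives and negatives at each taxonomic rank is not stated in this record, so the function name, its inputs, and the example counts are all assumptions made for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp, fp, tn, fn are hypothetical counts of true/false positives/negatives.
    """
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    fpr = ratio(fp, fp + tn)           # false positive rate = 1 - specificity
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    ba = (sensitivity + specificity) / 2  # balanced accuracy
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "FPR": fpr,
        "F1": f1,
        "BA": ba,
    }


if __name__ == "__main__":
    # Made-up counts, not results from the paper.
    for name, value in classification_metrics(tp=8, fp=2, tn=18, fn=2).items():
        print(f"{name}: {value:.3f}")
```

Balanced accuracy averages sensitivity and specificity, so it stays informative when assigned and unassigned genomes are present in very different numbers.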
Figure 6
Evaluation of the computational resources required to address the classification task. In particular, RAM and CPU consumption were measured in hours. All tools were tested by using 10 CPUs. In (A), (C), and (E), the CPU load, memory engagement, and AUC for memory usage are shown, respectively, for Air-Illumina Sample12. (B), (D), and (F) show the same data for Air-PacBio Sample4.
