Bioinformatics. 2024 Sep 1;40(Suppl 2):ii165-ii173.
doi: 10.1093/bioinformatics/btae397.

Metagenomic functional profiling: to sketch or not to sketch?

Mahmudur Rahman Hera et al. Bioinformatics.

Abstract

Motivation: Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. In general, k-mer-based sketching techniques have been successfully used in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe how to leverage FracMinHash, a k-mer-sketching algorithm implemented in the publicly available software sourmash, to obtain functional profiles of metagenome samples.
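To make the sketching idea concrete, the following is a minimal illustration of a FracMinHash-style sketch: every k-mer is hashed, and only hashes falling below a fixed fraction of the hash space are kept. It is not the sourmash implementation. The function names, the default `k` and `scaled` values, and the use of `hashlib.blake2b` as a stand-in hash are all assumptions made for illustration.

```python
import hashlib

MAX_HASH = 2**64  # size of the hash space (assumption: 64-bit hashes)


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash_sketch(seq, k=7, scaled=4):
    """Keep only hashes below MAX_HASH // scaled.

    On average this retains about 1/scaled of the distinct k-mers,
    so the sketch size grows with the size of the input.
    """
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}
```

Because the same threshold is applied to every sequence, a sketch built with a larger `scaled` value is always a subset of one built with a smaller value, which is what allows sketches of a reference and a sample to be compared directly.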

Results: We show how pieces of the sourmash software (and the resulting FracMinHash sketches) can be put together in a pipeline to functionally profile a metagenomic sample. We named our pipeline fmh-funprofiler. We report that, on simulated metagenomic data, the functional profiles obtained using this pipeline demonstrate comparable completeness and better purity than the profiles obtained using alignment-based methods. We also report that fmh-funprofiler is 39-99× faster in wall-clock time and consumes 40-55× less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets.
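The matching step of such a pipeline can be pictured as a set-overlap search: each ortholog sketch is compared with the metagenome sketch, and orthologs sharing enough hashes are reported. The sketch below is a simplified illustration under stated assumptions, not the behavior of sourmash itself. Sketches are modeled as plain Python sets of integer hashes, and `containment`, `find_overlapping`, and the `min_shared` cutoff are hypothetical names chosen for this example. How the pipeline turns such matches into abundances is not described here, so that step is left out.

```python
def containment(a, b):
    """Fraction of the hashes in sketch a that also occur in sketch b."""
    return len(a & b) / len(a) if a else 0.0


def find_overlapping(ortholog_sketches, metagenome_sketch, min_shared=1):
    """Return {ortholog name: containment in the metagenome}.

    Only orthologs sharing at least min_shared hashes with the
    metagenome sketch are reported (min_shared is a hypothetical cutoff).
    """
    hits = {}
    for name, sketch in ortholog_sketches.items():
        shared = len(sketch & metagenome_sketch)
        if shared >= min_shared:
            hits[name] = containment(sketch, metagenome_sketch)
    return hits
```

With toy inputs such as `{"K00001": {1, 2, 3, 4}, "K00002": {10, 11}}` and a metagenome sketch `{1, 2, 3, 99}`, only `K00001` is reported, because `K00002` shares no hashes with the sample.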

Availability and implementation: This fast and lightweight metagenomic functional profiler is freely available and can be accessed here: https://github.com/KoslickiLab/fmh-funprofiler. All scripts of the analyses we present in this manuscript can be found on GitHub.

Conflict of interest statement

None declared.

Figures

Figure 1.
fmh-funprofiler identifies orthologs in a metagenome by splitting orthologs into k-mers and creating FracMinHash sketches with sourmash sketch. It then sketches the input metagenome file in the same way and computes the sample composition using the orthologs’ k-mers with sourmash prefetch. Finally, it post-processes the output to compute the abundances.
Figure 2.
Average performance in identifying KOs in simulated metagenomes. The metagenomes are 6.4 Gb in size and contain 6.1–6.9 thousand KOs. Metrics shown: (a) purity, (b) completeness in all KOs, (c) completeness in top 95% abundant KOs, (d) weighted Jaccard index between ground truth KOs and identified KOs, (e) Pearson correlation coefficient of relative abundances computed by the tools and the ground truth, and (f) Bray–Curtis distance of the relative abundances computed by the tools and the ground truth. Every point shows an average over 30 random seeds (30 different simulations), and error bars indicate 1 SD.
Figure 3.
Computational resources consumed by DIAMOND and fmh-funprofiler in identifying KOs in 30 simulated metagenomes: (a) total wall-clock time and (b) peak memory usage. The metagenomes are the same ones used to generate Figure 2. DIAMOND was run using 128 threads, and fmh-funprofiler using only 30 (one for each input). We found that fmh-funprofiler runs 39–99× faster, and uses 40–55× less memory compared to DIAMOND.
Figure 4.
Differential analysis for T2D versus HSS. (a) Top 10 differential KOs in T2D samples compared to healthy samples. (b) Top 10 differential KEGG pathways in T2D samples compared to healthy samples.
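The evaluation metrics named in the Figure 2 caption can be illustrated with a short sketch. It assumes the conventional definitions: purity as the fraction of reported KOs that are truly present, completeness as the fraction of true KOs that are reported, weighted Jaccard as the sum of element-wise minima over the sum of maxima, and Bray–Curtis as the sum of absolute differences over the sum of totals. The exact definitions used in the paper may differ, and all function names are hypothetical. Profiles are modeled as dictionaries mapping KO identifiers to relative abundances, with absent KOs treated as zero.

```python
import math


def purity_completeness(truth, predicted):
    """Purity and completeness of a predicted KO set against the truth."""
    tp = len(truth & predicted)
    purity = tp / len(predicted) if predicted else 0.0
    completeness = tp / len(truth) if truth else 0.0
    return purity, completeness


def _aligned(truth, pred):
    """Abundance vectors over the union of KOs, missing entries as 0."""
    keys = sorted(set(truth) | set(pred))
    return [truth.get(k, 0.0) for k in keys], [pred.get(k, 0.0) for k in keys]


def weighted_jaccard(truth, pred):
    x, y = _aligned(truth, pred)
    denom = sum(max(a, b) for a, b in zip(x, y))
    return sum(min(a, b) for a, b in zip(x, y)) / denom if denom else 0.0


def bray_curtis(truth, pred):
    x, y = _aligned(truth, pred)
    denom = sum(a + b for a, b in zip(x, y))
    return sum(abs(a - b) for a, b in zip(x, y)) / denom if denom else 0.0


def pearson(truth, pred):
    x, y = _aligned(truth, pred)
    n = len(x)
    if n == 0:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

For example, with a ground truth of `{"K1": 0.5, "K2": 0.3, "K3": 0.2}` and a prediction of `{"K1": 0.5, "K2": 0.5}`, every reported KO is correct but one true KO is missed, so purity is 1 and completeness is 2/3.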
