J Open Source Softw. 2022;7(80):4908.
doi: 10.21105/joss.04908. Epub 2022 Dec 28.

Metagenomic classification with KrakenUniq on low-memory computers

Christopher Pockrandt et al. J Open Source Softw. 2022.

Abstract

Kraken and KrakenUniq are widely used tools for classifying metagenomic sequences. A key requirement for these systems is a database containing all k-mers from all genomes that the users want to be able to detect, where k = 31 by default. This database can be very large, easily exceeding 100 gigabytes (GB) and sometimes reaching 400 GB. Previously, Kraken and KrakenUniq required loading the entire database into main memory (RAM), and if RAM was insufficient, they used memory mapping, which significantly increased the running time for large datasets. We have implemented a new algorithm in KrakenUniq that allows it to load and process the database in chunks, with only a modest increase in running time. This enhancement makes it feasible to run KrakenUniq on very large datasets and huge databases on virtually any computer, even a laptop, while providing the same very high classification accuracy as the previous system.
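The chunked idea described above can be illustrated with a minimal sketch. This is not KrakenUniq's implementation, and none of the names below come from its code: `kmers`, `iter_chunks`, and `classify_in_chunks` are hypothetical helpers, the database is assumed to be a sorted list of (k-mer, taxon ID) pairs, and an in-memory list stands in for a file that would really be read from disk one chunk at a time. The sketch only shows why processing one chunk at a time can return the same hits as holding the whole database in RAM.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, Iterator, List, Sequence, Tuple

K = 31  # default k-mer length stated in the abstract

Entry = Tuple[str, int]  # (k-mer, taxon ID); this layout is an assumption


def kmers(read: str, k: int = K) -> Iterator[str]:
    """Yield every overlapping k-mer of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def iter_chunks(db: Sequence[Entry], chunk_size: int) -> Iterator[Sequence[Entry]]:
    """Yield consecutive slices of the sorted database.

    Stand-in for reading one chunk from disk at a time, so that only
    `chunk_size` entries need to be resident in memory at once.
    """
    for start in range(0, len(db), chunk_size):
        yield db[start:start + chunk_size]


def classify_in_chunks(
    reads: Sequence[str], db: Sequence[Entry], chunk_size: int, k: int = K
) -> Dict[str, int]:
    """Map each read k-mer found in the database to its taxon ID."""
    # Sort the query k-mers once so each chunk only inspects its own key range.
    queries: List[str] = sorted({km for read in reads for km in kmers(read, k)})
    hits: Dict[str, int] = {}
    for chunk in iter_chunks(db, chunk_size):
        keys = [km for km, _ in chunk]
        lo = bisect_left(queries, keys[0])
        hi = bisect_right(queries, keys[-1])
        for q in queries[lo:hi]:
            i = bisect_left(keys, q)
            if i < len(keys) and keys[i] == q:
                hits[q] = chunk[i][1]
    return hits


if __name__ == "__main__":
    # Toy data with k = 3 for readability; taxon IDs are made up.
    toy_db = [("AAA", 1), ("ACG", 2), ("CGT", 2), ("GTT", 3), ("TTT", 4)]
    print(classify_in_chunks(["ACGTT"], toy_db, chunk_size=2, k=3))
```

In this sketch the chunk size changes only how much of the database is held at once, not which k-mers are matched, which mirrors the abstract's claim that accuracy is unchanged.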

Statement of need: The KrakenUniq software classifies reads from metagenomic samples to establish which organisms are present in the samples and to estimate their abundance. The software is widely used by researchers and clinicians in medical diagnostics, microbiome studies, and environmental studies. Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database into RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, with a CPU time increase of only 3- to 4-fold, down from the 100-fold or greater increase seen with memory mapping.
