Genome Biol. 2018 Nov 16;19(1):198.
doi: 10.1186/s13059-018-1568-0.

KrakenUniq: confident and fast metagenomics classification using unique k-mer counts

F P Breitwieser et al. Genome Biol.

Abstract

False-positive identifications are a significant problem in metagenomics classification. We present KrakenUniq, a novel metagenomics classifier that combines the fast k-mer-based classification of Kraken with an efficient algorithm for assessing the coverage of unique k-mers found in each species in a dataset. On various test datasets, KrakenUniq gives better recall and precision than other methods and effectively classifies and distinguishes pathogens with low abundance from false positives in infectious disease samples. By using the probabilistic cardinality estimator HyperLogLog, KrakenUniq runs as fast as Kraken and requires little additional memory. KrakenUniq is freely available at https://github.com/fbreitwieser/krakenuniq.
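The cardinality estimation that the abstract attributes to HyperLogLog can be illustrated with a textbook version of the estimator. What follows is a minimal sketch and not KrakenUniq's implementation: the class name `HyperLogLog`, the SHA-1-derived 64-bit hash, and the simple linear-counting correction for small counts are all assumptions chosen for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p one-byte registers (assumes p >= 7)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, item):
        # Illustrative 64-bit hash: the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)          # linear counting for small counts
        return raw


hll = HyperLogLog(p=14)
for i in range(100_000):
    hll.add(f"kmer-{i}")
    hll.add(f"kmer-{i}")  # repeated items do not change the registers
print(round(hll.estimate()))
```

The counter's memory use is fixed by p regardless of how many items are added, which is the property that lets one counter be kept per taxon at little extra cost.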

Keywords: Infectious disease diagnosis; Metagenomics; Metagenomics classification; Microbiome; Pathogen detection.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Overview of the KrakenUniq algorithm and output. (a) An input read is shown as a long gray rectangle, with k-mers shown as shorter rectangles below it. The taxon mappings for each k-mer are compared to the database, shown as larger rectangles on the right. For each taxon, a unique k-mer counter is instantiated, and the observed k-mers (K7, K8, and K9) are added to the counters. (b) Unique k-mer counting is implemented with the probabilistic estimation method HyperLogLog (HLL) using 16 KB of memory per counter, which limits the error in the cardinality estimate to 1% (see main text). (c) The output includes the number of reads, unique k-mers, duplicity (the average number of times each k-mer has been seen), and coverage for each taxon observed in the input data
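The per-taxon bookkeeping described in this caption can be sketched in a few lines. This is a simplified illustration under stated assumptions: exact Python sets stand in for the HyperLogLog counters, each read is assigned to the taxon with the most k-mer hits rather than by Kraken's taxonomy-aware scoring, and the function name `classify_counts` and the toy database are hypothetical. Coverage is omitted because it also needs the number of database k-mers per taxon.

```python
from collections import defaultdict


def classify_counts(reads, kmer_to_taxon, k):
    """Return {taxon: (reads, unique_kmers, duplicity)} for a toy k-mer database."""
    read_counts = defaultdict(int)
    kmer_hits = defaultdict(int)       # total k-mer observations per taxon
    unique_kmers = defaultdict(set)    # exact stand-in for one HLL counter per taxon

    for read in reads:
        hits = defaultdict(int)
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            taxon = kmer_to_taxon.get(kmer)
            if taxon is None:
                continue               # k-mer not in the database
            hits[taxon] += 1
            kmer_hits[taxon] += 1
            unique_kmers[taxon].add(kmer)
        if hits:
            # Simplification: the taxon with the most k-mer hits wins the read.
            read_counts[max(hits, key=hits.get)] += 1

    return {
        taxon: (
            read_counts.get(taxon, 0),
            len(kmers),
            kmer_hits[taxon] / len(kmers),   # duplicity
        )
        for taxon, kmers in unique_kmers.items()
    }


# Hypothetical 3-mer database and reads, for illustration only.
toy_db = {"ACG": "A", "CGT": "A", "GTT": "B"}
print(classify_counts(["ACGT", "ACGT", "GTTT"], toy_db, k=3))
```

In the toy input, taxon A receives two reads but only two distinct k-mers, so its duplicity is 2. A taxon with many reads and few unique k-mers is the pattern the unique k-mer count is meant to expose.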
Fig. 2
Cardinality estimation using HyperLogLog for randomly sampled k-mers from microbial genomes. Left: standard deviations of the relative errors of the estimate with precision p ranging from 10 to 18. No systematic biases are apparent, and, as expected, the errors decrease with higher values of p. Up to cardinalities of about 2^p/4, the relative error is near zero. At higher cardinalities, the error boundaries stay nearly constant. Right: the size of the registers, space requirement, and expected relative error for HyperLogLog cardinality estimates with different values of p. For example, with a precision p = 14, the expected relative error is 0.81%, and the counter only requires 16 KB of space, which is three orders of magnitude less than that of an exact counter (at a cardinality of one million). Up to cardinalities of 2^p/4, KrakenUniq uses a sparse representation of the counter with a higher precision of 25 and an effective relative error rate of about 0.02%
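The numbers quoted in this caption are consistent with the standard HyperLogLog error bound, in which the relative standard error is roughly 1.04/sqrt(m) for m = 2^p registers. The sketch below tabulates that bound for the precisions shown in the figure. The one-byte-per-register size is an assumption that happens to reproduce the 16 KB quoted for p = 14, and the function names are illustrative.

```python
import math


def hll_relative_error(p):
    """Approximate relative standard error of HyperLogLog with 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)


def hll_dense_bytes(p, bytes_per_register=1):
    """Dense counter size, assuming one byte per register."""
    return (2 ** p) * bytes_per_register


for p in range(10, 19):
    print(p, f"{hll_relative_error(p):.4%}", hll_dense_bytes(p) // 1024, "KB")

# Precision 25, as used for the sparse representation mentioned in the caption.
print(25, f"{hll_relative_error(25):.4%}")
```

For p = 14 this gives 1.04/128, or about 0.81%, and for p = 25 about 0.018%, which matches the caption's "about 0.02%".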
Fig. 3
Unique k-mer count separates true and false positives better than read counts in a complex dataset with ten million reads sampled from SRA experiments. Each dot represents a species, with true species in orange and false species in black. The dashed and dotted lines show the k-mer thresholds for the ideal F1 score and recall at a maximum of 5% FDR, respectively. In this dataset, a unique k-mer count in the range 10,000–20,000 would give the best threshold for selecting true species
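The idea of choosing a unique k-mer threshold by F1 score can be made concrete with a small sketch. The species counts and labels below are hypothetical values invented for illustration, not data from the paper, and the function names are assumptions.

```python
def f1_at_threshold(species, threshold):
    """F1 score when every species with count >= threshold is called present."""
    tp = sum(1 for count, is_true in species if count >= threshold and is_true)
    fp = sum(1 for count, is_true in species if count >= threshold and not is_true)
    fn = sum(1 for count, is_true in species if count < threshold and is_true)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(species):
    """Try each observed count as a threshold and keep the one with the best F1."""
    candidates = sorted({count for count, _ in species})
    return max(candidates, key=lambda t: f1_at_threshold(species, t))


# Hypothetical (unique k-mer count, is_true_species) pairs.
toy_species = [
    (50_000, True), (30_000, True), (15_000, True),
    (9_000, False), (500, False), (200, False),
]
threshold = best_threshold(toy_species)
print(threshold, f1_at_threshold(toy_species, threshold))
```

The same loop could rank species by read count instead, which is the comparison the figure makes.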
Fig. 4
Deeper sequencing depths require higher unique k-mer count thresholds to control the false-positive rate and achieve the best recall. A minimum threshold of about 2000 unique k-mers per million reads gives the best results in this dataset (solid line in plot); see Additional file 3: Table S8 for more details
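The depth-scaled threshold in this caption amounts to simple arithmetic, sketched below. The function names and the example taxa are hypothetical, and the default of 2000 per million reads is the value the caption reports for this dataset only, so it should not be read as a general recommendation.

```python
def min_unique_kmers(n_reads, per_million=2000):
    """Unique k-mer threshold scaled linearly with sequencing depth."""
    return per_million * n_reads / 1_000_000


def filter_taxa(unique_kmer_counts, n_reads, per_million=2000):
    """Keep taxa whose unique k-mer count meets the depth-scaled threshold."""
    threshold = min_unique_kmers(n_reads, per_million)
    return {t: c for t, c in unique_kmer_counts.items() if c >= threshold}


# Hypothetical counts for a sample with ten million reads.
counts = {"taxon_x": 45_000, "taxon_y": 1_200}
print(min_unique_kmers(10_000_000))
print(filter_taxa(counts, n_reads=10_000_000))
```

For ten million reads the scaled threshold is 20,000 unique k-mers, which sits at the upper end of the 10,000 to 20,000 range given for the ten-million-read dataset in Fig. 3.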

