Genome Biol. 2018 Nov 16;19(1):198.

doi: 10.1186/s13059-018-1568-0.

KrakenUniq: confident and fast metagenomics classification using unique k-mer counts

F P Breitwieser¹, D N Baker²,³, S L Salzberg⁴,⁵,⁶

Affiliations

¹ Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA. florian.bw@gmail.com.
² Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
³ Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
⁴ Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA. salzberg@jhu.edu.
⁵ Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. salzberg@jhu.edu.
⁶ Departments of Biomedical Engineering and Biostatistics, Johns Hopkins University, Baltimore, MD, USA. salzberg@jhu.edu.

PMID: 30445993
PMCID: PMC6238331
DOI: 10.1186/s13059-018-1568-0

F P Breitwieser et al. Genome Biol. 2018.

Abstract

False-positive identifications are a significant problem in metagenomics classification. We present KrakenUniq, a novel metagenomics classifier that combines the fast k-mer-based classification of Kraken with an efficient algorithm for assessing the coverage of unique k-mers found in each species in a dataset. On various test datasets, KrakenUniq gives better recall and precision than other methods and effectively classifies and distinguishes pathogens with low abundance from false positives in infectious disease samples. By using the probabilistic cardinality estimator HyperLogLog, KrakenUniq runs as fast as Kraken and requires little additional memory. KrakenUniq is freely available at https://github.com/fbreitwieser/krakenuniq.

Keywords: Infectious disease diagnosis; Metagenomics; Metagenomics classification; Microbiome; Pathogen detection.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Overview of the KrakenUniq algorithm and output. (a) An input read is shown as a long gray rectangle, with k-mers shown as shorter rectangles below it. The taxon mappings for each k-mer are looked up in the database, shown as larger rectangles on the right. For each taxon, a unique k-mer counter is instantiated, and the observed k-mers (K7, K8, and K9) are added to the counters. (b) Unique k-mer counting is implemented with the probabilistic estimation method HyperLogLog (HLL) using 16 KB of memory per counter, which limits the error in the cardinality estimate to 1% (see main text). (c) The output includes the number of reads, unique k-mers, duplicity (average number of times each k-mer has been seen), and coverage for each taxon observed in the input data.
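The per-taxon bookkeeping described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KrakenUniq's implementation: exact Python sets stand in for the HyperLogLog counters, a read is assigned to the taxon hit by most of its k-mers (a simplification of Kraken's taxonomy-aware assignment), and coverage is assumed to mean unique k-mers observed divided by the taxon's k-mers in the database. The names `kmers`, `summarize`, and `kmer_to_taxon` are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq (no canonicalisation, for brevity)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def summarize(reads, kmer_to_taxon, k):
    """Return per-taxon reads, unique k-mers, duplicity, and coverage.

    kmer_to_taxon is a toy stand-in for the database: k-mer -> taxon name.
    """
    db_sizes = Counter(kmer_to_taxon.values())  # k-mers per taxon in the database
    n_reads = Counter()        # reads assigned to each taxon
    observations = Counter()   # k-mer hits per taxon, counting repeats
    unique = defaultdict(set)  # exact set of distinct k-mers (HLL in the paper)

    for read in reads:
        hits = Counter()
        for kmer in kmers(read, k):
            taxon = kmer_to_taxon.get(kmer)
            if taxon is None:
                continue  # k-mer not in the database
            hits[taxon] += 1
            observations[taxon] += 1
            unique[taxon].add(kmer)
        if hits:
            # Assumption: assign the read to its most frequently hit taxon.
            n_reads[hits.most_common(1)[0][0]] += 1

    return {
        taxon: {
            "reads": n_reads[taxon],
            "unique_kmers": len(seen),
            "duplicity": observations[taxon] / len(seen),
            "coverage": len(seen) / db_sizes[taxon],
        }
        for taxon, seen in unique.items()
    }
```

With a toy database of three k-mers for taxon A, three identical copies of the read `ACGT` (k = 3) yield 3 reads but only 2 unique k-mers and a duplicity of 3.0. Many reads stacked on the same few k-mers is the pattern that the unique k-mer count is meant to expose.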

**Fig. 2**
Cardinality estimation using HyperLogLog for randomly sampled k-mers from microbial genomes. Left: standard deviations of the relative errors of the estimate with precision p ranging from 10 to 18. No systematic biases are apparent, and, as expected, the errors decrease with higher values of p. Up to cardinalities of about 2^p/4, the relative error is near zero. At higher cardinalities, the error boundaries stay nearly constant. Right: the size of the registers, space requirement, and expected relative error for HyperLogLog cardinality estimates with different values of p. For example, with a precision p = 14, the expected relative error is 0.81%, and the counter only requires 16 KB of space, which is three orders of magnitude less than that of an exact counter (at a cardinality of one million). Up to cardinalities of 2^p/4, KrakenUniq uses a sparse representation of the counter with a higher precision of 25 and an effective relative error rate of about 0.02%.
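The figures quoted for p = 14 follow from the standard HyperLogLog relations: 2^p registers and an expected relative error of about 1.04/sqrt(2^p). The sketch below is a minimal, classic HyperLogLog (the original Flajolet et al. estimator with linear counting for small cardinalities), written only to make those relations concrete. It is not KrakenUniq's code: it assumes one byte per register, uses SHA-1 as a stand-in hash, and omits the sparse representation mentioned in the caption. The class and function names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal dense HyperLogLog with 2**p one-byte registers."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                     # number of registers
        self.registers = bytearray(self.m)  # 16 KB when p = 14

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value (an assumption;
        # any well-mixed 64-bit hash would do).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = number of leading zeros in the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)      # linear counting for small sets
        return raw


def expected_relative_error(p):
    """Standard error of the HyperLogLog estimate with 2**p registers."""
    return 1.04 / math.sqrt(1 << p)
```

For p = 14, `expected_relative_error(14)` evaluates to 1.04/128 = 0.008125, the 0.81% quoted in the caption, and the register array holds 16,384 bytes, the quoted 16 KB.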

**Fig. 3**
Unique k-mer count separates true and false positives better than read counts in a complex dataset with ten million reads sampled from SRA experiments. Each dot represents a species, with true species in orange and false species in black. The dashed and dotted lines show the k-mer thresholds for the ideal F1 score and recall at a maximum of 5% FDR, respectively. In this dataset, a unique k-mer count in the range 10,000–20,000 would give the best threshold for selecting true species.
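How a threshold with the best F1 score can be located is sketched below, assuming each species comes with a unique k-mer count and a known true/false label, as in a benchmark dataset. The function name `best_f1_threshold` and the input layout are hypothetical, and the procedure is a generic threshold sweep rather than the authors' evaluation code.

```python
def best_f1_threshold(species):
    """Sweep candidate thresholds and return (threshold, f1) with the highest F1.

    species: list of (unique_kmer_count, is_true) pairs.
    A species is called present when its count is >= the threshold.
    """
    total_true = sum(1 for _, is_true in species if is_true)
    best = (None, 0.0)
    for threshold in sorted({count for count, _ in species}):
        called = [is_true for count, is_true in species if count >= threshold]
        tp = sum(called)  # true species among those called
        if tp == 0:
            continue
        precision = tp / len(called)
        recall = tp / total_true
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:  # keep the lowest threshold on ties
            best = (threshold, f1)
    return best
```

On made-up toy data such as `[(50, False), (120, False), (300, False), (900, True), (15000, True), (40000, True)]`, the sweep returns `(900, 1.0)`, the lowest count that admits every true species and no false one.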

**Fig. 4**
Greater sequencing depths require higher unique k-mer count thresholds to control the false-positive rate and achieve the best recall. A minimum threshold of about 2000 unique k-mers per million reads gives the best results in this dataset (solid line in plot); see Additional file 3: Table S8 for more details.
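The depth-dependent rule of thumb in this caption amounts to simple arithmetic, sketched here with hypothetical helper names. The default of 2000 unique k-mers per million reads is the value reported for this dataset and should not be read as a universal setting.

```python
def min_unique_kmers(n_reads, per_million=2000):
    """Scale the unique k-mer cutoff linearly with sequencing depth."""
    return per_million * n_reads / 1_000_000


def filter_taxa(unique_kmer_counts, n_reads, per_million=2000):
    """Keep taxa whose unique k-mer count reaches the depth-scaled cutoff.

    unique_kmer_counts: dict mapping taxon name -> unique k-mer count.
    """
    cutoff = min_unique_kmers(n_reads, per_million)
    return {t: n for t, n in unique_kmer_counts.items() if n >= cutoff}
```

For a sample of ten million reads, `min_unique_kmers(10_000_000)` gives 20,000, the upper end of the 10,000–20,000 range reported for the ten-million-read dataset in Fig. 3.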
