Nat Commun. 2016 Apr 13:7:11257.
doi: 10.1038/ncomms11257.

Fast and sensitive taxonomic classification for metagenomics with Kaiju

Peter Menzel et al. Nat Commun.

Abstract

Metagenomics emerged as an important field of research not only in microbial ecology but also for human health and disease, and metagenomic studies are performed on increasingly larger scales. While recent taxonomic classification programs achieve high speed by comparing genomic k-mers, they often lack sensitivity for overcoming evolutionary divergence, so that large fractions of the metagenomic reads remain unclassified. Here we present the novel metagenome classifier Kaiju, which finds maximum (in-)exact matches on the protein-level using the Burrows-Wheeler transform. We show in a genome exclusion benchmark that Kaiju classifies reads with higher sensitivity and similar precision compared with current k-mer-based classifiers, especially in genera that are underrepresented in reference databases. We also demonstrate that Kaiju classifies up to 10 times more reads in real metagenomes. Kaiju can process millions of reads per minute and can run on a standard PC. Source code and web server are available at http://kaiju.binf.ku.dk.

Figures

Figure 1. Genus-level sensitivity and precision.
Sensitivity and precision are shown as averages for each bin of genera, for the five different types of reads and the three programs. The x-axis denotes the number of genomes in the genus and the total number of genomes in that category. For example, 212 of the 882 measured genomes belong to the 106 genera with only 2 available genomes, and the data points show the mean sensitivity and precision across all 212 genomes in that category. Kaiju was run in MEM mode with length threshold m=11 amino acids and in Greedy mode with either 1 or up to 5 allowed mismatches and a score threshold s=65. Kraken uses k=31 and Clark was run with both k=31 and k=20, which is denoted by the dotted line.
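The binning described in this caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes the common definitions (sensitivity = correctly classified reads / all reads, precision = correctly classified reads / classified reads), which may differ in detail from those used in the paper. The function names, the dictionary keys and the read counts in the usage note are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sensitivity(correct, total_reads):
    # Assumed definition: fraction of all input reads assigned to the correct genus.
    return correct / total_reads if total_reads else 0.0


def precision(correct, classified):
    # Assumed definition: fraction of classified reads assigned to the correct genus.
    return correct / classified if classified else 0.0


def bin_means(genomes):
    """Average per-genome sensitivity and precision within each bin.

    `genomes` is a list of dicts with the hypothetical keys
    'genus_size' (number of genomes available in the genus),
    'correct', 'classified' and 'total' (read counts).
    Returns {genus_size: (mean_sensitivity, mean_precision)}.
    """
    bins = defaultdict(list)
    for g in genomes:
        bins[g["genus_size"]].append(g)
    return {
        size: (
            mean(sensitivity(g["correct"], g["total"]) for g in members),
            mean(precision(g["correct"], g["classified"]) for g in members),
        )
        for size, members in bins.items()
    }
```

For example, calling `bin_means` on two made-up genomes that both have `genus_size` 2 returns a single entry for bin 2 holding the mean of the two per-genome values, which is how one data point in such a plot would be formed.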
Figure 2. Average sensitivity and precision.
For each of the five types of reads, sensitivity and precision were averaged over all 882 measured genomes in the benchmark, showing the overall performance of each program.
Figure 3. Classification of real metagenomes.
Percentage of classified reads in 10 real metagenomes for Kaiju MEM (m=12) and Greedy-5 (s=70), as well as Kraken (k=31). The Merged column shows the percentage of reads that are classified by at least one of Greedy-5 or Kraken. The Venn-Bar-diagram visualizes the percentage of reads that are classified either only by Kraken (blue), Greedy-5 (orange) or both (yellow). Grey bars in the human and cat samples denote the percentage of reads mapped to the respective host genomes.
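The Merged column and the three Venn-bar segments amount to simple set operations on the read identifiers classified by each program. The sketch below assumes that each program's output has already been reduced to a collection of classified read IDs. The function name, the dictionary keys and the example IDs are hypothetical and are not part of Kaiju or Kraken.

```python
def venn_percentages(kraken_ids, greedy_ids, total_reads):
    """Return the percentage of reads classified by Kraken only,
    Greedy-5 only, both, and at least one of the two (merged)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    kraken_ids, greedy_ids = set(kraken_ids), set(greedy_ids)

    def pct(n):
        return 100.0 * n / total_reads

    return {
        "kraken_only": pct(len(kraken_ids - greedy_ids)),
        "greedy_only": pct(len(greedy_ids - kraken_ids)),
        "both": pct(len(kraken_ids & greedy_ids)),
        "merged": pct(len(kraken_ids | greedy_ids)),  # union of both programs
    }
```

By construction, the merged value equals the sum of the three disjoint segments, which is why the stacked bar and the Merged column agree.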
Figure 4. Classification speed.
Performance was measured in processed reads per second for each program, using 25 parallel threads to classify a set of 27.24 million simulated reads for each of the five read types.
Figure 5. Kaiju's algorithm.
First, a sequencing read is translated into the six possible reading frames and the resulting amino acid sequences are split into fragments at stop codons. Fragments are then sorted either by their length (MEM mode) or by their BLOSUM62 score (Greedy mode). This sorted list of fragments is then searched against the reference protein database using the backwards search algorithm on the BWT. While MEM mode only allows exact matches, Greedy mode extends matches at their left end by allowing substitutions. Once the remaining fragments in the list are shorter than the best match obtained so far (MEM) or cannot achieve a better score (Greedy), the search stops and the taxon identifier of the corresponding database sequence is retrieved.
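The first two steps of this description (six-frame translation, splitting at stop codons, and the length-based ordering of MEM mode) can be sketched in a few lines. This is an illustrative sketch only, not Kaiju's implementation. It assumes the standard genetic code, maps any codon containing a non-ACGT base to "X", and omits the BLOSUM62 ordering of Greedy mode as well as the BWT search itself. All function names and the `min_len` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    # Translate complete codons only; unknown codons (e.g. containing N) become 'X'.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def sorted_fragments(read, min_len=1):
    """Translate a read in all six frames, split at stop codons and
    return the fragments sorted by decreasing length (MEM-style order)."""
    read = read.upper()
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments.extend(f for f in protein.split("*") if len(f) >= min_len)
    return sorted(fragments, key=len, reverse=True)
```

Sorting the fragments longest-first is what makes the stopping rule in the caption work: once the next fragment in the list is shorter than the best match found so far, no later fragment can yield a longer exact match, so the search can end early. A `min_len` equal to the MEM length threshold would discard fragments that can never produce a reportable match.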

References

    1. Riesenfeld C., Schloss P. & Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu. Rev. Genet. 38, 525–552 (2004). - PubMed
    1. Shokralla S., Spall J., Gibson J. & Hajibabaei M. Next-generation sequencing technologies for environmental DNA research. Mol. Ecol. 21, 1794–1805 (2012). - PubMed
    1. Segata N. et al.. Computational meta'omics for microbial community studies. Mol. Syst. Biol. 9, 666 (2013). - PMC - PubMed
    1. Kinross J., von Roon A., Holmes E., Darzi A. & Nicholson J. The human gut microbiome: implications for future health care. Curr. Gastroenterol. Rep. 10, 396–403 (2008). - PubMed
    1. Wade W. The oral microbiome in health and disease. Pharmacol. Res. 69, 137–143 (2013). - PubMed

Publication types