BMC Bioinformatics. 2016 Jul 28:17:292.
doi: 10.1186/s12859-016-1159-6.

Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations

Aaron Y Lee et al. BMC Bioinformatics. 2016.

Abstract

Background: Next-generation sequencing technology has enabled metagenomic characterization through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, have made it challenging to identify microbial organisms by DNA sequence. We sought to validate the Scalable Metagenomics Alignment Research Tool (SMART), a novel search heuristic for shotgun metagenomics sequencing results.

Results: After retrieving all genomic DNA sequences from the NCBI GenBank, over 1 × 10^11 base pairs of 3.3 × 10^6 sequences from 9.25 × 10^5 species were indexed using 4 base pair hashtable shards. A MapReduce searching strategy was used to distribute the search workload in a computing cluster environment. In addition, a one base pair permutation algorithm was used to account for single nucleotide polymorphisms and sequencing errors. Simulated datasets previously used to evaluate Kraken, a similar metagenomics classification tool, were used to measure and compare precision and accuracy. Finally, using the same set of training sequences, we compared Kraken, CLARK, and SMART within the same computing environment. Utilizing 12 computational nodes, we completed the classification of all datasets in under 10 min each using exact matching, with an average throughput of over 1.95 × 10^6 reads classified per minute. With permutation matching, we achieved sensitivity greater than 83% and precision greater than 94% with simulated datasets at the species classification level. We demonstrated the application of this technique to conjunctival and gut microbiome metagenomics sequencing results. In our head-to-head comparison, SMART and CLARK had similar accuracy gains over Kraken at the species classification level, but SMART required approximately half the amount of RAM of CLARK.
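The indexing and permutation ideas in the Results can be illustrated with a minimal Python sketch. It is not the SMART implementation: the k-mer length, the toy reference sequences, the majority-vote rule, and every function name (`shard_key`, `build_index`, `one_base_variants`, `lookup`, `classify`) are assumptions made for illustration. The sketch routes each k-mer to a shard by its first 4 bases and, when a k-mer has no exact hit, retries every single-base substitution of it. All shards live in one process here, whereas a MapReduce-style deployment would presumably place them on different nodes.

```python
from collections import Counter, defaultdict

BASES = "ACGT"
K = 8        # illustrative k-mer length (assumption, not taken from the paper)
PREFIX = 4   # shard key length, mirroring the 4 base pair shards in the text


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def shard_key(kmer):
    """Route a k-mer to a shard by its first PREFIX bases (at most 4**4 = 256 shards)."""
    return kmer[:PREFIX]


def build_index(references):
    """Map shard -> k-mer -> set of species labels whose sequence contains that k-mer."""
    index = defaultdict(lambda: defaultdict(set))
    for species, seq in references.items():
        for kmer in kmers(seq):
            index[shard_key(kmer)][kmer].add(species)
    return index


def one_base_variants(kmer):
    """Yield every sequence that differs from kmer by exactly one substitution."""
    for i, original in enumerate(kmer):
        for base in BASES:
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]


def lookup(index, kmer):
    """Return the species set for an exact k-mer hit, or an empty set."""
    return index.get(shard_key(kmer), {}).get(kmer, set())


def classify(index, read, permute=True):
    """Label a read by majority vote over its k-mer hits (hypothetical voting rule)."""
    votes = Counter()
    for kmer in kmers(read):
        hits = lookup(index, kmer)
        if not hits and permute:
            # No exact hit: try all 3 * K single-base substitutions.
            for variant in one_base_variants(kmer):
                hits = hits | lookup(index, variant)
        votes.update(hits)
    return votes.most_common(1)[0][0] if votes else None


# Toy reference sequences (made up for this example).
references = {
    "species_a": "ACGTACGTTTGACCA",
    "species_b": "GGGCCCTTTAAACGCG",
}
index = build_index(references)
print(classify(index, "ACGTACGTTTGA"))                  # exact substring of species_a
print(classify(index, "ACGTACCTTTGA", permute=False))   # one mismatch, exact matching only
print(classify(index, "ACGTACCTTTGA", permute=True))    # one mismatch, permutation matching
```

Permutation matching multiplies the number of lookups per unmatched k-mer by 3 × K, which is one plausible reason the abstract reports exact-matching and permutation-matching runs separately.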

Conclusions: SMART is the first scalable, efficient, and rapid metagenomics classification algorithm capable of matching against all the species and sequences present in the NCBI GenBank. It allows single-step classification of microorganisms as well as of the large plant, mammalian, or invertebrate genomes from which the metagenomic sample may have been derived.

Figures

Fig. 1. Sketch of search strategy in pseudocode
Fig. 2. Accuracy results of deep search on simulated datasets using the GenBank library. (a) Sensitivity with exact matching at species, genus, and class levels for simulated datasets (HiSeq, MiSeq, and simBA5). (b) Sensitivity with 1 base pair permutations during search. (c) Precision with exact matching. (d) Precision with 1 base pair permutations during search. Error bars represent 95% confidence intervals
Fig. 3. Computational scalability of SMART on a computing cluster using the GenBank library. (a, c, e, g) Overall execution time to complete processing of datasets with increasing number of computing nodes utilized. (b, d, f, h) Throughput measured in reads per minute processed with increasing number of computing nodes utilized. (i) Execution time of datasets with 12 nodes utilized and 1 base pair permutations during search. (j) Throughput of datasets with 12 nodes utilized and 1 base pair permutations during search
Fig. 4. Twenty most common genera in metagenomics samples. (a) Human conjunctival metagenomics whole genome sequencing samples and (b) human gut metagenomics whole genome sequencing samples, with total reads normalized by coverage of human genome
Fig. 5. Comparison of accuracy, throughput, and memory utilization among Kraken, CLARK, and SMART built from the same RefSeq sequences. (a, b, c) Sensitivity at the level of species, genus, and class for simulated datasets (HiSeq, MiSeq, and simBA5). (d, e, f) Precision at the level of species, genus, and class. Throughput (g) and memory utilization (h) of datasets with 16 parallel threads in the same computing environment
