MSC: a metagenomic sequence classification algorithm

Subrata Saha¹, Jethro Johnson², Soumitra Pal³, George M Weinstock², Sanguthevar Rajasekaran⁴

Affiliations

¹ Healthcare and Life Sciences Division, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA.
² The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA.
³ National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, USA.
⁴ Computer Science and Engineering Department, University of Connecticut, Storrs, CT, USA.

PMID: 30649204
PMCID: PMC6931357
DOI: 10.1093/bioinformatics/bty1071

Subrata Saha et al. Bioinformatics. 2019 Sep 1;35(17):2932-2940.

Abstract

Motivation: Metagenomics is the study of genetic material sampled directly from natural habitats. Highly parallel, low-cost next-generation sequencing gives it the potential to reveal previously hidden diversity of microscopic life. Conventional approaches align metagenomic reads to known reference genomes to identify the microbes in a sample. Because the collection of reference genomes is very large, this approach often requires high-end machines with large memory, which many researchers do not have. Alternative approaches are alignment-free: they predict the presence of a microbe from the unique k-mers in the microbial genomes. However, such approaches suffer from high false-positive rates because the value of k must be traded off against the available computational resources. In this article, we propose a highly efficient metagenomic sequence classification (MSC) algorithm that is a hybrid of the two approaches. Instead of aligning reads to full genomes, MSC aligns them to a set of carefully chosen, shorter and highly discriminating model sequences built from the unique k-mers of each reference sequence.
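The notion of k-mers that are unique to one reference can be illustrated with a minimal Python sketch. This is not the MSC implementation: the function name `unique_kmers`, the toy genomes and the choice k = 4 are assumptions made for illustration, reverse complements are ignored, and the step that builds model sequences from the unique k-mers is not described in the abstract and is therefore not shown.

```python
from collections import defaultdict


def unique_kmers(references, k):
    """Map each reference name to the set of k-mers found in no other reference.

    `references` is a dict of name -> DNA string (hypothetical toy input).
    """
    owners = defaultdict(set)  # k-mer -> names of references that contain it
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    unique = {name: set() for name in references}
    for kmer, names in owners.items():
        if len(names) == 1:  # seen in exactly one reference
            unique[next(iter(names))].add(kmer)
    return unique


# Toy example: the two sequences share only the 4-mer "ACGT".
refs = {"genome_A": "ACGTACGGA", "genome_B": "ACGTTTGCA"}
result = unique_kmers(refs, k=4)
```

Here `result["genome_A"]` contains the five 4-mers specific to the first toy genome, and the shared 4-mer `ACGT` appears in neither set. A dictionary of this kind grows with the number of distinct k-mers, which is the memory pressure the abstract alludes to when it mentions trading off k against computational resources.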

Results: Microbiome researchers generally want two things from a taxonomic classifier: (i) detection of prevalence, i.e. the taxa present in a sample, and (ii) estimates of their relative abundances. MSC is designed primarily to detect prevalence, and experimental results show that it is more effective and efficient than other state-of-the-art algorithms in terms of accuracy, memory and runtime. MSC also outputs an approximate estimate of the abundances.
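The two objectives can be made concrete with a small Python sketch that derives both from per-read taxon assignments. It is an illustration under stated assumptions, not how MSC computes its output: the function `summarize`, the `min_reads` threshold and the example labels are hypothetical, and abundance is taken as a simple fraction of assigned reads without any correction for genome length.

```python
from collections import Counter


def summarize(assignments, min_reads=1):
    """Return (prevalence, relative_abundance) from per-read taxon labels.

    `assignments` holds one label per read; None marks an unclassified read.
    `min_reads` is a hypothetical threshold for calling a taxon present.
    """
    counts = Counter(t for t in assignments if t is not None)
    kept = {t: n for t, n in counts.items() if n >= min_reads}
    total = sum(kept.values())
    prevalence = set(kept)  # objective (i): which taxa are present
    # Objective (ii): share of assigned reads per detected taxon.
    abundance = {t: n / total for t, n in kept.items()} if total else {}
    return prevalence, abundance


reads = ["E. coli", "E. coli", "B. subtilis", None, "E. coli"]
prevalence, abundance = summarize(reads)
```

With this toy input, both taxa are reported as present and `abundance` assigns 0.75 to `E. coli` and 0.25 to `B. subtilis`; the unclassified read is excluded from the denominator.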

Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from https://drive.google.com/open?id=1XirkAamkQ3ltWvI1W1igYQFusp9DHtVl.

Figures

**Fig. 1.**
Comparison of performance metrics (precision, recall, F1 score and ASS) for the algorithms tested (MSC, CLARK-S, CLARK, Kraken) on five datasets (SIM-1, SIM-2, SIM-3, MOCK-1 and MOCK-2). The actual abundance of the taxonomic levels present in MOCK-1 and MOCK-2 is not known, so ASS could not be computed for them.

**Fig. 2.**
Comparison of the time and memory used by the algorithms (MSC, CLARK-S, CLARK, Kraken) on five datasets (SIM-1, SIM-2, SIM-3, MOCK-1 and MOCK-2). Memory usage is shown on a log scale.
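Precision, recall and F1 score, as named in Fig. 1, have standard definitions that a short Python sketch can illustrate. The sketch assumes the metrics are computed over sets of predicted and true taxa; the page does not state how the authors computed them, and the function name and example labels are hypothetical. ASS is not defined on this page and is left out.

```python
def precision_recall_f1(predicted, truth):
    """Standard set-based precision, recall and F1 for detected taxa."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three taxa predicted, four truly present, two in common.
p, r, f1 = precision_recall_f1({"A", "B", "C"}, {"A", "B", "D", "E"})
```

In this toy case precision is 2/3, recall is 1/2 and F1 is 4/7.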

Grants and funding

P30 CA034196/CA/NCI NIH HHS/United States
