Bioinformatics. 2019 Sep 1;35(17):2932-2940.
doi: 10.1093/bioinformatics/bty1071.

MSC: a metagenomic sequence classification algorithm

Subrata Saha et al. Bioinformatics. 2019.

Abstract

Motivation: Metagenomics is the study of genetic material sampled directly from natural habitats. It has the potential to reveal previously hidden diversity of microscopic life, largely owing to highly parallel, low-cost next-generation sequencing technology. Conventional approaches align metagenomic reads onto known reference genomes to identify the microbes in a sample. Because such a collection of reference genomes is very large, this approach often needs high-end computing machines with large memory, which are often not available to researchers. Alternative approaches follow an alignment-free methodology in which the presence of a microbe is predicted from the unique k-mers present in the microbial genomes. However, such approaches suffer from a high number of false positives because the value of k must be traded off against the available computational resources. In this article, we propose a highly efficient metagenomic sequence classification (MSC) algorithm that is a hybrid of both approaches. Instead of aligning reads to the full genomes, MSC aligns reads onto a set of carefully chosen, shorter and highly discriminating model sequences built from the unique k-mers of each reference sequence.
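To make the alignment-free idea above concrete, here is a minimal sketch of classifying a read by the k-mers that are unique to one reference. It is not the MSC algorithm, which aligns reads to model sequences built from those k-mers. The function names, the toy sequences and the choice k = 5 are all assumptions made for illustration, and the sketch looks only at the forward strand.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def unique_kmers(references, k):
    """Map each k-mer found in exactly one reference to that reference's name."""
    owners = defaultdict(set)
    for name, seq in references.items():
        for km in kmers(seq, k):
            owners[km].add(name)
    # Keep only k-mers owned by a single reference; shared k-mers cannot discriminate.
    return {km: next(iter(names)) for km, names in owners.items() if len(names) == 1}


def classify_read(read, index, k):
    """Return the reference with the most unique k-mer hits, or None if there are no hits."""
    votes = defaultdict(int)
    for km in kmers(read, k):
        ref = index.get(km)
        if ref is not None:
            votes[ref] += 1
    if not votes:
        return None
    return max(votes, key=votes.get)


# Toy reference sequences, invented for illustration only.
references = {
    "genome_A": "ACGTACGTTTGACC",
    "genome_B": "ACGTACGAAAGGCT",
}
index = unique_kmers(references, k=5)

print(classify_read("CGTTTGAC", index, 5))  # read drawn from the part specific to genome_A
print(classify_read("ACGTACG", index, 5))   # read drawn from the prefix shared by both genomes
```

A small k keeps the index compact but leaves fewer k-mers unique to a single genome, which is one way to picture the trade-off between k and computational resources mentioned above.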

Results: Microbiome researchers are generally interested in two objectives of a taxonomic classifier: (i) to detect prevalence, i.e. the taxa present in a sample, and (ii) to estimate their relative abundances. MSC is primarily designed to detect prevalence, and experimental results show that it is more effective and efficient than other state-of-the-art algorithms in terms of accuracy, memory and runtime. In addition, MSC outputs an approximate estimate of the abundances.
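The two objectives can be illustrated with a short sketch. It uses the standard set-based definitions of precision, recall and F1 for prevalence, and the fraction of classified reads per taxon as a rough abundance estimate. Both are assumptions: the abstract does not state the exact formulas used in the paper, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def prevalence_scores(predicted, truth):
    """Set-based precision, recall and F1 for the taxa reported as present."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def relative_abundance(read_labels):
    """Fraction of classified reads per taxon; unclassified reads (None) are ignored."""
    counts = Counter(label for label in read_labels if label is not None)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()} if total else {}


# Toy labels, invented for illustration only.
precision, recall, f1 = prevalence_scores({"A", "B", "C"}, {"A", "B", "D", "E"})
abundance = relative_abundance(["A", "A", "B", None])
print(precision, recall, f1)
print(abundance)
```

Read fractions ignore differences in genome length, so they are at best an approximate abundance estimate.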

Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from https://drive.google.com/open?id=1XirkAamkQ3ltWvI1W1igYQFusp9DHtVl.

Figures

Fig. 1.
Comparison of performance metrics (Precision, Recall, F1 score and ASS) for the algorithms we experimented with (MSC, CLARK-S, CLARK, Kraken) on five datasets (SIM-1, SIM-2, SIM-3, MOCK-1 and MOCK-2). The actual abundances at the taxonomic levels present in MOCK-1 and MOCK-2 are not known, hence we could not compute ASS for them.
Fig. 2.
Comparison of the time and memory used by the algorithms (MSC, CLARK-S, CLARK, Kraken) on five datasets (SIM-1, SIM-2, SIM-3, MOCK-1 and MOCK-2). Note that memory usage is shown on a log scale.
