BMC Genomics. 2013;14 Suppl 8(Suppl 8):S4.

doi: 10.1186/1471-2164-14-S8-S4. Epub 2013 Dec 9.

MetaID: a novel method for identification and quantification of metagenomic samples

Satish M Srinivasan, Chittibabu Guda

PMID: 24564518
PMCID: PMC4042266
DOI: 10.1186/1471-2164-14-S8-S4

Satish M Srinivasan et al. BMC Genomics. 2013.

Abstract

Background: Advances in next-generation sequencing (NGS) technology have provided an opportunity to analyze and evaluate the rich microbial communities present in all natural environments. The shorter reads obtained from shotgun sequencing technology have paved the way for determining the taxonomic profile of a community by simply aligning the reads against the available reference genomes. While several computational methods are available for taxonomic profiling at the genus and species levels, none of them is effective at strain-level identification because variation becomes increasingly difficult to detect at that level. Here, we present MetaID, an alignment-free, n-gram-based approach that can accurately identify microorganisms at the strain level and estimate the abundance of each organism in a sample, given a metagenomic sequencing dataset.

Results: MetaID calculates the profile of unique and common n-grams from a dataset of 2,031 prokaryotic genomes and assigns a weight to each n-gram using a scoring function. This scoring function assigns higher weights to n-grams that appear in fewer genomes and lower weights to n-grams that appear in more genomes, which allows both unique and common n-grams to be used effectively for species identification. Our 10-fold cross-validation results on a simulated dataset show an accuracy of 99.7% for strain-level identification of the organisms in the gut microbiome. We also demonstrate that the model performs well even when only 25% or 50% of each genome sequence is used for modeling. In addition to identifying the species, our method can estimate the relative abundance of each species in the simulated metagenomic samples. The generic approach employed in this method can be applied to the accurate identification of a wide variety of microbial species (viruses, prokaryotes and eukaryotes) present in any environmental sample.
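The weighting idea described in the Results can be illustrated with a minimal sketch. The abstract does not give the actual scoring function, so the weight used here (the reciprocal of the number of genomes containing an n-gram) is only an assumption that satisfies the stated property of favoring n-grams found in fewer genomes. The function names, the toy sequences, and the read-count-based abundance estimate are likewise hypothetical and are not taken from MetaID; reverse complements and genome-length normalization are ignored for brevity.

```python
from collections import Counter, defaultdict


def ngrams(seq, n):
    """Yield every overlapping n-gram of a sequence."""
    for i in range(len(seq) - n + 1):
        yield seq[i:i + n]


def build_weights(genomes, n):
    """Map each n-gram to (weight, set of genomes that contain it).

    Assumed weighting (not the published one): 1 / number of genomes
    containing the n-gram, so unique n-grams get weight 1.0 and widely
    shared n-grams get smaller weights.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for gram in ngrams(seq, n):
            owners[gram].add(name)
    return {gram: (1.0 / len(names), names) for gram, names in owners.items()}


def classify_read(read, weights, n):
    """Return the genome with the highest summed weight, or None if no n-gram matches."""
    scores = Counter()
    for gram in ngrams(read, n):
        if gram in weights:
            weight, names = weights[gram]
            for name in names:
                scores[name] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)


def relative_abundance(reads, weights, n):
    """Estimate relative abundance (%) as each genome's share of the assigned reads."""
    labels = (classify_read(read, weights, n) for read in reads)
    assigned = Counter(label for label in labels if label is not None)
    total = sum(assigned.values())
    if total == 0:
        return {}
    return {name: 100.0 * count / total for name, count in assigned.items()}


# Toy example: two hypothetical "strains" that share a prefix and differ at the end.
toy_genomes = {"strain_A": "ACGTACGTTTGA", "strain_B": "ACGTACGTCCGA"}
toy_weights = build_weights(toy_genomes, n=4)
toy_reads = ["CGTTTGA", "CGTTTGA", "ACGTCCG", "GGGGGG"]
toy_abundance = relative_abundance(toy_reads, toy_weights, n=4)
```

In this sketch a read made up only of shared n-grams scores equally for every genome that contains them, while a single strain-specific n-gram is enough to tip the assignment, which is the intuition behind using common and unique n-grams together.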

Conclusions: The proposed scoring function and approach are able to accurately identify all taxa in any metagenomic community and estimate their abundance. The weights assigned to the common n-grams by our scoring function are calibrated to match reads down to the strain level. Our multipronged validation tests demonstrate that MetaID is sufficiently robust to accurately identify each taxon and estimate its abundance in any natural environment, even when using incomplete or partially sequenced genomes.

Figures

**Figure 1**
**A schematic diagram showing the methodology and scoring function**. SC_COL_UM146 - *Escherichia coli* UM146, STA_NAS_DSM_44728 - *Stackebrandtia nassauensis* DSM 44728, RHO_PAL_BisB18 - *Rhodopseudomonas palustris* BisB18, LAC_FER_CECT_5716 - *Lactobacillus fermentum* CECT 5716, NOS_PUN_PCC_73102 - *Nostoc punctiforme* PCC 73102

**Figure 2**
**Number of common and unique n-grams as a function of the size of n**. The n-gram size is varied from 9 to 18 in steps of 3.

**Figure 3**
**Comparison of the accuracies across 2,031 bacterial genomes using both the common and unique n-grams (n = 12) (Model 2) and only unique n-grams (n = 12) (Model 1)**. Scale on the second Y-axis denotes the untransformed accuracies.

**Figure 4**
**Accuracies of different models (α::β) using 1, 3, 5 and 7% of the total n-grams from each genome**. For example, 75:100 indicates that the model was built using only 75% of each genome from the reference set and validation was performed using 100% of the genome. Scale on the second Y-axis denotes the untransformed accuracies.

**Figure 5**
**Comparison of the original and estimated abundances (relative percentage) for 100 microbial genomes in the mock-staggered dataset**.

Cited by

WGSQuikr: fast whole-genome shotgun metagenomic classification.
Koslicki D, Foucart S, Rosen G. PLoS One. 2014 Mar 13;9(3):e91784. doi: 10.1371/journal.pone.0091784. eCollection 2014. PMID: 24626336. Free PMC article.
StrainIQ: A Novel n-Gram-Based Method for Taxonomic Profiling of Human Microbiota at the Strain Level.
Pandey S, Avuthu N, Guda C. Genes (Basel). 2023 Aug 18;14(8):1647. doi: 10.3390/genes14081647. PMID: 37628698. Free PMC article.
Considerations for Optimization of High-Throughput Sequencing Bioinformatics Pipelines for Virus Detection.
Lambert C, Braxton C, Charlebois RL, Deyati A, Duncan P, La Neve F, Malicki HD, Ribrioux S, Rozelle DK, Michaels B, Sun W, Yang Z, Khan AS. Viruses. 2018 Sep 27;10(10):528. doi: 10.3390/v10100528. PMID: 30262776. Free PMC article.
Detection of somatic mutations in tumors using unaligned clonal sequencing data.
Sutton KM, Crinnion LA, Wallace D, Harrison S, Roberts P, Watson CM, Markham AF, Bonthron DT, Quirke P, Carr IM. Lab Invest. 2014 Oct;94(10):1173-83. doi: 10.1038/labinvest.2014.96. Epub 2014 Jul 28. PMID: 25068661.
Interdisciplinary dialogue for education, collaboration, and innovation: intelligent Biology and Medicine in and beyond 2013.
Zhang B, Huang Y, McDermott JE, Posey RH, Xu H, Zhao Z. BMC Genomics. 2013;14 Suppl 8(Suppl 8):S1. doi: 10.1186/1471-2164-14-S8-S1. Epub 2013 Dec 9. PMID: 24564388. Free PMC article.

Grants and funding

1R01GM086533/GM/NIGMS NIH HHS/United States
