Biol Direct. 2019 Aug 1;14(1):12.

doi: 10.1186/s13062-019-0242-0.

Massive metagenomic data analysis using abundance-based machine learning

Zachary N Harris¹, Eliza Dhungel², Matthew Mosior², Tae-Hyuk Ahn³ ⁴

Affiliations

¹ Department of Biology, Saint Louis University, Saint Louis, MO, 63103, USA.
² Program in Bioinformatics and Computational Biology, Saint Louis University, Saint Louis, MO, 63103, USA.
³ Program in Bioinformatics and Computational Biology, Saint Louis University, Saint Louis, MO, 63103, USA. ted.ahn@slu.edu.
⁴ Department of Computer Science, Saint Louis University, Saint Louis, MO, 63103, USA. ted.ahn@slu.edu.

PMID: 31370905
PMCID: PMC6676585
DOI: 10.1186/s13062-019-0242-0

Zachary N Harris et al. Biol Direct. 2019.

Abstract

Background: Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments, and it is widely used to survey the communities of microbial organisms that live in diverse ecosystems. To understand the metagenomic profile of one of the densest interaction spaces for millions of people, the public transit system, the MetaSUB International Consortium has collected and sequenced metagenomes from subways of different cities across the world. In collaboration with CAMDA, MetaSUB has made the metagenomic samples from these cities available for an open data analysis challenge that includes, but is not limited to, the identification of unknown samples.

Results: To distinguish the metagenomic profiles of different cities and to predict the origin of unknown samples from those profiles, two machine learning approaches are proposed: one is a read-based method that uses the taxonomic profile of each sample for prediction, and the other is a reduced representation assembly-based method. Among the machine learning techniques tested, random forest showed promising results as a suitable classifier for both approaches. Random forest models developed from read-based taxonomic profiling achieved an accuracy of 91%, with a 95% confidence interval between 80 and 93%. The assembly-based random forest model also reached 90% accuracy. However, both models achieved roughly the same accuracy on the test set, where they both failed to predict the most abundant label.
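The accuracy and confidence interval reported above can be illustrated with a short worked example. This is a minimal sketch, not the authors' procedure: the abstract does not say how the interval was computed, so the Wilson score interval is used here as an assumption, and the counts (`correct`, `total`) are hypothetical values chosen only to give 91% accuracy.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion, here a classification accuracy.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts (NOT from the paper): 91 of 100 test samples correct.
correct, total = 91, 100
low, high = wilson_interval(correct, total)
print(f"accuracy = {correct / total:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The interval narrows as the number of held-out samples grows, which is one reason a single accuracy figure should be read together with the size of the test partition.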

Conclusion: Our results suggest that both read-based and assembly-based approaches are powerful tools for the analysis of metagenomics data. Moreover, our results suggest that reduced representation assembly-based methods are able to simultaneously provide high-accuracy prediction on available data. Overall, we show that metagenomic samples can be traced back to their location through careful generation of features from the composition of microbes and the use of existing machine learning algorithms. The proposed approaches show high prediction accuracy, but their output requires careful inspection before any decisions are made, because of sample noise or complexity.

Reviewers: This article was reviewed by Eugene V. Koonin, Jing Zhou and Serghei Mangul.

Keywords: CAMDA; Machine learning; MetaSUB; Metagenomics; Taxonomy profiling.

Conflict of interest statement

The authors declare they have no competing interests.

Figures

**Fig. 1**
The analysis pipeline presented in this paper. Here we show the two-pronged approach used in this analysis. The data were analyzed under a read-based and an assembly-based approach. In the read-based approach, we used taxonomic profiling to generate machine learning features for city prediction. In the assembly-based approach, we used two different reduced representation paradigms to generate features for machine learning.

**Fig. 2**
LDA plots of the read-based approach. a LDA with all species. b LDA with rare species (present in < 5% of samples) removed

**Fig. 3**
Confusion matrices for the read-based approach. a Confusion matrix for the random forest model trained on a random 70/30 train/test data partition. b Confusion matrix for the random forest model trained on a random 70/30 train/test data partition of the rare-species-removed data set

**Fig. 4**
LDA of the assembly-based approach. a LDA of the random paired-end subset assembly (PP). b LDA of the left-only subset assembly (PL)

**Fig. 5**
Confusion matrices for the assembly-based approach. a Confusion matrix for the random forest model trained on a random 70/30 train/test data partition in the random paired-end subset assembly. b Confusion matrix for the random forest model trained on a random 70/30 train/test data partition of the left-only assembly
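The captions above mention three data-handling steps: removing rare species (present in < 5% of samples), a random 70/30 train/test partition, and a confusion matrix. The sketch below illustrates these steps with the Python standard library only. It is an illustration under stated assumptions, not the authors' code: the function names, the toy abundance matrix, and the labels `city_A` and `city_B` are all hypothetical, and no classifier is trained here.

```python
import random
from collections import Counter


def drop_rare_features(rows, min_prevalence=0.05):
    """Keep feature columns that are non-zero in at least min_prevalence of samples."""
    n_samples = len(rows)
    n_features = len(rows[0])
    keep = [
        j for j in range(n_features)
        if sum(1 for row in rows if row[j] > 0) / n_samples >= min_prevalence
    ]
    return [[row[j] for j in keep] for row in rows], keep


def split_70_30(n_samples, seed=0, train_fraction=0.7):
    """Return shuffled (train, test) index lists for a random partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]


def confusion_matrix(true_labels, predicted_labels):
    """Rows are true labels, columns are predicted labels, both in sorted order."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    return labels, [[counts[(t, p)] for p in labels] for t in labels]


# Toy abundance matrix (hypothetical): 40 samples x 3 species.
# Species 0 is in every sample, species 1 in 2 samples (5%), species 2 in 1 (2.5%).
rows = [[1.0, 1.0 if i < 2 else 0.0, 1.0 if i == 0 else 0.0] for i in range(40)]
filtered, kept = drop_rare_features(rows)
train_idx, test_idx = split_70_30(len(rows))

# Hypothetical true and predicted city labels for four held-out samples.
y_true = ["city_A", "city_A", "city_B", "city_B"]
y_pred = ["city_A", "city_B", "city_B", "city_B"]
labels, matrix = confusion_matrix(y_true, y_pred)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
print(kept, len(train_idx), len(test_idx), labels, matrix, accuracy)
```

Reading the confusion matrix row by row shows which true label a model misses, which matters when overall accuracy is high but one class is systematically mispredicted.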
