BMC Bioinformatics. 2024 Jan 15;25(1):26.

doi: 10.1186/s12859-024-05639-3.

Methodology for biomarker discovery with reproducibility in microbiome data using machine learning

David Rojas-Velazquez¹ ², Sarah Kidwai³, Aletta D Kraneveld³ ⁴, Alberto Tonda⁵, Daniel Oberski⁶, Johan Garssen³ ⁷, Alejandro Lopez-Rincon³ ⁶

Affiliations

¹ Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, University of Utrecht, Utrecht, The Netherlands. e.d.rojasvelazquez@uu.nl.
² Department of Data Science, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands. e.d.rojasvelazquez@uu.nl.
³ Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, University of Utrecht, Utrecht, The Netherlands.
⁴ Department of Neuroscience, Faculty of Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.
⁵ UMR 518 MIA-PS, INRAE, Institut des Systèmes Complexes de Paris, Île-de-France (ISC-PIF) - UAR 3611 CNRS, Université Paris-Saclay, Paris, France.
⁶ Department of Data Science, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands.
⁷ Global Centre of Excellence Immunology, Danone Nutricia Research, Utrecht, The Netherlands.

PMID: 38225565
PMCID: PMC10789030
DOI: 10.1186/s12859-024-05639-3

David Rojas-Velazquez et al. BMC Bioinformatics. 2024.

Abstract

Background: In recent years, human microbiome studies have received increasing attention because the field is considered a potential source of clinical applications. With advances in omics technologies and AI, research on discovering potential biomarkers in the human microbiome with machine learning tools has produced positive outcomes. Despite these promising results, such studies still face several issues, including datasets with small numbers of samples, inconsistent results, and a lack of uniform processing and methodologies; these and other factors contribute to the lack of reproducibility in biomedical research. In this work, we propose a methodology that combines the DADA2 pipeline for 16S rRNA sequence processing with Recursive Ensemble Feature Selection (REFS) across multiple datasets to increase reproducibility and obtain robust and reliable results in biomedical research.
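The idea of recursive, ensemble-based feature selection can be illustrated with a minimal sketch. This is not the authors' REFS implementation: the choice of three scikit-learn models, the averaging of normalized importances, the 20% drop fraction, the function names, and the synthetic data are all assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def ensemble_ranking(X, y, seed=0):
    """Average normalized feature importance over a small ensemble (hypothetical choice of models)."""
    models = [
        RandomForestClassifier(n_estimators=50, random_state=seed),
        ExtraTreesClassifier(n_estimators=50, random_state=seed),
        LogisticRegression(max_iter=1000),
    ]
    scores = np.zeros(X.shape[1])
    for model in models:
        model.fit(X, y)
        if hasattr(model, "feature_importances_"):
            imp = model.feature_importances_
        else:
            # Linear model: use absolute coefficients as importance.
            imp = np.abs(model.coef_).ravel()
        total = imp.sum()
        scores += imp / total if total > 0 else imp
    return scores / len(models)


def recursive_ensemble_selection(X, y, n_keep=5, drop_frac=0.2, seed=0):
    """Repeatedly drop the lowest-ranked features until n_keep remain."""
    X = StandardScaler().fit_transform(X)
    kept = np.arange(X.shape[1])
    while kept.size > n_keep:
        scores = ensemble_ranking(X[:, kept], y, seed)
        n_drop = max(1, int(drop_frac * kept.size))
        n_drop = min(n_drop, kept.size - n_keep)  # never go below n_keep
        order = np.argsort(scores)  # ascending, weakest features first
        kept = np.sort(kept[order[n_drop:]])
    return kept


# Synthetic stand-in for a samples-by-taxa abundance table (assumption).
X, y = make_classification(n_samples=120, n_features=40, n_informative=5,
                           n_redundant=0, random_state=0)
selected = recursive_ensemble_selection(X, y, n_keep=5)
print("Selected feature indices:", selected)
```

In a real analysis the feature matrix would come from the sequence-processing step, and the selected signature would then be tested on independent datasets rather than on the data used for selection.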

Results: Three experiments were performed, analyzing microbiome data from patients/cases in Inflammatory Bowel Disease (IBD), Autism Spectrum Disorder (ASD), and Type 2 Diabetes (T2D). In each experiment, we found a biomarker signature in one dataset and applied it to two other datasets as further validation. The effectiveness of the proposed methodology was compared with other feature selection methods, such as K-Best with F-score, and with random selection as a baseline. The Area Under the Curve (AUC) was used as a measure of diagnostic accuracy and as the metric for comparing the proposed methodology with the other feature selection methods. Additionally, the Matthews Correlation Coefficient (MCC) was used to evaluate the performance of the methodology and to compare it with the other feature selection methods.
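The comparison described above, K-Best with F-score against a random-selection baseline scored by AUC and MCC, can be sketched with scikit-learn. This is only an illustration under stated assumptions: the synthetic data, the logistic regression classifier, the single train/test split, and k = 5 are not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a microbiome feature table (assumption).
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def evaluate(columns):
    """Fit a classifier on the chosen columns and return (AUC, MCC) on the test split."""
    clf = LogisticRegression(max_iter=1000).fit(X_train[:, columns], y_train)
    proba = clf.predict_proba(X_test[:, columns])[:, 1]
    pred = clf.predict(X_test[:, columns])
    return roc_auc_score(y_test, proba), matthews_corrcoef(y_test, pred)


k = 5  # size of the feature signature (hypothetical)

# K-Best with the ANOVA F-score, fitted on the training split only.
kbest = SelectKBest(score_func=f_classif, k=k).fit(X_train, y_train)
kbest_cols = kbest.get_support(indices=True)

# Random-selection baseline: k distinct feature indices drawn uniformly.
rng = np.random.default_rng(0)
random_cols = rng.choice(X.shape[1], size=k, replace=False)

results = {"k_best": evaluate(kbest_cols), "random": evaluate(random_cols)}
for name, (auc, mcc) in results.items():
    print(f"{name}: AUC={auc:.3f}, MCC={mcc:.3f}")
```

AUC is computed from predicted probabilities and MCC from hard class labels, so the two metrics can rank methods differently, which is one reason to report both.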

Conclusions: We developed a methodology for reproducible biomarker discovery in 16S rRNA microbiome sequence analysis, addressing issues related to data dimensionality, inconsistent results, and validation across independent datasets. The findings from the three experiments, across 9 different datasets, show that the proposed methodology achieved higher accuracy than the other feature selection methods. This methodology is a first approach to increasing reproducibility and providing robust and reliable results.

Keywords: Machine learning; Microbiome; Reproducibility.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
a The minimum number of features needed to obtain the highest accuracy, b Plot of the classifier with the best performance in the validation process for the discovery dataset David et al., c Plot of the classifier with the best performance in the validation process for PRJNA589343, and d Plot of the classifier with the best performance in the validation process for PRJNA578223

**Fig. 2**
a The minimum number of features needed to obtain the highest accuracy, b Plot of the classifier with the best performance in the validation process for the discovery dataset PRJEB2150, c Plot of the classifier with the best performance in the validation process for DRA00609, and d Plot of the classifier with the best performance in the validation process for PRJNA684584

**Fig. 3**
a The minimum number of features needed to obtain the highest accuracy, b Plot of the classifier with the best performance in the validation process for the discovery dataset PRJNA325931, c Plot of the classifier with the best performance in the validation process for PRJNA554535, and d Plot of the classifier with the best performance in the validation process for PRJEB53017

**Fig. 4**
Overview of the proposed methodology. The upper part shows the workflow for the dataset selection criteria, raw data processing, and feature selection phases. The lower part shows the testing phase workflow

**Fig. 5**
Overview of the datasets used for each experiment
