Machine learning approaches in microbiome research: challenges and best practices

Georgios Papoutsoglou^{1,2}, Sonia Tarazona³, Marta B Lopes^{4,5}, Thomas Klammsteiner^{6,7}, Eliana Ibrahimi⁸, Julia Eckenberger^{9,10}, Pierfrancesco Novielli^{11,12}, Alberto Tonda^{13,14}, Andrea Simeon¹⁵, Rajesh Shigdel¹⁶, Stéphane Béreux^{17,18}, Giacomo Vitali¹⁷, Sabina Tangaro^{11,12}, Leo Lahti¹⁹, Andriy Temko²⁰, Marcus J Claesson^{9,10}, Magali Berland¹⁷

Affiliations

¹ Department of Computer Science, University of Crete, Heraklion, Greece.
² JADBio Gnosis DA S.A., Science and Technology Park of Crete, Heraklion, Greece.
³ Department of Applied Statistics and Operations Research and Quality, Polytechnic University of Valencia, Valencia, Spain.
⁴ Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal.
⁵ Research and Development Unit for Mechanical and Industrial Engineering (UNIDEMI), Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal.
⁶ Department of Ecology, Universität Innsbruck, Innsbruck, Austria.
⁷ Department of Microbiology, Universität Innsbruck, Innsbruck, Austria.
⁸ Department of Biology, University of Tirana, Tirana, Albania.
⁹ School of Microbiology, University College Cork, Cork, Ireland.
¹⁰ APC Microbiome Ireland, Cork, Ireland.
¹¹ Department of Soil, Plant, and Food Sciences, University of Bari Aldo Moro, Bari, Italy.
¹² National Institute for Nuclear Physics, Bari Division, Bari, Italy.
¹³ UMR 518 MIA-PS, INRAE, Paris-Saclay University, Palaiseau, France.
¹⁴ Complex Systems Institute of Paris Ile-de-France (ISC-PIF) - UAR 3611 CNRS, Paris, France.
¹⁵ BioSense Institute, University of Novi Sad, Novi Sad, Serbia.
¹⁶ Department of Clinical Science, University of Bergen, Bergen, Norway.
¹⁷ MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France.
¹⁸ MaIAGE, INRAE, Paris-Saclay University, Jouy-en-Josas, France.
¹⁹ Department of Computing, University of Turku, Turku, Finland.
²⁰ Department of Electrical and Electronic Engineering, University College Cork, Cork, Ireland.

PMID: 37808286
PMCID: PMC10556866
DOI: 10.3389/fmicb.2023.1261889

Review

Front Microbiol. 2023 Sep 22:14:1261889. doi: 10.3389/fmicb.2023.1261889. eCollection 2023.

Abstract

Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection and on pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer (CRC) patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that compositional transformations and filtering methods applied during data preprocessing do not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.

Keywords: AutoML; colorectal cancer; feature selection; machine learning methods; microbiome data analysis; model selection; predictive modeling; preprocessing.

Conflict of interest statement

GP was directly affiliated with JADBio—Gnosis DA, S.A., which offers the JADBio service commercially. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
The typical process from data preparation to predictive model building, highlighting the methods to consider during each stage.

**Figure 2**
Sensitivity and specificity of the two best-performing ML models (GBM and RF) over 100 data split repetitions on the CRC dataset, with a range of prevalence filter thresholds (shades of blue) or no prevalence filter (gray).
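As an illustration of the prevalence filtering this figure refers to, here is a minimal sketch in plain Python. It assumes that prevalence means the fraction of samples in which a feature has non-zero abundance; the function name, the toy table, and the thresholds are hypothetical and are not taken from the article.

```python
def prevalence_filter(table, min_prevalence):
    """Return indices of features present in at least `min_prevalence` of samples.

    `table` is a list of samples, each a list of non-negative abundances
    (one value per feature). A feature counts as present when its value is > 0.
    """
    n_samples = len(table)
    n_features = len(table[0])
    kept = []
    for j in range(n_features):
        present = sum(1 for row in table if row[j] > 0)
        if present / n_samples >= min_prevalence:
            kept.append(j)
    return kept


# Hypothetical toy table: 4 samples (rows) x 3 species (columns).
toy = [
    [10, 0, 3],
    [5, 0, 0],
    [8, 2, 0],
    [1, 0, 0],
]

# Species 0 is present in all samples; species 1 and 2 in one sample each.
kept_strict = prevalence_filter(toy, 0.5)
kept_loose = prevalence_filter(toy, 0.25)
```

Sweeping `min_prevalence` over several values and refitting the classifier each time is one way to produce a comparison like the one the figure describes.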

**Figure 3**
Sensitivity and specificity of 7 ML models across 100 data split repetitions on the CRC dataset, with a centered log-ratio (CLR) transformation applied beforehand (blue) or no transformation (red).
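For readers unfamiliar with the CLR transformation, the sketch below shows the standard definition: each log-abundance minus the mean log-abundance of the sample. The pseudocount used to handle zeros is an assumption made for this example; the record does not say how zeros were treated in the study.

```python
import math


def clr(sample, pseudocount=1.0):
    """Centered log-ratio transform of one sample.

    Each value becomes log(x + pseudocount) minus the mean of those logs,
    so the transformed values of a sample always sum to zero.
    The pseudocount is an illustrative choice for handling zero counts.
    """
    logs = [math.log(x + pseudocount) for x in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical counts for three species in one sample, including a zero.
transformed = clr([3, 0, 7])

# Without a pseudocount, CLR is unchanged when the whole sample is rescaled,
# which is why it suits relative (compositional) abundance data.
a = clr([1, 2, 4], pseudocount=0.0)
b = clr([10, 20, 40], pseudocount=0.0)
```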

**Figure 4**
**(A)** Features detected by feature selection that generate classification bias. **(B)** Comparative evaluation of tested pipelines.

**Figure 5**
**(A)** Train (blue) and test (green) AUCROC after analyzing the revised dataset. ROC Curve considering CRC patients (P) as the positive class. **(B)** Out-of-sample (training) predictions for Healthy (H) and CRC patients class (P). **(C)** Feature importance defined as the percentage drop in predictive performance when the feature is removed from the model. Gray lines indicate 95% confidence intervals. **(D)** Supervised PCA on the selected features depicts the model performance in separating the two classes and also outlier samples.

**Figure 6**
**(A)** ROC of the best interpretable model. **(B)** Contribution of each species to the prediction of the logistic regression model, the best interpretable model. Feature interpretation using ICE plots, with an example of **(C)** a risk factor (the higher the abundance, the higher the probability of being in the P (Patients) class) and **(D)** a protective factor (the higher the abundance, the lower the probability of being in the P (Patients) class).
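The idea behind an ICE plot can be sketched as follows: for each sample, one feature is swept over a grid of values while the other features stay fixed, and the predicted probability is recorded at each grid point. The logistic regression coefficients, samples, and grid below are hypothetical values chosen for illustration; they are not the model fitted in the article.

```python
import math


def predict_proba(weights, intercept, x):
    """Probability of the positive class under a logistic regression model."""
    z = intercept + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def ice_curves(weights, intercept, samples, feature, grid):
    """One ICE curve per sample: vary `feature` over `grid`, hold the rest fixed."""
    curves = []
    for x in samples:
        curve = []
        for value in grid:
            x_mod = list(x)          # copy so the original sample is untouched
            x_mod[feature] = value
            curve.append(predict_proba(weights, intercept, x_mod))
        curves.append(curve)
    return curves


# Hypothetical model: feature 0 has a positive weight, feature 1 a negative one.
weights = [1.2, -0.8]
intercept = -0.5
samples = [[0.0, 1.0], [2.0, -1.0]]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]

risk = ice_curves(weights, intercept, samples, 0, grid)        # rising curves
protective = ice_curves(weights, intercept, samples, 1, grid)  # falling curves
```

With a linear model such as logistic regression, every curve for a given feature moves in the same direction, set by the sign of its coefficient. This matches the risk-factor and protective-factor reading in panels (C) and (D).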
