Review
Front Microbiol. 2023 Sep 22;14:1261889.
doi: 10.3389/fmicb.2023.1261889. eCollection 2023.

Machine learning approaches in microbiome research: challenges and best practices

Georgios Papoutsoglou et al. Front Microbiol.

Abstract

Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection and on pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.

Keywords: AutoML; colorectal cancer; feature selection; machine learning methods; microbiome data analysis; model selection; predictive modeling; preprocessing.

Conflict of interest statement

GP was directly affiliated with JADBio—Gnosis DA, S.A., which offers the JADBio service commercially. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
The typical process from data preparation to predictive model building, highlighting the methods to consider during each stage.
Figure 2
Sensitivity and specificity of the two best-performing ML models (GBM and RF) across 100 data split repetitions on the CRC dataset, with a range of prevalence filters (shades of blue) or no prevalence filter (gray).
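
As an illustration of the prevalence filtering referred to in this caption, the following is a minimal sketch and not the authors' implementation. It assumes that prevalence means the fraction of samples in which a feature has non-zero abundance; the function name `prevalence_filter`, the threshold, and the toy species table are all hypothetical.

```python
from typing import Dict, List


def prevalence_filter(
    abundance: Dict[str, List[float]], min_prevalence: float
) -> Dict[str, List[float]]:
    """Keep features that are non-zero in at least `min_prevalence` of samples.

    `abundance` maps a feature name to its per-sample abundances.
    """
    kept = {}
    for feature, values in abundance.items():
        # Prevalence: share of samples where the feature is detected at all.
        prevalence = sum(1 for v in values if v > 0) / len(values)
        if prevalence >= min_prevalence:
            kept[feature] = values
    return kept


# Toy table (hypothetical data): three species measured in four samples.
toy = {
    "species_a": [0.10, 0.00, 0.25, 0.05],  # detected in 3 of 4 samples
    "species_b": [0.00, 0.00, 0.00, 0.02],  # detected in 1 of 4 samples
    "species_c": [0.90, 1.00, 0.75, 0.93],  # detected in all samples
}
filtered = prevalence_filter(toy, min_prevalence=0.5)
print(sorted(filtered))
```

In a predictive workflow, the set of retained features would normally be determined on the training split only and then applied unchanged to the test split, so that the filter does not leak information.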
Figure 3
Sensitivity and specificity of 7 ML models across 100 data split repetitions on the CRC dataset, with a centered log-ratio (CLR) transformation applied beforehand (blue) or no transformation (red).
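
For readers unfamiliar with the CLR transformation named in this caption, here is a minimal sketch of the standard definition: each value's logarithm minus the mean logarithm of the sample. The pseudocount used to handle zeros is an assumption made for this example, and the function name `clr` is hypothetical; the paper's exact zero-handling is not stated here.

```python
import math
from typing import List


def clr(sample: List[float], pseudocount: float = 1e-6) -> List[float]:
    """Centered log-ratio transform of one sample's abundances.

    The pseudocount (an assumption of this sketch) avoids log(0) for
    features that were not detected in the sample.
    """
    logs = [math.log(v + pseudocount) for v in sample]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# CLR values sum to zero and do not depend on the sample's total:
# rescaling all abundances by a constant leaves the result unchanged.
raw = clr([1.0, 2.0, 4.0], pseudocount=0.0)
scaled = clr([10.0, 20.0, 40.0], pseudocount=0.0)
print(raw)
print(scaled)
```

The invariance to rescaling is what makes the transform suited to compositional data such as relative abundances, where only ratios between features carry information.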
Figure 4
(A) Features detected by feature selection that generate classification bias. (B) Comparative evaluation of tested pipelines.
Figure 5
(A) Train (blue) and test (green) AUCROC after analyzing the revised dataset. ROC curve considering CRC patients (P) as the positive class. (B) Out-of-sample (training) predictions for the Healthy (H) and CRC patients (P) classes. (C) Feature importance, defined as the percentage drop in predictive performance when the feature is removed from the model. Gray lines indicate 95% confidence intervals. (D) Supervised PCA on the selected features, showing how well the model separates the two classes and highlighting outlier samples.
Figure 6
(A) ROC of the best interpretable model. (B) Contribution of each species to the prediction from logistic regression as the best interpretable model. Feature interpretation using ICE plots, with an example of (C) a risk factor (the higher the abundance, the higher the probability of being in the P (Patients) class) and (D) a protective factor (the higher the abundance, the lower the probability of being in the P (Patients) class).
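
The ICE curves described in this caption can be sketched as follows: for a single sample, one feature is swept over a grid while all other features are held fixed, and the predicted probability is recorded at each grid point. This is only an illustration. The logistic model, its coefficients, the sample, and the names `predict_proba` and `ice_curve` are made up for the example and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence


def predict_proba(x: Sequence[float]) -> float:
    """Hypothetical logistic model returning P(class = Patients).

    The weights and bias are invented for illustration only.
    """
    weights = [2.0, -1.5]  # feature 0 raises the risk, feature 1 lowers it
    bias = -0.5
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def ice_curve(
    model: Callable[[Sequence[float]], float],
    sample: Sequence[float],
    feature_index: int,
    grid: Sequence[float],
) -> List[float]:
    """Predicted probability for one sample as one feature is varied."""
    curve = []
    for value in grid:
        modified = list(sample)          # copy so the sample is untouched
        modified[feature_index] = value  # overwrite only the swept feature
        curve.append(model(modified))
    return curve


sample = [0.2, 0.4]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
risk = ice_curve(predict_proba, sample, 0, grid)        # rises with abundance
protective = ice_curve(predict_proba, sample, 1, grid)  # falls with abundance
print(risk)
print(protective)
```

A rising curve corresponds to the risk-factor pattern in panel (C) and a falling curve to the protective-factor pattern in panel (D); a full ICE plot draws one such curve per sample.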

