Explainable artificial intelligence for microbiome data analysis in colorectal cancer biomarker identification

Pierfrancesco Novielli¹,², Donato Romano¹,², Michele Magarelli¹, Pierpaolo Di Bitonto¹, Domenico Diacono², Annalisa Chiatante¹, Giuseppe Lopalco³, Daniele Sabella³, Vincenzo Venerito³, Pasquale Filannino¹, Roberto Bellotti²,⁴, Maria De Angelis¹, Florenzo Iannone³, Sabina Tangaro¹,²

Affiliations

¹ Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Bari, Italy.
² Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Bari, Italy.
³ Dipartimento di Medicina di Precisione e Rigenerativa e Area Jonica, Università degli Studi di Bari Aldo Moro, Bari, Italy.
⁴ Dipartimento Interateneo di Fisica M. Merlin, Università degli Studi di Bari Aldo Moro, Bari, Italy.

PMID: 38426064
PMCID: PMC10901987
DOI: 10.3389/fmicb.2024.1348974

Pierfrancesco Novielli et al. Front Microbiol. 2024 Feb 15:15:1348974. doi: 10.3389/fmicb.2024.1348974. eCollection 2024.

Abstract

Background: Colorectal cancer (CRC) is a tumor caused by the uncontrolled growth of cells in the mucosa lining the last part of the intestine. Emerging evidence underscores an association between CRC and gut microbiome dysbiosis. The high mortality rate of this cancer makes new early diagnostic methods necessary. Machine learning (ML) techniques offer a way to evaluate the interaction between intestinal microbiota and host physiology, and explainable artificial intelligence (XAI) makes it possible to evaluate the individual contributions of microbial taxonomic markers for each subject. Our work implements the SHapley Additive exPlanations (SHAP) algorithm to identify, for each subject, which parameters are important in the context of CRC.

Results: The proposed study aimed to implement an explainable artificial intelligence framework that uses both gut microbiota data and demographic information to distinguish control subjects from those with CRC. Our analysis revealed an association between gut microbiota and this disease. We compared three machine learning algorithms, and the Random Forest (RF) algorithm emerged as the best classifier, with a precision of 0.729 ± 0.038 and an area under the Precision-Recall curve of 0.668 ± 0.016. Additionally, SHAP analysis highlighted the most crucial variables in the model's decision-making, facilitating the identification of specific bacteria linked to CRC. Our results confirmed the role of certain bacteria, such as Fusobacterium, Peptostreptococcus, and Parvimonas, whose abundance appears notably associated with the disease, as well as bacteria whose presence is linked to a non-diseased state.
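The two metrics reported above, precision and area under the Precision-Recall curve, can be illustrated with a minimal pure-Python sketch. This is not the authors' pipeline: the labels, scores, and per-run values below are made up for illustration, the helper names (`precision`, `average_precision`, `summarize`) are hypothetical, and the area is computed as step-wise average precision, assuming no tied scores.

```python
from statistics import mean, stdev


def precision(y_true, y_pred):
    """Fraction of predicted positives (label 1) that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Subjects are ranked by decreasing score; the precision observed at
    each true positive is averaged over all positives. Ties in the
    scores are not handled specially (an assumption of this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            area += tp / rank  # precision at this recall step
    return area / n_pos if n_pos else 0.0


def summarize(values):
    """Mean and sample standard deviation across repeated model runs."""
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Illustrative data only: 1 = CRC, 0 = control.
    y_true = [1, 0, 1, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print("precision:", precision(y_true, y_pred))
    print("average precision:", average_precision(y_true, scores))
    # Hypothetical per-run precisions, reported as mean ± std.
    m, s = summarize([0.70, 0.75, 0.80])
    print(f"{m:.3f} ± {s:.3f}")
```

Reporting the mean and standard deviation over repeated runs, as in the "± " figures above, gives a sense of how stable a classifier is across data splits.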

Discussion: These findings emphasize the potential of leveraging gut microbiota data within an explainable AI framework for CRC classification. The significant association observed aligns with existing knowledge. The precision exhibited by the RF algorithm reinforces its suitability for such classification tasks. The SHAP analysis not only enhanced interpretability but also identified specific bacteria crucial in CRC determination. This approach opens avenues for targeted interventions based on microbial signatures. Further exploration is warranted to deepen our understanding of the intricate interplay between microbiota and health, providing insights for refined diagnostic and therapeutic strategies.

Keywords: biomarker identification; colorectal cancer; explainable artificial intelligence; machine learning; microbiome; microbiota; precision medicine.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Figures

**Figure 1**
Boxplots of **(A)** age and **(B)** BMI for the two classes (patients and controls). The symbol \* denotes the significance level determined by the Mann-Whitney U rank test for comparing two distributions: \*\* stands for a p-value less than or equal to 0.01, and \*\*\*\* stands for a p-value less than or equal to 0.0001.
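The Mann-Whitney U rank test named in this caption can be sketched in pure Python as follows. This is a simplified illustration, assuming a two-sided test with the normal approximation and no tie or continuity correction; the function names are hypothetical and the sample values are made up. For real analyses, a tested implementation such as `scipy.stats.mannwhitneyu` is the safer choice.

```python
import math


def average_ranks(values):
    """1-based ranks of the values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:n1])              # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mu = n1 * n2 / 2                  # mean of U under the null hypothesis
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


if __name__ == "__main__":
    # Illustrative ages only, not data from the study.
    patients = [61, 67, 70, 72, 75, 78]
    controls = [48, 52, 55, 59, 63, 66]
    u, p = mann_whitney_u(patients, controls)
    print("U =", u, "p =", p)
```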

**Figure 2**
Schematic flowchart of the analysis.

**Figure 3**
**(A)** Average ROC Curve with standard deviation over 20 model runs; **(B)** Average PR Curve with standard deviation over 20 model runs.

**Figure 4**
The images display the top 20 features ranked by their importance. **(A)** RF embedded feature importance. The boxplots represent the distributions of the feature importance coefficient calculated across all validation folds of the model. **(B)** SHAP summary plot depicting Shapley values for each feature. Each point represents a subject's Shapley value, with the y-axis indicating the corresponding feature and the x-axis representing the Shapley value. The color gradient reflects feature values, ranging from low to high, while features are ordered by mean importance, with more important features positioned toward the top.
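The per-subject Shapley values plotted in panel (B) can be illustrated with a small exact computation. The sketch below is not the SHAP library and not the authors' model: it enumerates all feature coalitions for a hypothetical toy scoring function, and it assumes that "absent" features are replaced by a single baseline vector, which is only one of several possible conventions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one subject by enumerating coalitions.

    Features inside a coalition take the subject's value; all others
    take the baseline value (an assumption of this sketch). The cost
    grows exponentially with the number of features, so this is only
    practical for very small toy models.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical score with one interaction between features 0 and 1."""
    return 0.5 * z[0] + 0.2 * z[1] - 0.3 * z[2] + 0.4 * z[0] * z[1]


if __name__ == "__main__":
    subject = [1.0, 1.0, 1.0]
    baseline = [0.0, 0.0, 0.0]
    phi = shapley_values(toy_model, subject, baseline)
    print("Shapley values:", phi)
    # The values sum to the gap between the subject's score and the baseline score.
    print("sum:", sum(phi), "gap:", toy_model(subject) - toy_model(baseline))
```

In this toy case the interaction term is split equally between the two features involved, and a positive value pushes the score up while a negative value pushes it down, which is how the sign on the x-axis of a summary plot is read.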

**Figure 5**
SHAP dependence plots for **(A)** *Fusobacterium* and *Peptostreptococcus*; **(B)** *Porphyromonas* and *Fusobacterium*.
