Automated Machine Learning and Explainable AI (AutoML-XAI) for Metabolomics: Improving Cancer Diagnostics

Olatomiwa O Bifarin et al.

J Am Soc Mass Spectrom. 2024 Jun 5;35(6):1089-1100. doi: 10.1021/jasms.3c00403. Epub 2024 May 1.

Abstract

Metabolomics generates complex data, necessitating advanced computational methods to derive biological insight. While machine learning (ML) is promising, the challenges of selecting the best algorithms and tuning hyperparameters, particularly for nonexperts, remain. Automated machine learning (AutoML) can streamline this process; however, the issue of interpretability could persist. This research introduces a unified pipeline that combines AutoML with explainable AI (XAI) techniques to optimize metabolomics analysis. We tested our approach on two data sets: renal cell carcinoma (RCC) urine metabolomics and ovarian cancer (OC) serum metabolomics. AutoML, using Auto-sklearn, surpassed standalone ML algorithms like SVM and k-Nearest Neighbors in differentiating between RCC and healthy controls, as well as OC patients and those with other gynecological cancers. The effectiveness of Auto-sklearn is highlighted by its AUC scores of 0.97 for RCC and 0.85 for OC, obtained from the unseen test sets. Importantly, on most of the metrics considered, Auto-sklearn demonstrated a better classification performance, leveraging a mix of algorithms and ensemble techniques. Shapley Additive Explanations (SHAP) provided a global ranking of feature importance, identifying dibutylamine and ganglioside GM3(d34:1) as the top discriminative metabolites for RCC and OC, respectively. Waterfall plots offered local explanations by illustrating the influence of each metabolite on individual predictions. Dependence plots spotlighted metabolite interactions, such as the connection between hippuric acid and one of its derivatives in RCC, and between GM3(d34:1) and GM3(18:1_16:0) in OC, hinting at potential mechanistic relationships. Through decision plots, a detailed error analysis was conducted, contrasting feature importance for correctly versus incorrectly classified samples. In essence, our pipeline emphasizes the importance of harmonizing AutoML and XAI, facilitating both simplified ML application and improved interpretability in metabolomics data science.
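The AutoML step described above can be sketched in a few lines with Auto-sklearn. The snippet below is a minimal illustration rather than the authors' exact configuration: a synthetic table stands in for the metabolomics feature matrix, and the split settings and time budgets are assumed values. Later sketches in the figure captions below reuse the objects defined here.

```python
import autosklearn.classification
import autosklearn.metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Stand-in for a samples x metabolites table with binary class labels
# (e.g., 1 = cancer, 0 = control); replace with the real feature matrix.
X, y = make_classification(n_samples=120, n_features=50, random_state=0)

# Hold out an unseen test set, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Autoscale (mean-center, unit variance), fitting the scaler on the training set only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Let Auto-sklearn search over algorithms and hyperparameters and build an ensemble.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,   # total search budget in seconds (assumed)
    per_run_time_limit=60,         # per-pipeline time limit (assumed)
    metric=autosklearn.metrics.roc_auc,
)
automl.fit(X_train_s, y_train)

# Evaluate the final ensemble on the held-out test set.
y_score = automl.predict_proba(X_test_s)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, y_score))
```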

Keywords: Shapley additive explanations; automated machine learning; cancer biology; explainable AI; metabolomics.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Automated ML-Explainable AI Workflow. (A) Highlight of the challenges associated with ML model selection for nonexperts. Grid and random searches are typically performed by the user to select the best hyperparameters for a model. (B) The Auto-sklearn AutoML system is based on meta-learning, Bayesian optimization, and ensemble construction. (C) Ensemble models constructed via Auto-sklearn can be interpreted with Explainable AI (XAI) techniques such as LIME and SHAP. (D) Application of AutoML and XAI to the RCC urine and OC serum metabolomics data sets. Local interpretable model-agnostic explanations, LIME; Shapley additive explanations, SHAP; renal cell carcinoma, RCC; and ovarian cancer, OC.
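For contrast with panel A, a conventional manual search might look like the sketch below. The SVM grid and cross-validation settings are purely illustrative, and the sketch reuses the training split defined after the abstract.

```python
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# One model family, one hand-picked hyperparameter grid: the manual tuning
# loop that AutoML automates across many model families at once.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(probability=True))])
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": ["scale", 0.01, 0.001],
    "svm__kernel": ["rbf", "linear"],
}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)  # unscaled split; the pipeline autoscales internally
print(search.best_params_, search.best_score_)
```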
Figure 2
Machine learning pipeline. The data set was split into training and test sets, and each was subsequently autoscaled. ML models were built using the training set, and their performances were assessed using the test set. AutoML ensemble models were explained using kernel SHAP.
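A minimal kernel SHAP sketch for the AutoML ensemble, continuing from the sketch after the abstract; the background summary size and nsamples value are illustrative choices, not the paper's settings.

```python
import shap

# Explain the ensemble's predicted probability for the positive (cancer) class.
predict_pos = lambda X: automl.predict_proba(X)[:, 1]

# Summarize the training set into a small background set so that
# KernelExplainer stays tractable, then explain the test samples.
background = shap.kmeans(X_train_s, 25)
explainer = shap.KernelExplainer(predict_pos, background)
shap_values = explainer.shap_values(X_test_s, nsamples=200)
```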
Figure 3
Automated machine learning pipelines. Pipeline profile showing the pipeline primitives, pipeline matrix, and the corresponding ROC AUC scores for the (A) RCC data set and (B) OC data set. Only 20 successful ML pipelines are shown in each case. The horizontal gray bar indicates the ROC AUC values, whereas the vertical gray bar represents the correlation of the primitives with the ROC AUC scores. (C) Pipeline graph for a sample AutoML pipeline. ML pipeline performance over time during model training for the (D) RCC data set and (E) OC data set. The scores reported include the single best score on the internal training set, the single best optimization score, and the ensemble optimization score.
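The pipeline summaries and performance-over-time curves in this figure can be approximated directly from a fitted Auto-sklearn object, assuming a version of auto-sklearn that exposes the leaderboard() method and the performance_over_time_ attribute:

```python
import matplotlib.pyplot as plt

# Ranked table of the pipelines that made it into the final ensemble.
print(automl.leaderboard())

# Validation performance of the single best model and of the growing
# ensemble over the course of the search.
automl.performance_over_time_.plot(x="Timestamp", kind="line", grid=True)
plt.ylabel("ROC AUC")
plt.show()
```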
Figure 4
Machine learning interpretations of the ensemble model constructed by AutoML for the RCC data set. (A) Beeswarm plot and (B) summary plot showing global interpretation of the model. (C) Waterfall plot, local explanation for a true positive (RCC) sample. (D) Waterfall plot, local explanation for a true negative (healthy control) sample. (E) Dependence plot showing the interaction between hippuric acid and the hippurate-mannitol derivative. (F) Decision plot highlighting true positive and false negative samples.
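The SHAP views named in this caption can be produced from the kernel SHAP values computed above; the sample index, feature names, and interacting feature below are placeholders, not the metabolites reported in the paper.

```python
import numpy as np
import shap

feature_names = [f"metabolite_{i}" for i in range(X_test_s.shape[1])]
base_value = float(np.atleast_1d(explainer.expected_value)[0])

# (A, B) Global views: beeswarm and mean-|SHAP| bar summary.
shap.summary_plot(shap_values, X_test_s, feature_names=feature_names)
shap.summary_plot(shap_values, X_test_s, feature_names=feature_names, plot_type="bar")

# (C, D) Local explanation for a single test sample (waterfall).
i = 0
shap.plots.waterfall(shap.Explanation(
    values=shap_values[i],
    base_values=base_value,
    data=X_test_s[i],
    feature_names=feature_names,
))

# (E) Dependence plot for one feature, colored by a chosen interacting feature.
shap.dependence_plot(0, shap_values, X_test_s, feature_names=feature_names,
                     interaction_index=1)

# (F) Decision plot for a subset of samples.
shap.decision_plot(base_value, shap_values[:10], feature_names=feature_names)
```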
Figure 5
Machine learning interpretations of the ensemble model constructed by AutoML for the OC data set. (A) Beeswarm plot showing global interpretation of the model. (B) Waterfall plot, local explanation for a true positive (OC) sample. (C) Waterfall plot, local explanation for a true negative (non-OC) sample. (D) Dependence plot showing the interaction between GM3(d34:1) and GM3(18:1_16:1). (E) Decision plot highlighting true positive and false negative samples for OC and non-OC classification.
Figure 6
Error analysis decision plots for the RCC diagnostic model. (A) Decision plot for all true negative samples. (B) Decision plot for all false positive samples. (C) Feature importance rank correlation between true negative and false positive samples. (D) Changes in feature importance rank between true negative and false positive samples. (E) Decision plot for all true positive samples. (F) Decision plot for all false negative samples. (G) Feature importance rank correlation between true positive and false negative samples. (H) Changes in feature importance rank between true positive and false negative samples. Tau is Kendall's tau rank correlation coefficient.
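The rank-correlation step in this error analysis can be sketched as below, continuing from the earlier snippets: features are ranked by mean absolute SHAP value within each outcome group, and the two rankings are compared with Kendall's tau. The 0.5 decision threshold and the grouping into true negatives versus false positives (panels C and D) are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import kendalltau

# Predicted labels on the test set from the AutoML ensemble.
y_pred = (automl.predict_proba(X_test_s)[:, 1] >= 0.5).astype(int)

tn_mask = (y_test == 0) & (y_pred == 0)  # true negatives
fp_mask = (y_test == 0) & (y_pred == 1)  # false positives

def importance_rank(sv):
    """Rank features (0 = most important) by mean absolute SHAP value."""
    mean_abs = np.abs(sv).mean(axis=0)
    return np.argsort(np.argsort(-mean_abs))

# Compare the two rankings; both groups must be non-empty for this to be defined.
tau, p_value = kendalltau(importance_rank(shap_values[tn_mask]),
                          importance_rank(shap_values[fp_mask]))
print(f"Kendall's tau (TN vs FP feature ranks): {tau:.2f} (p = {p_value:.3f})")
```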
