Observational Study

Int J Mol Sci. 2024 May 14;25(10):5331.

doi: 10.3390/ijms25105331.

Machine Learning Approach to Metabolomic Data Predicts Type 2 Diabetes Mellitus Incidence

Andreas Leiherer¹,²,³, Axel Muendlein¹, Sylvia Mink²,³, Arthur Mader¹,⁴, Christoph H Saely¹,³,⁴, Andreas Festa¹, Peter Fraunberger²,³, Heinz Drexel¹,³,⁵,⁶

Affiliations

¹ Vorarlberg Institute for Vascular Investigation and Treatment (VIVIT), A-6800 Feldkirch, Austria.
² Central Medical Laboratories, A-6800 Feldkirch, Austria.
³ Faculty of Medical Sciences, Private University of the Principality of Liechtenstein, FL-9495 Triesen, Liechtenstein.
⁴ Department of Internal Medicine III, Academic Teaching Hospital Feldkirch, A-6800 Feldkirch, Austria.
⁵ Vorarlberger Landeskrankenhausbetriebsgesellschaft, Academic Teaching Hospital Feldkirch, A-6800 Feldkirch, Austria.
⁶ Drexel University College of Medicine, Philadelphia, PA 19129, USA.

PMID: 38791370
PMCID: PMC11120685
DOI: 10.3390/ijms25105331

Andreas Leiherer et al. Int J Mol Sci. 2024.

Abstract

Metabolomics, with its wealth of data, offers a valuable avenue for enhancing predictions and decision-making in diabetes. This observational study aimed to leverage machine learning (ML) algorithms to predict the 4-year risk of developing type 2 diabetes mellitus (T2DM) using targeted quantitative metabolomics data. A cohort of 279 cardiovascular risk patients who underwent coronary angiography and who were initially free of T2DM according to American Diabetes Association (ADA) criteria was analyzed at baseline, including anthropometric data and targeted metabolomics obtained by liquid chromatography (LC)-mass spectrometry (MS) and flow injection analysis (FIA)-MS. All patients were followed for four years. During this time, 11.5% of the patients developed T2DM. After data preprocessing, 362 variables were used for ML, employing the caret package in R. The dataset was divided into training and test sets (75:25 ratio), and an oversampling approach was used to address the class imbalance of T2DM incidence. An additional recursive feature elimination step identified a set of 77 variables that were the most valuable for model generation. With these, a Support Vector Machine (SVM) model with a linear kernel demonstrated the most promising predictive capabilities, exhibiting an F1 score of 50%, a specificity of 93%, and balanced and unbalanced accuracies of 72% and 88%, respectively. The top-ranked features were bile acids, ceramides, amino acids, and hexoses, whereas anthropometric features such as age, sex, waist circumference, or body mass index had no contribution. In conclusion, ML analysis of metabolomics data is a promising tool for identifying individuals at risk of developing T2DM and opens avenues for personalized and early intervention strategies.
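The reported metrics can diverge sharply when only about one patient in nine develops T2DM, which is why the abstract gives both a balanced and an unbalanced accuracy. The Python sketch below computes all four metrics from a binary confusion matrix. The counts are hypothetical: the abstract does not report a confusion matrix, and these numbers were chosen only because they round to the reported percentages for a test set of plausible size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix; 'positive' means incident T2DM."""
    tp: int
    fp: int
    tn: int
    fn: int

    def sensitivity(self) -> float:
        # Share of true T2DM cases that the model flags.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of non-cases that the model correctly leaves unflagged.
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # Share of flagged patients who are true cases.
        return self.tp / (self.tp + self.fp)

    def f1(self) -> float:
        # Harmonic mean of precision and sensitivity.
        p, r = self.precision(), self.sensitivity()
        return 2 * p * r / (p + r)

    def accuracy(self) -> float:
        # Unbalanced accuracy: dominated by the majority class.
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def balanced_accuracy(self) -> float:
        # Mean of the per-class recalls, so both classes weigh equally.
        return (self.sensitivity() + self.specificity()) / 2


# Hypothetical counts (NOT from the paper): 69 test patients, 8 of them cases.
cm = ConfusionMatrix(tp=4, fp=4, tn=57, fn=4)

for name in ("f1", "specificity", "balanced_accuracy", "accuracy"):
    print(f"{name}: {getattr(cm, name)():.1%}")
```

With these illustrative counts, half of the true cases are missed, yet unbalanced accuracy still rounds to 88% because the 57 correctly classified non-cases dominate. Balanced accuracy (72%) and F1 (50%) expose that gap.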

Keywords: ML; accuracy; artificial intelligence; diabetes; incidence; machine learning; metabolomics; support vector machine.

Conflict of interest statement

No potential conflicts of interest relevant to this article are reported by A.L., A.M. (Axel Muendlein), S.M., C.H.S., A.M. (Arthur Mader), A.F., P.F., and H.D.

Figures

**Figure 1**
Identifying important variables by recursive feature elimination. Recursive feature elimination (RFE) helps to identify important and less important variables and to define the optimal size of ML models, as summarized in Table 2. The plot represents the output of an RFE process generating different models (black dots). It depicts the relation between the feature subset size (the number of variables available for modelling, 1–362) and the resulting performance metric (accuracy = (true positive + true negative)/(true positive + false positive + true negative + false negative)). Using Random Forest (**left**) and TreeBag algorithms (**right**) as functions in RFE, the best models (highlighted as blue dots) were calculated to have 77 and 362 variables, respectively. The process involves repeated cross-validation (method = repeatedcv (10-fold repeated 5 times)) to evaluate the performance of feature subsets. The plots were generated by ggplot using the caret package in R (CRAN, R [11]).
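The resampling scheme named in the caption, 10-fold cross-validation repeated 5 times, yields 50 train/held-out splits per candidate feature subset. The Python sketch below shows how such splits can be generated. It is an unstratified simplification and not caret's implementation; the function name, the seed, and the sample count of 209 (roughly 75% of 279) are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(
    n_samples: int, k: int = 10, repeats: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, held_out_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        # Reshuffle once per repeat so each repeat uses a different partition.
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(k):
            # Every k-th shuffled index, starting at `fold`, forms one fold.
            held_out = indices[fold::k]
            held_out_set = set(held_out)
            train = [i for i in indices if i not in held_out_set]
            yield train, held_out


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Accuracy as defined in the caption."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical training-set size, assumed for illustration only.
splits = list(repeated_kfold(n_samples=209, k=10, repeats=5))
print(len(splits))
```

In an RFE run, a model would be fitted on each `train` list and scored with `accuracy` on the matching `held_out` list. The mean over all 50 splits gives one point in the plot for that subset size.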

**Figure 2**
Importance of model variables. The figure depicts the most important variables and the respective importance scores according to the “varImp()” function in caret (CRAN, R [11]). Here, the top 20 variables of the “svmLinear2” model are displayed, including hexoses, amino acids (glycine, isoleucine, tyrosine, valine), bile acids (chenodeoxycholic acid = CDCA, deoxycholic acid = DCA, ursodeoxycholic acid = UDCA, glycoursodeoxycholic acid = GUDCA, lithocholic acid = LCA, cholic acid = CA), ceramides (N-C 18:1-Cer, N-C 14:0-Cer, N-C 15:0-Cer(H2)), energy metabolism intermediates (alpha-ketoglutaric acid, lactic acid), glycerophospholipids (PE aa C38:1, PC ae C38:5, PE ae C40:6), and a biogenic amine (kynurenine).

**Figure 3**
SHAP diagram of feature importance. The beeswarm plot illustrates the most important features (variables) and the contribution of these individual features to the model’s output using Shapley Additive Explanation (SHAP) values. Each dot represents a SHAP value for a feature and a specific data point, indicating the magnitude and direction of the feature’s impact on the model’s prediction relative to the baseline. The y-axis shows the variable names, in order of importance from top to bottom, and the x-axis shows the SHAP value scale, which indicates how large the impact of the respective variable is on the model output (T2DM incidence). The gradient color indicates the original value for that variable. C5:1 represents tiglylcarnitine, C10:1 decenoylcarnitine, C16 hexadecanoylcarnitine, C14:2 tetradecadienylcarnitine, GUDCA glycoursodeoxycholic acid, CA carnitine, PS aa C:34:2 a phosphatidylserine with a diacyl bond, and N-C13:0-Cer(2H) a dihydroceramide.
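For a model with a linear decision function f(x) = b + Σ wᵢxᵢ, and under the assumption that features are independent, the SHAP value of feature i for one patient has a closed form: wᵢ(xᵢ − mean(xᵢ)), where the mean is taken over a background dataset. The Python sketch below illustrates this and checks that the values sum to the difference between the prediction and the baseline. The caption does not state how the SHAP values in the figure were computed, so this is an illustration of the concept, not the authors' procedure. All weights and data values are made up.

```python
from statistics import fmean
from typing import List, Sequence


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Linear decision function f(x) = bias + sum(w_i * x_i)."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def linear_shap(
    weights: Sequence[float],
    background: Sequence[Sequence[float]],
    x: Sequence[float],
) -> List[float]:
    """SHAP values for a linear model, assuming independent features."""
    # Column means of the background data define the baseline input.
    means = [fmean(col) for col in zip(*background)]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


# Made-up model and data for three hypothetical features.
weights = [0.8, -0.5, 0.3]
bias = -0.2
background = [[1.0, 2.0, 0.0], [3.0, 0.0, 1.0], [2.0, 1.0, 2.0]]
x = [3.0, 0.5, 2.0]

phi = linear_shap(weights, background, x)
baseline = fmean([predict(weights, bias, row) for row in background])

# Additivity: the SHAP values sum to f(x) minus the mean prediction.
gap = predict(weights, bias, x) - baseline
print(phi, sum(phi), gap)
```

Each dot in a beeswarm plot corresponds to one such value for one patient and one feature. A positive value pushes the prediction above the baseline and a negative value pushes it below.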

Cited by

Special Issue "Machine Learning and Bioinformatics in Human Health and Disease"-Chances and Challenges.
Leiherer A. Int J Mol Sci. 2024 Nov 28;25(23):12811. doi: 10.3390/ijms252312811. PMID: 39684521. Free PMC article.
Identification of novel diagnostic biomarkers associated with liver metastasis in colon adenocarcinoma by machine learning.
Yang L, Tian Y, Cao X, Wang J, Luo B. Discov Oncol. 2024 Oct 10;15(1):542. doi: 10.1007/s12672-024-01398-y. PMID: 39390264. Free PMC article.
Bioinformatics identification of key microRNA-correlated genes associated with hepatocellular carcinoma heterogeneity and prognosis.
Su G, Li Y, Wang J, Liu S, Pan G, Zhan D. BMC Gastroenterol. 2025 Jul 1;25(1):452. doi: 10.1186/s12876-025-04031-6. PMID: 40596892. Free PMC article.
Ceramides in cardiovascular disease: emerging role as independent risk predictors and novel therapeutic targets.
Klingenberg R, Leiherer A, Dobrev D, Kaski JC, Levkau B, März W, Sossalla S, von Eckardstein A, Drexel H. Cardiovasc Res. 2025 Aug 14;121(9):1345-1358. doi: 10.1093/cvr/cvaf093. PMID: 40460239. Free PMC article. Review.
Metabolomics: Uncovering Insights into Obesity and Diabetes.
Fazliana M, Gee T, Lim SY, Tsen PY, Nor Hanipah Z, Zainal Abidin NA, Zhuan TY, Mohkiar FH, Ahmad Zamri L, Ahmad H, Draman MS, Yusaini NS, Mohd Nawi MN. Int J Mol Sci. 2025 Jun 27;26(13):6216. doi: 10.3390/ijms26136216. PMID: 40649995. Free PMC article.

References

1. Thornton J.M., Shah N.M., Lillycrop K.A., Cui W., Johnson M.R., Singh N. Multigenerational diabetes mellitus. Front. Endocrinol. 2024;14:1245899. doi: 10.3389/fendo.2023.1245899.
2. Slieker R.C., Donnelly L.A., Akalestou E., Lopez-Noriega L., Melhem R., Güneş A., Azar F.A., Efanov A., Georgiadou E., Muniangi-Muhitu H., et al. Identification of biomarkers for glycaemic deterioration in type 2 diabetes. Nat. Commun. 2023;14:2533. doi: 10.1038/s41467-023-38148-7.
3. Liu J., Semiz S., van der Lee S.J., van der Spek A., Verhoeven A., van Klinken J.B., Sijbrands E., Harms A.C., Hankemeier T., van Dijk K.W., et al. Metabolomics based markers predict type 2 diabetes in a 14-year follow-up study. Metabolomics. 2017;13:104. doi: 10.1007/s11306-017-1239-2.
4. Sharma T., Shah M. A comprehensive review of machine learning techniques on diabetes detection. Vis. Comput. Ind. Biomed. Art. 2021;4:30. doi: 10.1186/s42492-021-00097-7.
5. Artzi N.S., Shilo S., Hadar E., Rossman H., Barbash-Hazan S., Ben-Haroush A., Balicer R.D., Feldman B., Wiznitzer A., Segal E. Prediction of gestational diabetes based on nationwide electronic health records. Nat. Med. 2020;26:71–76. doi: 10.1038/s41591-019-0724-8.
