Observational Study
Int J Mol Sci. 2024 May 14;25(10):5331.
doi: 10.3390/ijms25105331.

Machine Learning Approach to Metabolomic Data Predicts Type 2 Diabetes Mellitus Incidence

Andreas Leiherer et al. Int J Mol Sci. 2024.

Abstract

Metabolomics, with its wealth of data, offers a valuable avenue for enhancing predictions and decision-making in diabetes. This observational study aimed to leverage machine learning (ML) algorithms to predict the 4-year risk of developing type 2 diabetes mellitus (T2DM) using targeted quantitative metabolomics data. A cohort of 279 cardiovascular risk patients who underwent coronary angiography and who were initially free of T2DM according to American Diabetes Association (ADA) criteria was analyzed at baseline, including anthropometric data and targeted metabolomics, using liquid chromatography (LC)-mass spectrometry (MS) and flow injection analysis (FIA)-MS. All patients were followed for four years. During this time, 11.5% of the patients developed T2DM. After data preprocessing, 362 variables were used for ML, employing the Caret package in R. The dataset was divided into training and test sets (75:25 ratio), and we used an oversampling approach to address the class imbalance of T2DM incidence. An additional recursive feature elimination step identified a set of 77 variables that were the most valuable for model generation. With these, a Support Vector Machine (SVM) model with a linear kernel demonstrated the most promising predictive capabilities, exhibiting an F1 score of 50%, a specificity of 93%, and balanced and unbalanced accuracies of 72% and 88%, respectively. The top-ranked features were bile acids, ceramides, amino acids, and hexoses, whereas anthropometric features such as age, sex, waist circumference, or body mass index made no contribution. In conclusion, ML analysis of metabolomics data is a promising tool for identifying individuals at risk of developing T2DM and opens avenues for personalized and early intervention strategies.
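To make the reported metrics concrete, the following minimal Python sketch computes F1, specificity, accuracy, and balanced accuracy from confusion-matrix counts. The counts used here are hypothetical and are not taken from the study; they were chosen only because they give values close to the figures quoted in the abstract, which illustrates how a high unbalanced accuracy can coexist with a modest F1 score when the positive class is rare.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one actual positive,
    one actual negative, and one predicted positive).
    """
    sensitivity = tp / (tp + fn)          # recall for the positive class
    specificity = tn / (tn + fp)          # recall for the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical test-set counts for illustration only (NOT from the study).
metrics = classification_metrics(tp=4, fp=4, tn=57, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Balanced accuracy averages the per-class recalls, so it is not inflated by the majority class in the way plain accuracy is.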

Keywords: ML; accuracy; artificial intelligence; diabetes; incidence; machine learning; metabolomics; support vector machine.


Conflict of interest statement

No potential conflicts of interest relevant to this article are reported by A.L., A.M. (Axel Muendlein), S.M., C.H.S., A.M. (Arthur Mader), A.F., P.F., and H.D.

Figures

Figure 1
Identifying important variables by recursive feature elimination. Recursive feature elimination helps to identify important and less important variables and to define the optimal size of ML models, as summarized in Table 2. The plot represents the output of an RFE process generating different models (black dots). It depicts the relation between the different feature subset sizes (=number of available variables (1–362)) for modelling and the resulting performance metric (accuracy = (true positive + true negative)/(true positive + false positive + true negative + false negative)). Using Random Forest (left) and TreeBag algorithms (right) as functions in RFE, the best models (highlighted as blue dots) were calculated to have 77 and 362 variables, respectively. The process involves repeated cross-validation (method = repeatedcv (10-fold repeated 5 times)) to evaluate the performance of feature subsets. The plots were generated by ggplot using the Caret package in R (CRAN, R [11]).
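The resampling scheme named in the caption (10-fold cross-validation repeated 5 times) can be sketched in plain Python as follows. This is an illustrative sketch, not the Caret implementation: the function name `repeated_kfold`, the sample count of 200, and the seed are assumptions, and the folds here are not stratified by outcome.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx


# Hypothetical training-set size, for illustration only.
splits = list(repeated_kfold(n_samples=200))
print(len(splits))  # number of train/test resamples
```

In a feature-elimination loop, each candidate subset size would be scored by averaging the chosen metric (here, accuracy) over all of these resamples.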
Figure 2
Importance of model variables. The figure depicts the most important variables and the respective importance scores according to the “varImp()” function in Caret (CRAN, R [11]). Here, the top 20 variables of the “svmLinear2” model are displayed, including hexoses, amino acids (glycine, isoleucine, tyrosine, valine), bile acids (chenodeoxycholic acid = CDCA, deoxycholic acid = DCA, ursodeoxycholic acid = UDCA, glycoursodeoxycholic acid = GUDCA, lithocholic acid = LCA, cholic acid = CA), ceramides (N-C 18:1-Cer, N-C 14:0-Cer, N-C 15:0-Cer(H2)), energy metabolism intermediates (alpha-ketoglutaric acid, lactic acid), glycerophospholipids (PE aa C38:1, PC ae C38:5, PE ae C40:6), and a biogenic amine (kynurenine).
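One common model-agnostic way to score a variable is the area under the ROC curve obtained when that variable alone is used to separate the two classes. The sketch below shows this idea in plain Python. Whether the importance scores in the figure were derived this way is an assumption; the feature names and values are hypothetical, and the direction-agnostic step `max(auc, 1 - auc)` is a choice made for this example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def rank_features(columns, labels):
    """Rank features by univariate AUC, ignoring the direction of the effect."""
    ranked = []
    for name, values in columns.items():
        auc = roc_auc(values, labels)
        ranked.append((name, max(auc, 1 - auc)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 4 non-cases (0) followed by 4 incident cases (1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "feature_a": [1, 2, 3, 4, 5, 6, 7, 8],  # separates the classes perfectly
    "feature_b": [5, 1, 6, 2, 7, 3, 8, 4],  # partially informative
    "feature_c": [1, 2, 1, 2, 1, 2, 1, 2],  # uninformative
}
ranked = rank_features(columns, labels)
for name, score in ranked:
    print(name, score)
```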
Figure 3
SHAP diagram of feature importance. The beeswarm plot illustrates the most important features (variables) and the contribution of these individual features to the model’s output using Shapley Additive Explanation (SHAP) values. Each dot represents a SHAP value for a feature and a specific data point, indicating the magnitude and direction of the feature’s impact on the model’s prediction relative to the baseline. The y-axis lists the variable names in order of importance from top to bottom, and the x-axis shows the SHAP value scale, which indicates how large the impact of the respective variable is on the model output (T2DM incidence). The gradient color indicates the original value for that variable. C5:1 represents tiglylcarnitine, C10:1 decenoylcarnitine, C16 hexadecanoylcarnitine, C14:2 tetradecadienylcarnitine, GUDCA glycoursodeoxycholic acid, CA carnitine, PS aa C:34:2 a phosphatidylserine with a diacyl bond, and N-C13:0-Cer(2H) a dihydroceramide.
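For a linear model, SHAP values have a simple closed form when the features are treated as independent: the contribution of feature j for a sample x is w_j * (x_j - mean_j), and the contributions sum to the difference between the sample's decision value and the average decision value over the background data. The following Python sketch illustrates this additivity property. The weights, bias, and data are hypothetical, and the source does not state how the SHAP values in the figure were computed.

```python
def decision_value(weights, bias, x):
    """Linear decision function f(x) = bias + sum_j w_j * x_j."""
    return bias + sum(w * xj for w, xj in zip(weights, x))


def linear_shap(weights, background, x):
    """SHAP values of a linear model, assuming independent features."""
    n = len(background)
    means = [sum(row[j] for row in background) / n for j in range(len(weights))]
    return [w * (xj - m) for w, xj, m in zip(weights, x, means)]


# Hypothetical model and data, for illustration only.
weights = [0.8, -0.5, 0.1]
bias = -0.2
background = [
    [1.0, 2.0, 0.0],
    [3.0, 0.0, 4.0],
    [2.0, 1.0, 2.0],
]
x = [3.0, 0.0, 4.0]

phi = linear_shap(weights, background, x)
baseline = sum(decision_value(weights, bias, row) for row in background) / len(background)

# Additivity: the baseline plus all contributions recovers the model output.
print(phi)
print(baseline + sum(phi), decision_value(weights, bias, x))
```

In a beeswarm plot, each dot corresponds to one such per-sample, per-feature contribution, colored by the underlying feature value.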


