A machine learning approach for type 2 diabetes diagnosis and prognosis using tailored heterogeneous feature subsets
- PMID: 40198441
- PMCID: PMC12402034
- DOI: 10.1007/s11517-025-03355-5
Abstract
Type 2 diabetes (T2D) is becoming one of the leading health problems in Western societies, diminishing quality of life and consuming a significant share of healthcare resources. This study presents machine learning models for T2D diagnosis and prognosis, developed using heterogeneous data from a Spanish population dataset (Di@bet.es study). The models were trained exclusively on individuals classified as controls and undiagnosed diabetics, ensuring that the results are not influenced by treatment effects or behavioral changes due to disease awareness. Two data domains are considered: environmental (patient lifestyle questionnaires and measurements) and clinical (biochemical and anthropometric measurements). The preprocessing pipeline consists of four key steps: geospatial data extraction, feature engineering, missing data imputation, and quasi-constancy filtering. Two working scenarios (Environmental and Healthcare) are defined based on the features used, and applied to two targets (diagnosis and prognosis), resulting in four distinct models. The feature subsets that best predict the target have been identified based on permutation importance and sequential backward selection, reducing the number of features and, consequently, the cost of predictions. In the Environmental scenario, models achieved an AUROC of 0.86 for diagnosis and 0.82 for prognosis. The Healthcare scenario performed better, with an AUROC of 0.96 for diagnosis and 0.88 for prognosis. A partial dependence analysis of the most relevant features is also presented. An online demo page showcasing the Environmental and Healthcare T2D prognosis models is available upon request.
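Two of the steps named in the abstract, quasi-constancy filtering and permutation importance scored by AUROC, can be illustrated with a minimal sketch. It is not the authors' implementation: the 0.95 dominance threshold, the toy data, the column meanings, and the `toy_score` stand-in model are all assumptions made for illustration, and sequential backward selection is not covered.

```python
import random
from collections import Counter


def auroc(y_true, scores):
    """AUROC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def quasi_constant_columns(X, threshold=0.95):
    """Indices of columns whose most frequent value covers at least `threshold` of the rows.

    The threshold is an assumed value, not one taken from the paper.
    """
    flagged = []
    for j in range(len(X[0])):
        top_count = Counter(row[j] for row in X).most_common(1)[0][1]
        if top_count / len(X) >= threshold:
            flagged.append(j)
    return flagged


def permutation_importance(score_fn, X, y, n_repeats=20, seed=0):
    """Mean drop in AUROC per column when that column alone is shuffled."""
    rng = random.Random(seed)
    baseline = auroc(y, [score_fn(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - auroc(y, [score_fn(r) for r in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Hypothetical toy data: [informative marker, unrelated noise, near-constant flag].
X = [[5.1, 0.3, 0], [5.4, 0.9, 0], [5.0, 0.5, 0], [5.6, 0.1, 0],
     [6.9, 0.4, 0], [7.4, 0.8, 0], [7.1, 0.2, 0], [6.8, 0.7, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def toy_score(row):
    """Stand-in for a trained model: the risk score is simply the first column."""
    return row[0]


flagged = quasi_constant_columns(X)
baseline, importances = permutation_importance(toy_score, X, y)
print("quasi-constant columns:", flagged)
print("baseline AUROC:", baseline)
print("permutation importances:", importances)
```

In this toy setup only the first column should show a positive importance, since the stand-in model ignores the other two. A feature with near-zero importance is a candidate for removal, which is the intuition behind reducing the feature subset and, with it, the cost of each prediction.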
Keywords: Diagnosis and prognosis risk estimation; Feature selection; Geospatial data augmentation; Heterogeneous missing data imputation; Quasi-constancy heuristic; Type 2 diabetes mellitus.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethical approval: The Ethics Committee of Valencian Clinical Hospital gave ethical approval for this work (references 2017.184 and 2031/036). Consent to participate: Written informed consent was obtained from all patients in accordance with the recommendations of the Declaration of Human Rights, the Conference of Helsinki, and institutional regulations. Conflict of interest: The authors declare no competing interests.
