Med Biol Eng Comput. 2025 Sep;63(9):2733-2752.
doi: 10.1007/s11517-025-03355-5. Epub 2025 Apr 8.

A machine learning approach for type 2 diabetes diagnosis and prognosis using tailored heterogeneous feature subsets


J Ramón Navarro-Cerdán et al. Med Biol Eng Comput. 2025 Sep.

Abstract

Type 2 diabetes (T2D) is becoming one of the leading health problems in Western societies, diminishing quality of life and consuming a significant share of healthcare resources. This study presents machine learning models for T2D diagnosis and prognosis, developed using heterogeneous data from a Spanish population dataset (Di@bet.es study). The models were trained exclusively on individuals classified as controls and undiagnosed diabetics, ensuring that the results are not influenced by treatment effects or behavioral changes due to disease awareness. Two data domains are considered: environmental (patient lifestyle questionnaires and measurements) and clinical (biochemical and anthropometric measurements). The preprocessing pipeline consists of four key steps: geospatial data extraction, feature engineering, missing data imputation, and quasi-constancy filtering. Two working scenarios (Environmental and Healthcare) are defined based on the features used, and applied to two targets (diagnosis and prognosis), resulting in four distinct models. The feature subsets that best predict the target have been identified based on permutation importance and sequential backward selection, reducing the number of features and, consequently, the cost of predictions. In the Environmental scenario, models achieved an AUROC of 0.86 for diagnosis and 0.82 for prognosis. The Healthcare scenario performed better, with an AUROC of 0.96 for diagnosis and 0.88 for prognosis. A partial dependence analysis of the most relevant features is also presented. An online demo page showcasing the Environmental and Healthcare T2D prognosis models is available upon request.
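The permutation importance step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`auroc`, `permutation_importance`), the `predict` callable, the `n_repeats` and `seed` parameters, and the choice of AUROC as the score being degraded are all assumptions made for illustration. The idea is to shuffle one feature column at a time and record how much the score drops relative to the unshuffled baseline.

```python
import random


def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean AUROC drop per feature when that feature's column is shuffled.

    predict: hypothetical callable mapping one row (list of values) to a risk score.
    X: list of rows, each a list of feature values; y: list of 0/1 labels.
    Returns (baseline_auroc, list_of_mean_drops).
    """
    rng = random.Random(seed)
    baseline = auroc(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - auroc(y, [predict(r) for r in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances
```

A feature whose mean drop is indistinguishable from zero is a candidate for removal, which is the kind of pruning a backward selection procedure would then confirm by refitting without it.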

Keywords: Diagnosis and prognosis risk estimation; Feature selection; Geospatial data augmentation; Heterogeneous missing data imputation; Quasi-constancy heuristic; Type 2 diabetes mellitus.
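As a rough illustration of what a quasi-constancy filter can look like, the sketch below flags features whose most frequent observed value covers nearly all observations. The abstract does not specify the heuristic used in the study, so the rule, the `threshold` default, and the function name `quasi_constant_features` are assumptions, not the paper's method.

```python
from collections import Counter


def quasi_constant_features(rows, threshold=0.99):
    """Return names of features dominated by a single value.

    rows: list of dicts mapping feature name -> value (None marks a missing value).
    threshold: assumed cutoff on the share of the most frequent observed value.
    """
    flagged = []
    for name in rows[0]:
        observed = [r[name] for r in rows if r[name] is not None]
        if not observed:
            flagged.append(name)  # nothing observed: carries no information
            continue
        top_count = Counter(observed).most_common(1)[0][1]
        if top_count / len(observed) >= threshold:
            flagged.append(name)
    return flagged
```

Features returned by this function would be dropped before model fitting; a lower threshold removes more features at the risk of discarding rare but informative values.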


Conflict of interest statement

Declarations. Ethical approval: The Ethics Committee of Valencian Clinical Hospital gave ethical approval for this work (references 2017.184 and 2031/036). Consent to participate: Written informed consent, in accordance with the recommendations of the Declaration of Human Rights, the Conference of Helsinki, and institutional regulations, was obtained from all patients. Conflict of interest: The authors declare no competing interests.

Figures

Fig. 1
Main sequence of processing steps applied to all data. Feature-dependent processes were applied, such as removing redundancy among features, data imputation using univariate or multivariate methods, and quasi-constancy filtering
Fig. 2
Example of geospatial features: NO2 (left) and As (right) pollution across Spain. Geospatial features are fully imputed through an interpolation of the available external geospatial data [25] using Expression (2)
Fig. 3
Missingness matrix of the Di@bet.es dataset after performing feature engineering. Each column represents a feature, and each row represents an observation. White spaces indicate missing values, while gray represents observed values. Only features with a missingness fraction equal to or greater than 1% are displayed. Rows were reordered using hierarchical clustering based on the presence of missing values to enhance the visualization of potential missingness patterns or correlations among features
Fig. 4
Models for diagnosis (D) and prognosis (P) are built for each scenario (ENV, HEA). The figure outlines the individuals and features that take part in each model
Fig. 5
Left: top PI features in the Diagnosis-Environmental (D-ENV) scenario, sorted by importance. Boxplot colors indicate the sign of Spearman’s r if p-value <0.05. Right: 95% CI of mean PI, marking in blue those strictly in R+
Fig. 6
Left: top PI features in the Diagnosis-Healthcare (D-HEA) scenario, sorted by importance. Boxplot colors indicate the sign of Spearman’s r if p-value <0.05. Right: 95% CI of mean PI, marking in blue those strictly in R+
Fig. 7
Left: top PI features in the Prognosis-Environmental (P-ENV) scenario, sorted by importance. Boxplot colors indicate the sign of Spearman’s r if p-value <0.05. Right: 95% CI of mean PI, marking in blue those strictly in R+
Fig. 8
Left: top PI features in the Prognosis-Healthcare (P-HEA) scenario, sorted by importance. Boxplot colors indicate the sign of Spearman’s r if p-value <0.05. Right: 95% CI of mean PI, marking in blue those strictly in R+
Fig. 9
10-fold cross-validation averaged ROC curves obtained from the Diagnosis-Environmental (D-ENV, left) and Diagnosis-Healthcare (D-HEA, right) models
Fig. 10
10-fold cross-validation averaged ROC curves obtained from the Prognosis-Environmental (P-ENV, left) and Prognosis-Healthcare (P-HEA, right) models
Fig. 11
Partial dependence plots (PDP) reflecting univariate feature contribution to T2D development according to the Prognosis-Environmental (P-ENV) model. The Y-axis represents the partial dependence, or average expected target response, while the X-axis displays the values of each corresponding feature
Fig. 12
Partial dependence plots (PDP) reflecting univariate feature contribution to T2D development according to the Prognosis-Healthcare (P-HEA) model. The Y-axis represents the partial dependence, or average expected target response, while the X-axis displays the values of each corresponding feature
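The partial dependence curves described in the captions of Figs. 11 and 12 can be computed, in principle, as follows. This is a minimal sketch under stated assumptions: `predict` is a hypothetical callable standing in for a fitted model, and `partial_dependence`, `feature`, and `grid` are illustrative names rather than anything taken from the paper. For each grid value, the chosen feature is overwritten in every row and the predictions are averaged, giving the average expected target response plotted on the Y-axis.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction as one feature is forced to each value in `grid`.

    predict: hypothetical callable mapping one row (list of values) to a score.
    X: list of rows, each a list of feature values.
    feature: column index of the feature under study.
    grid: values of that feature at which to evaluate the curve.
    """
    curve = []
    for v in grid:
        # Overwrite the studied feature with v in every row, keep the rest as observed.
        preds = [predict(row[:feature] + [v] + row[feature + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve
```

Because the other features keep their observed values, the curve reflects a univariate effect averaged over the sample; it can mislead when the studied feature is strongly correlated with others, since some of the synthetic rows may be unrealistic.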

