Nat Commun. 2024 May 18;15(1):4230.
doi: 10.1038/s41467-024-48618-1.

AI-enhanced integration of genetic and medical imaging data for risk assessment of Type 2 diabetes

Yi-Jia Huang et al. Nat Commun. 2024.

Abstract

Type 2 diabetes (T2D) presents a formidable global health challenge, highlighted by its escalating prevalence, underscoring the critical need for precision health strategies and early detection initiatives. Leveraging artificial intelligence, particularly eXtreme Gradient Boosting (XGBoost), we devise robust risk assessment models for T2D. Drawing upon comprehensive genetic and medical imaging datasets from 68,911 individuals in the Taiwan Biobank, our models integrate Polygenic Risk Scores (PRS), Multi-image Risk Scores (MRS), and demographic variables, such as age, sex, and T2D family history. Here, we show that our model achieves an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.94, effectively identifying high-risk T2D subgroups. A streamlined model featuring eight key variables also maintains a high AUC of 0.939. This high accuracy for T2D risk assessment promises to catalyze early detection and preventive strategies. Moreover, we introduce an accessible online risk assessment tool for T2D, facilitating broader applicability and dissemination of our findings.
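
The AUC quoted in the abstract can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The following is a minimal sketch of that computation, not the authors' code: the helper name `auc_from_scores` and the scores and labels are hypothetical and are not Taiwan Biobank data.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise case/control comparison.

    A case/control pair is concordant when the case has the higher score;
    a tie counts as half a concordant pair.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Hypothetical predicted risks (illustrative only): 1 = case, 0 = control.
example_scores = [0.91, 0.80, 0.35, 0.62, 0.10, 0.45]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = auc_from_scores(example_scores, example_labels)
print(example_auc)  # 7 of the 9 case/control pairs are concordant
```

The pairwise form is quadratic in the sample size; it is used here only because it makes the definition explicit.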

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Flowchart of genetic-centric analysis. A Data partitioning.
The dataset containing information from 60,747 individuals after data quality control (QC) was divided into several subsets: genome-wide association study (GWAS) samples (Dataset 1, N = 35,688), training samples (Dataset 2, N = 12,236; Dataset 4, N = 40,787), and validation samples (Dataset 3, N = 3060; Dataset 5, N = 10,197). For classification analysis, the testing samples comprised Dataset 6 (N = 8827) and Dataset 7 (N = 936); for prediction analysis, they were denoted Dataset 6′ (N = 8827) and Dataset 7′ (N = 936). B Sample size. The total sample size, along with the numbers of cases and controls, is shown for each of the four phenotype definitions in Datasets 1–7. C Phenotype definition criteria. The definitions and sample sizes for the four Type 2 Diabetes (T2D) phenotype definitions are shown. D Analysis flowchart. The analysis flow comprises three steps: selecting T2D-associated single nucleotide polymorphisms (SNPs) and polygenic risk score (PRS), selecting demographic and environmental covariates, and establishing the best XGBoost model using the selected features. In the first step, SNPs can be chosen from (A) our own GWAS with an adjustment for age, sex, and the top ten principal components (PCs), (B) published studies based on single ethnic populations, and (C) published studies based on multiple ethnic populations. Source data are provided as a Source Data file.
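
The sample sizes in the Fig. 1 caption can be checked arithmetically. The sketch below uses only the numbers quoted above. That Datasets 4 and 5 re-partition the same individuals as Datasets 1 to 3 is an inference from the matching totals, not something the caption states, and the helper `train_fraction` is introduced here only for illustration.

```python
# Sample sizes quoted in the Fig. 1 caption.
total_after_qc = 60_747
dataset = {1: 35_688, 2: 12_236, 3: 3_060, 4: 40_787, 5: 10_197, 6: 8_827, 7: 936}

# Datasets 1, 2, 3, 6, and 7 add up to the post-QC total.
assert dataset[1] + dataset[2] + dataset[3] + dataset[6] + dataset[7] == total_after_qc

# Datasets 4 and 5 together have the same size as Datasets 1-3 together
# (an observation about the totals; the caption does not state the relationship).
assert dataset[4] + dataset[5] == dataset[1] + dataset[2] + dataset[3]


def train_fraction(train_n, valid_n):
    """Share of a training-plus-validation pool that is used for training."""
    return train_n / (train_n + valid_n)


# Both training/validation pairs correspond to roughly an 80/20 split.
print(round(train_fraction(dataset[2], dataset[3]), 3))
print(round(train_fraction(dataset[4], dataset[5]), 3))
```
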
Fig. 2. Flowchart of genetic-image integrative analysis. A Data partitioning and model training.
Phenotype Definition IV was used as an example to illustrate the process. The data containing information from 7786 individuals were divided into four subsets: a training dataset (N = 4689), a validation dataset (N = 1175), and two independent testing datasets (N = 1469 for the first dataset and N = 444 for the second independent dataset). Subsequently, the best XGBoost model was established. B Flowchart of PRS construction. The Polygenic Risk Score (PRS) was constructed using PRS-CSx, utilizing genome-wide association study (GWAS) summary statistics from the European (EUR), East Asian (EAS), and South Asian (SAS) populations obtained from the analysis of the DIAGRAM Project. Source data are provided as a Source Data file.
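
In general, a polygenic risk score is a weighted sum of risk-allele dosages, with one weight per SNP. The sketch below illustrates only that final scoring step under stated assumptions: the function name, the three SNPs, and the weights are hypothetical, and the code does not reproduce PRS-CSx itself or the authors' pipeline.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 copies of the effect allele).

    `weights` stands in for per-SNP effect sizes, such as those estimated
    from GWAS summary statistics; the values used below are made up.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical effect sizes for three SNPs (illustrative only).
snp_weights = [0.10, -0.05, 0.20]

# Hypothetical genotypes for two individuals.
person_a = [0, 1, 2]  # 0 * 0.10 + 1 * (-0.05) + 2 * 0.20
person_b = [2, 2, 0]  # 2 * 0.10 + 2 * (-0.05) + 0 * 0.20

score_a = polygenic_risk_score(person_a, snp_weights)
score_b = polygenic_risk_score(person_b, snp_weights)
print(round(score_a, 2), round(score_b, 2))
```
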
Fig. 3. Model evaluation and comparison.
A bar chart displays the AUCs. A two-sided DeLong test examined the differences between AUCs. Bonferroni’s correction was applied to control the family-wise error rate in multiple comparisons. Symbols *, **, and *** indicate p-values < 0.05, 0.01, and 0.001, respectively. A SNP selection. Model predictors were SNPs selected from published studies or our GWAS under different p-value thresholds, where our GWAS association test is a two-sided Wald test for the slope coefficient in a logistic regression. The average AUCs of prediction models for four phenotype definitions were compared. B T2D Phenotype Definition. In addition to including the selected variables in Fig. 3A, the AUCs of four phenotype definitions were compared. C Family history of T2D. In addition to including the selected variables in Fig. 3A, B, the AUCs of the four types of T2D family history (i.e., (i) parents (binary factor), (ii) sibs (binary factor), (iii) either parents or sibs (binary factor), and (iv) both parents and sibs (ordinal factor)) were compared. D Demographic variables. In addition to including the selected variables in Fig. 3A–C, the AUCs of different combinations of demographic factors, including age, sex, and family history of T2D, were compared. E PRS and demographic variables. In addition to including the selected variables in Fig. 3A–D, the AUCs of different combinations of genetic variables, including SNPs, PRS-CS, and PRS-CSx, and demographic variables, including age, sex, and family history of T2D, were compared. F Impact of including PRS after SNPs. The AUCs of the models that consider SNPs, SNPs+PRS-CS, and SNPs+PRS-CSx as predictors were compared. G Impact of including additional SNPs after PRS. The additional 137 SNPs were collected from published studies (Supplemental Text 2). The AUCs of the models that consider additional SNPs given PRS in the model were compared. Source data are provided as a Source Data file.
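
The Bonferroni correction and the significance symbols described in the Fig. 3 caption can be sketched as follows. The function names and raw p-values are hypothetical. Whether the symbols in the figure refer to adjusted or unadjusted p-values is not stated in the caption, so applying them to adjusted values here is an assumption.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significance_symbol(p):
    """Map a p-value to the symbols used in the caption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical raw p-values from four pairwise AUC comparisons.
raw_p = [0.0002, 0.004, 0.03, 0.5]
adjusted_p = bonferroni(raw_p)
symbols = [significance_symbol(p) for p in adjusted_p]
print(symbols)
```

With four comparisons, a raw p-value must fall below 0.05 / 4 = 0.0125 to remain significant at the 0.05 level after adjustment.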
Fig. 4. Results in the genetic-centric analysis.
A AUCs of all models based on Phenotype Definition IV. A heatmap summarizes the AUCs of all models based on Phenotype Definition IV (i.e., T2D was defined by self-reported T2D, HbA1c, and fasting glucose). The genetic variables are shown on the X-axis, and the demographic variables are shown on the Y-axis. B Positive correlation between PRS and T2D odds ratio. In each decile of PRS based on PRS-CSx, the odds ratio of T2D risk and its 95% confidence interval were calculated based on an unadjusted model (blue line) and an adjusted model considering age, sex, and T2D family history (red line). The reference group was the PRS group in the 40–60% decile. The horizontal bars show the odds ratio estimate (square symbol) ± its 95% confidence interval (left and right ends) at each PRS decile. C High-risk group. In the chart, the figures from the inner to the outer represent (i) the case-to-control ratio, (ii) the number of cases, and (iii) the number of controls in the PRS decile subgroups. D Association of age, sex, T2D family history, and PRS with T2D. In the univariate analysis, the p-values for age, sex, family history, and PRS were 4.17 × 10^-20, 7.08 × 10^-7, 9.41 × 10^-13, and 2.06 × 10^-13, respectively. In the multivariate analysis, the p-values for age, sex, family history, and PRS were 2.00 × 10^-16, 5.56 × 10^-5, 1.43 × 10^-10, and 5.49 × 10^-13, respectively. E Risk factors for T2D. Kaplan-Meier curves reveal that age (older individuals), sex (males), T2D family history (a larger number of parents and siblings who had T2D), and PRS (high-decile PRS group) are risk factors for T2D. F Median event time of T2D. Examples of the median event time for developing T2D are provided based on a multivariate Cox regression model, both without and with incorporating PRS. NA indicates not assessable. Source data are provided as a Source Data file.
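
For the unadjusted comparison of a PRS group against the reference group, an odds ratio and a Wald-type 95% confidence interval can be obtained from a 2 × 2 table of cases and controls. The sketch below shows that calculation with hypothetical counts. The function name is introduced for illustration, and the adjusted model in the caption (age, sex, and T2D family history) would instead require a logistic regression, which is not shown.

```python
import math


def odds_ratio_ci(cases_group, controls_group, cases_ref, controls_ref, z=1.96):
    """Odds ratio of a group versus a reference group, with a Wald 95% CI.

    The interval is computed on the log-odds scale and then exponentiated.
    """
    counts = (cases_group, controls_group, cases_ref, controls_ref)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (cases_group * controls_ref) / (controls_group * cases_ref)
    std_err = math.sqrt(sum(1.0 / c for c in counts))
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * std_err)
    upper = math.exp(log_or + z * std_err)
    return odds_ratio, lower, upper


# Hypothetical counts: a high-PRS group versus the reference group.
or_est, ci_low, ci_high = odds_ratio_ci(
    cases_group=120, controls_group=480, cases_ref=60, controls_ref=540
)
print(round(or_est, 2), round(ci_low, 2), round(ci_high, 2))
```

An interval that excludes 1 indicates that the group's odds of T2D differ from those of the reference group at the 5% level.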
Fig. 5. T2D early detection using our prediction model (Phenotype Definition IV; age, sex, family history, and PRS).
A Four subgroups (N = 550). B Survival rate (N = 550). C Median survival time (N = 550). P-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 0.092, 0.0014 (**), and 2.22 × 10^-16 (***), respectively. D Follow-up time (N = 550). P-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 0.056, 0.32, and 0.14, respectively. E T2D risk (N = 550). P-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 0.018 (*), 0.073, and 0.0039 (**), respectively. F HbA1c (N = 550). At baseline, p-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 2.21 × 10^-14, 0.0039, and 3.00 × 10^-5, respectively; at follow-up, p-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 1.50 × 10^-13, 6.01 × 10^-4, and 4.55 × 10^-6, respectively. G Fasting glucose (N = 550). At baseline, p-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 2.06 × 10^-12, 6.66 × 10^-4, and 1.63 × 10^-2, respectively; at follow-up, p-values of G1 vs. G3, G2 vs. G3, and G4 vs. G3 were 8.30 × 10^-8, 1.38 × 10^-3, and 1.84 × 10^-2, respectively. H Phenotype definition in G3 (N = 395). Many individuals in G3 cannot satisfy the T2D Phenotype Definition IV. In Fig. 5C–G, two-sided Wilcoxon rank-sum tests were applied to compare group differences. The box plots’ center lines indicate the medians, the lower and upper boundaries of the boxes represent the first and third quartiles, and the whiskers extend up to 1.5 times the interquartile range from the edges of the boxes. The violin plots’ lower and upper bounds depict the minimum and maximum values. Source data are provided as a Source Data file.
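
The group comparisons in Fig. 5C–G rely on two-sided Wilcoxon rank-sum tests. The sketch below implements the large-sample normal approximation with midranks for ties and no continuity correction. These choices are assumptions, since the caption does not say which variant was used, and statistical software may use exact distributions for small samples. The function names and HbA1c-like values are hypothetical.

```python
import math


def midranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    w = sum(midranks(combined)[:n1])  # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2
    tie_counts = {}
    for v in combined:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        raise ValueError("all observations are identical")
    z = (w - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical measurements for two groups (illustrative only).
group_a = [5.4, 5.6, 5.7, 5.9, 6.1]
group_b = [6.0, 6.3, 6.5, 6.8, 7.0]
z_stat, p_val = rank_sum_test(group_a, group_b)
print(round(z_stat, 3), round(p_val, 4))
```
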
Fig. 6. Results in the genetic-image integrative analysis.
A Performance comparison of medical imaging data analysis. The area under the receiver operating characteristic (ROC) curve (AUC), accuracy (ACC), sensitivity (SEN), specificity (SPEC), and F1 score are compared for the integrative analysis of four types of medical images (All) and individual medical image analyses, including BMD, ECG, CAU, and ABD. B The model that combines four types of medical imaging, PRS, and demographic variables shows the highest AUC of 0.949. ROC plots and the corresponding AUCs are shown for the models considering medical image features (I), genetic PRS (G), demographic variables (D), including age, sex, and T2D family history, and their combinations. C An optimal model combining medical imaging, PRS, and demographic variables. The best model’s top 20 features with a high feature impact include medical image, genetic, and demographic features. D Positive correlation between MRS and T2D odds ratio. In each decile of MRS based on four types of medical images, the odds ratio of T2D risk and its 95% confidence interval were calculated based on an unadjusted model (blue line) and an adjusted model considering age, sex, and T2D family history (red line), with the MRS group in the 40–60% decile serving as the reference group. The horizontal bars show the odds ratio estimate (square symbol) ± its 95% confidence interval (left and right ends). E High-risk group. The figures from the inner to the outer in the chart display (i) the case-to-control ratio, (ii) the number of cases, and (iii) the number of controls in the MRS decile subgroups. F Input page of the online T2D prediction website. Personal information, including age, sex, family history of T2D, PRS, and MRS, is input to calculate T2D risk. PRS and MRS are optional, and a reference distribution is provided. G Output page of the online T2D prediction website. Source data are provided as a Source Data file.
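
Accuracy, sensitivity, specificity, and F1 score, as compared in Fig. 6A, all follow from the four cells of a confusion matrix at a fixed classification threshold. The sketch below uses hypothetical counts; the function name is introduced for illustration, and the threshold the authors used is not given in the caption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1 from confusion-matrix counts.

    tp/fn: cases predicted positive/negative;
    tn/fp: controls predicted negative/positive.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    if 2 * tp + fp + fn == 0:
        raise ValueError("F1 is undefined without any positives")
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "SEN": tp / (tp + fn),             # true positive rate (recall)
        "SPEC": tn / (tn + fp),            # true negative rate
        "F1": 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    }


# Hypothetical counts: 100 cases and 300 controls (illustrative only).
metrics = classification_metrics(tp=80, fp=30, tn=270, fn=20)
print({name: round(value, 3) for name, value in metrics.items()})
```

Unlike the AUC, these four metrics depend on the chosen threshold, so models should be compared at comparable operating points.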
