J Med Signals Sens. 2025 Apr 19:15:11.
doi: 10.4103/jmss.jmss_29_24. eCollection 2025.

Two Machine-learning Hybrid Models for Predicting Type 2 Diabetes Mellitus

Rahman Farnoosh et al. J Med Signals Sens.

Abstract

Background: The global increase in diabetes prevalence necessitates advanced diagnostic methods. Machine learning has shown promise in disease diagnosis, including diabetes.

Materials and methods: We used a dataset collected from the Medical City Hospital laboratory and the Specialized Center for Endocrinology and Diabetes at Al-Kindy Teaching Hospital in Iraq. The dataset includes 1000 physical examination samples from male and female patients, categorized into three classes: diabetic (Y), nondiabetic (N), and predicted diabetic (P). It contains twelve attributes and includes outliers. Outliers in medical studies can result from unusual disease attributes, so it is necessary to consult a specialist physician when identifying and handling them with statistical methods. The main contribution of this study is the proposal of two hybrid models for diabetes diagnosis in two scenarios: (1) Scenario 1 (outliers present): Hybrid Model 1 combines the K-medoids clustering algorithm with a Gaussian naive Bayes (GNB) classifier based on kernel density estimation (KDE) to handle outliers; and (2) Scenario 2 (outliers removed): Hybrid Model 2 combines the K-means clustering algorithm with a GNB classifier based on KDE with a suitable bandwidth. We applied principal component analysis to reduce dimensionality and evaluated the models using fivefold cross-validation.
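As a rough illustration of the classifier component described above, the following Python sketch implements a naive Bayes classifier whose per-feature class-conditional densities are Gaussian kernel density estimates. It is not the authors' implementation: the class name `KDENaiveBayes`, the fixed `bandwidth` value, and the toy data are assumptions made for illustration, and the clustering (K-medoids or K-means), principal component analysis, and cross-validation steps are omitted.

```python
import math
from collections import defaultdict


def gaussian_kde_logpdf(x, samples, bandwidth):
    """Log of a 1-D Gaussian kernel density estimate evaluated at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    # A tiny floor avoids log(0) when x is far from every training sample.
    return math.log(max(total / norm, 1e-300))


class KDENaiveBayes:
    """Hypothetical helper: naive Bayes with one KDE per class and feature."""

    def __init__(self, bandwidth=1.0):
        # Assumed fixed bandwidth; the abstract only says "suitable bandwidth".
        self.bandwidth = bandwidth

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        n = len(X)
        self.log_prior = {c: math.log(len(rows) / n) for c, rows in by_class.items()}
        # Store each class's training values column by column.
        self.columns = {c: list(zip(*rows)) for c, rows in by_class.items()}
        return self

    def predict_one(self, row):
        def score(c):
            # Naive Bayes: log prior plus the sum of per-feature log densities.
            return self.log_prior[c] + sum(
                gaussian_kde_logpdf(v, col, self.bandwidth)
                for v, col in zip(row, self.columns[c])
            )
        return max(self.columns, key=score)

    def predict(self, X):
        return [self.predict_one(row) for row in X]


# Toy two-feature data (made up); labels reuse the dataset's Y / N class codes.
X_train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.1]]
y_train = ["N", "N", "N", "Y", "Y", "Y"]
model = KDENaiveBayes(bandwidth=0.5).fit(X_train, y_train)
predictions = model.predict([[1.1, 2.1], [5.1, 5.9]])
print(predictions)
```

In a hybrid setup of the kind the abstract outlines, a clustering step would presumably run before this classifier, but how the two are combined is not specified in the abstract, so the sketch leaves that step out.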

Results: All experiments were conducted under identical settings. In both scenarios (handling outliers and rejecting outliers), the proposed hybrid models outperformed the other machine-learning models in this study for diabetes prediction, namely support vector machines (with radial basis function, polynomial, linear, and sigmoid kernels), decision trees (J48), and GNB classifiers. The average accuracy was 0.9743 for Scenario 1 with Hybrid Model 1 and 0.9867 for Scenario 2 with Hybrid Model 2. We also evaluated precision, sensitivity, and F1-score as performance metrics.
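The performance metrics named above can be computed from true and predicted labels as in the following Python sketch. The function name `macro_metrics` and the toy labels are hypothetical, and macro averaging over the three classes is an assumption; the abstract does not state which averaging scheme was used.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, sensitivity (recall), and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    k = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "sensitivity": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Made-up labels using the dataset's class codes: Y, N, and P.
y_true = ["Y", "Y", "Y", "N", "N", "P"]
y_pred = ["Y", "Y", "N", "N", "N", "P"]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Under fivefold cross-validation, one would typically compute these values on each held-out fold and report their mean.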

Conclusion: This study presents two hybrid models for diabetes diagnosis, demonstrating high accuracy in distinguishing between diabetic and nondiabetic patients and effectively handling outliers. The findings highlight the potential of machine-learning techniques for improving the early diagnosis and treatment of diabetes.

Keywords: Decision tree; Gaussian naive Bayes; K-means; K-medoids; diabetes mellitus prediction; kernel density estimation; support vector machine.

Conflict of interest statement

There are no conflicts of interest.

Figures

Figure 1
(a) An example of a three-class classification problem: green, red, and blue. (b) The red-blue line is generated via a one-to-one (one-versus-one) technique to maximize the distance between the blue and red points; the green points play no role in it
Figure 2
Population distribution of all attributes in the Iraqi Patient Dataset for Diabetes, where the green, blue, and yellow distributions indicate diabetic (C0), nondiabetic (C1), and predicted diabetic (C2) individuals, respectively
Figure 3
M1. In the first scenario, which involves the presence of outliers, we used K-medoids clustering followed by naive Bayes based on kernel density estimation
Figure 4
M2. In the second scenario, after outlier rejection, we used K-means clustering followed by naive Bayes based on kernel density estimation
Figure 5
Results from using the K-nearest-neighbor algorithm to fill in missing and empty data
Figure 6
Outliers associated with low-density lipoprotein
Figure 7
Kernel density estimation fits the data better than the normal distribution for (a) the AGE attribute and (b) the TG attribute. Both attributes have different distributions in different regions, so we used a clustering method to divide the data of a class into several clusters, each with the same statistical similarity
Figure 8
Diabetes prediction using all of the features (13). (a) In the first scenario, with outliers present, M1 had a greater AAC in predicting diabetes than the other models. (b) In the second scenario, after outlier rejection, M2 had a greater AAC in predicting diabetes than the other models
Figure 9
Diabetes prediction using all features except blood glucose (12). (a) In the first scenario, with outliers present, M1 had a greater AAC in predicting diabetes than the other models. (b) In the second scenario, after outlier rejection, M2 had a greater AAC in predicting diabetes than the other models, although the AAC decreased in this case compared with the other experiments
Figure 10
Diabetes prediction using principal component analysis to reduce dimensionality. Compared with the other experiments, the highest AAC value is obtained in this case. (a) In the first scenario, with outliers present, M1 had a greater AAC in predicting diabetes than the other models. (b) In the second scenario, after outlier rejection, M2 had a greater AAC in predicting diabetes than the other models, and compared to the other experiments, in this case, the AAC decreased
Figure 11
Changes in the number of first-class clusters can have an impact on the performance of the two models in different scenarios. The number of first- and second-class clusters has less impact than the number of first-class clusters
Figure 12
Comparison of diabetes prediction models in terms of AAC performance criteria
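
The caption of Figure 5 refers to filling in missing data with the K-nearest-neighbor algorithm. A minimal Python sketch of one common form of that idea follows; the function name `knn_impute`, the choice of `k`, the Euclidean distance over observed features, and the toy rows are all assumptions for illustration and may differ from what the authors did.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one complete row is required")

    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def dist(other):
            # Distance uses only the features observed in the incomplete row.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ])
    return filled


# Made-up rows of three numeric attributes; None marks a missing value.
rows = [
    [50.0, 4.9, 1.0],
    [52.0, 5.1, 1.2],
    [30.0, 9.0, 3.0],
    [51.0, None, 1.1],
]
imputed = knn_impute(rows, k=2)
print(imputed[3])
```

In practice the attributes would usually be scaled before computing distances, since attributes measured on larger scales otherwise dominate the neighbor search.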