J Pers Med. 2024 Jul 29;14(8):804.
doi: 10.3390/jpm14080804.

PyCaret for Predicting Type 2 Diabetes: A Phenotype- and Gender-Based Approach with the "Nurses' Health Study" and the "Health Professionals' Follow-Up Study" Datasets

Sebnem Gul et al. J Pers Med.

Abstract

Predicting type 2 diabetes mellitus (T2DM) from phenotypic data with machine learning (ML) techniques has received significant attention in recent years. PyCaret, a low-code automated ML tool that enables the simultaneous application of 16 different algorithms, was used to predict T2DM from phenotypic variables in the "Nurses' Health Study" and "Health Professionals' Follow-up Study" datasets. Ridge Classifier, Linear Discriminant Analysis, and Logistic Regression (LR) were the best-performing models for the male-only data subset. For the female-only data subset, LR, Gradient Boosting Classifier, and CatBoost Classifier were the strongest models. The AUC, accuracy, and precision were approximately 0.77, 0.70, and 0.70 for males and 0.79, 0.70, and 0.71 for females, respectively. The feature importance plot showed that family history of diabetes (famdb), never having smoked, and high blood pressure (hbp) were the most influential features in females, while famdb, hbp, and currently being a smoker were the major variables in males. In conclusion, PyCaret was used successfully to predict T2DM while simplifying complex ML tasks. Gender differences are important to consider in T2DM prediction. Even with this comprehensive ML tool, phenotypic variables alone may not be sufficient for early T2DM prediction; future studies could combine them with genotypic variables.

Keywords: PyCaret; SHAP value; feature importance plot; machine learning; prediction; type 2 diabetes mellitus.
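
The abstract summarizes model performance with AUC, accuracy, and precision. As a minimal sketch of what those three numbers measure, the plain-Python functions below compute them for a binary classifier. The labels and scores are invented toy values, not data from the study, and the 0.5 decision threshold is an assumption; the function names are hypothetical and this is not the authors' PyCaret pipeline.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(y_true, y_score):
    """Probability that a random positive case is scored above a random
    negative case, counting ties as one half (pairwise definition of AUC)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (invented values): 1 = T2DM case, 0 = non-case.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]
# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print("accuracy :", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
print("AUC      :", roc_auc(y_true, y_score))
```

Accuracy and precision depend on the chosen threshold, while AUC is computed from the raw scores and does not, which is one reason a model can have an AUC near 0.8 alongside an accuracy near 0.7.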

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1. Process flow diagram.
Figure 2. Learning curve for total (male + female) dataset.
Figure 3. Feature importance plot for female-only data subset.
Figure 4. Feature importance plot for male-only data subset.
Figure 5. Feature importance plot for total (female + male) dataset.
Figure 6. SHAP values for female-only data subset.
Figure 7. SHAP values for male-only data subset.
Figure 8. SHAP values for total (female + male) dataset.
