J Am Med Inform Assoc. 2020 Mar 1;27(3):396-406.

doi: 10.1093/jamia/ocz204.

A combined strategy of feature selection and machine learning to identify predictors of prediabetes

Kushan De Silva¹ ², Daniel Jönsson³, Ryan T Demmer⁴

Affiliations

¹ Department of Clinical Sciences, Faculty of Medicine, Lund University, Lund, Sweden.
² Department of General Practice, School of Primary and Allied Health Care, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Notting Hill, Australia.
³ Department of Periodontology, Malmö University, Malmö and Swedish Dental Service of Skane, Lund, Sweden.
⁴ Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, Minnesota, USA.

PMID: 31889178
PMCID: PMC7647289
DOI: 10.1093/jamia/ocz204

Kushan De Silva et al. J Am Med Inform Assoc. 2020.

Abstract

Objective: To identify predictors of prediabetes using feature selection and machine learning on a nationally representative sample of the US population.

Materials and methods: We analyzed n = 6346 men and women enrolled in the National Health and Nutrition Examination Survey 2013-2014. Prediabetes was defined using American Diabetes Association guidelines. The sample was randomly partitioned into training (n = 3174) and internal validation (n = 3172) sets. Feature selection algorithms were run on training data containing 156 preselected exposure variables. Four machine learning algorithms were applied to 46 exposure variables in the original training dataset and in resampled training datasets built using 4 resampling methods. Predictive models were tested on internal validation data (n = 3172) and external validation data (n = 3000) prepared from National Health and Nutrition Examination Survey 2011-2012. Model performance was evaluated using area under the receiver operating characteristic curve (AUROC). Predictors were assessed by odds ratios in logistic models and by variable importance in the other models. The Centers for Disease Control (CDC) prediabetes screening tool was the benchmark against which model performance was compared.
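
As a rough illustration of two steps named above, the random partition and the AUROC metric, the following Python sketch may help. It is not the authors' code: the function names `partition` and `auroc`, the seed, and the toy scores are assumptions made for illustration only. AUROC is computed here from its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
import random


def partition(n, n_train, seed=0):
    """Randomly split row indices 0..n-1 into a training and a validation list.

    Hypothetical helper: the abstract only states that the split was random.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 0/1 outcomes; scores: model outputs, higher = more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

With the sample sizes reported above, `partition(6346, 3174)` yields index lists of length 3174 and 3172. The function returns AUROC on a 0 to 1 scale; the abstract reports it as a percentage, so multiply by 100 to compare. The pairwise loop is quadratic and chosen for readability rather than speed.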

Results: Prediabetes prevalence was 23.43%. The CDC prediabetes screening tool produced 64.40% AUROC. Seven optimal (≥ 70% AUROC) models identified 25 predictors including 4 potentially novel associations; 20 by both logistic and other nonlinear/ensemble models and 5 solely by the latter. All optimal models outperformed the CDC prediabetes screening tool (P < 0.05).

Discussion: Combined use of feature selection and machine learning increased predictive performance, outperforming the recommended screening tool. A range of predictors of prediabetes was identified.

Conclusion: This work demonstrated the value of combining feature selection with machine learning to identify a wide range of predictors that could enhance prediabetes prediction and clinical decision-making.

Keywords: NHANES; feature selection; machine learning; prediabetes; predictors.

Figures

**Figure 1.**
Feature selection using Boruta algorithm: Variable importance plot. Default functions of the “Boruta” R package were used; feature importance measure = mean decrease accuracy, maximal number of random forest runs = 100. Red, yellow, green, and blue boxplots represent Z scores of rejected, tentative, confirmed and shadow attributes respectively. Shadow (minimum, mean, and maximum) features are reference points for deciding which attributes are truly important and these values are generated by the algorithm via shuffling values of the original attributes. Variables extracted from the 20 confirmed and the 10 tentative features selected by the “Boruta” algorithm are given in Table 1. (shadowMin=Minimum shadow score, cpk=creatine phosphokinase, psoriasis=diagnosed psoriasis, milk=milk consumption, mi=diagnosed heart attack, hepc=hepatitis C, basop=basophil count, copd=diagnosed chronic obstructive pulmonary disease, ocp=oral contraceptive use, ldh=lactate dehydrogenase, healthdev=self-rated health trend, wbc=white cell count, citizen=citizenship status, gdm=gestational diabetes, cuttingsalt=reducing salt intake, edu=education, rbc=red cell count, armc=arm circumference)
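
The shadow-attribute idea in this caption can be sketched in a few lines of Python. This is a simplified, single-pass illustration and not the "Boruta" R package: the importance score is an assumed stand-in (absolute Pearson correlation with the outcome) in place of the random forest mean decrease accuracy used in the study, and the names `abs_corr` and `shadow_screen` are hypothetical. Each attribute is shuffled to create a shadow copy whose link to the outcome is broken, and a real attribute is flagged only if its score exceeds the best score achieved by any shadow.

```python
import random
from statistics import mean, pstdev


def abs_corr(xs, ys):
    """Stand-in importance score: absolute Pearson correlation of xs with ys."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(cov / (sx * sy))


def shadow_screen(columns, outcome, seed=0):
    """One pass of a shadow-attribute comparison.

    columns: dict mapping attribute name -> list of numeric values.
    Returns (flags, shadow_max), where flags[name] is True when the real
    attribute scores above the maximum shadow score.
    """
    rng = random.Random(seed)
    real_scores = {name: abs_corr(vals, outcome) for name, vals in columns.items()}
    shadow_scores = []
    for vals in columns.values():
        shuffled = list(vals)
        rng.shuffle(shuffled)  # shuffling breaks any link to the outcome
        shadow_scores.append(abs_corr(shuffled, outcome))
    shadow_max = max(shadow_scores)
    flags = {name: score > shadow_max for name, score in real_scores.items()}
    return flags, shadow_max
```

The actual algorithm repeats this comparison over many random forest runs (up to 100 in the caption) and applies a statistical test to label attributes as confirmed, tentative, or rejected; the sketch shows only the single comparison against the maximum shadow score.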

**Figure 2.**
Feature selection using recursive feature elimination. A random forest classifier with two-fold cross-validation was specified with other default functions of the “caret” package in R to extract features via recursive feature elimination. Variables extracted from the 30 most important features selected by the recursive feature elimination algorithm are given in Table 1.

Cited by

Hyperglycemia screening based on survey data: an international instrument based on WHO STEPs dataset.
Moradifar P, Amini H, Amiri MM. BMC Endocr Disord. 2022 Dec 14;22(1):316. doi: 10.1186/s12902-022-01222-0. PMID: 36514025 Free PMC article.
Machine learning for diabetes clinical decision support: a review.
Tuppad A, Patil SD. Adv Comput Intell. 2022;2(2):22. doi: 10.1007/s43674-022-00034-y. Epub 2022 Apr 13. PMID: 35434723 Free PMC article. Review.
Machine Learning-Based Prediction of Binge Drinking among Adults in the United State: Analysis of the 2022 Health Information National Trends Survey.
Huang X, Dai Z, Wang K, Luo X. Proc 2024 9th Int Conf Math Artif Intell (2024). 2024 May;2024:1-10. doi: 10.1145/3670085.3670090. Epub 2024 Aug 22. PMID: 39834720 Free PMC article.
Environmental exposures in machine learning and data mining approaches to diabetes etiology: A scoping review.
Mistry S, Riches NO, Gouripeddi R, Facelli JC. Artif Intell Med. 2023 Jan;135:102461. doi: 10.1016/j.artmed.2022.102461. Epub 2022 Nov 30. PMID: 36628796 Free PMC article.
Machine Learning Applications in Endocrinology and Metabolism Research: An Overview.
Hong N, Park H, Rhee Y. Endocrinol Metab (Seoul). 2020 Mar;35(1):71-84. doi: 10.3803/EnM.2020.35.1.71. PMID: 32207266 Free PMC article. Review.
