J Diabetes Sci Technol. 2015 Dec 20;10(1):6-18.
doi: 10.1177/1932296815620200.

Reverse Engineering and Evaluation of Prediction Models for Progression to Type 2 Diabetes: An Application of Machine Learning Using Electronic Health Records

Jeffrey P Anderson et al. J Diabetes Sci Technol.

Abstract

Background: Application of novel machine learning approaches to electronic health record (EHR) data could provide valuable insights into disease processes. We utilized this approach to build predictive models for progression to prediabetes and type 2 diabetes (T2D).

Methods: Using a novel analytical platform (Reverse Engineering and Forward Simulation [REFS]), we built prediction model ensembles for progression to prediabetes or T2D from an aggregated EHR data sample. REFS relies on a Bayesian scoring algorithm to explore a wide model space, and outputs a distribution of risk estimates from an ensemble of prediction models. We retrospectively followed 24 331 adults for transitions to prediabetes or T2D, 2007-2012. Accuracy of prediction models was assessed using an area under the curve (AUC) statistic, and validated in an independent data set.
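As a minimal illustration of the evaluation step described in the Methods, the sketch below averages each patient's risk estimates across an ensemble and scores the result with an AUC computed as a rank statistic. It is not the REFS platform or the authors' code: the function names, the use of the mean to summarize the ensemble, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean


def auc(scores, labels):
    """AUC as the probability that a randomly chosen case that progressed
    receives a higher risk score than one that did not (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ensemble_mean_risk(per_model_risks):
    """Collapse each patient's distribution of model-level risk estimates
    to its mean (an assumed summary; other summaries are possible)."""
    return [mean(patient) for patient in zip(*per_model_risks)]


# Hypothetical toy data: 3 models (rows) x 5 patients (columns).
per_model_risks = [
    [0.10, 0.60, 0.45, 0.80, 0.20],
    [0.15, 0.55, 0.55, 0.70, 0.25],
    [0.05, 0.65, 0.50, 0.90, 0.15],
]
progressed = [0, 0, 1, 1, 0]  # 1 = progressed to T2D during follow-up

risk = ensemble_mean_risk(per_model_risks)
print(f"AUC = {auc(risk, progressed):.2f}")
```

Validating out of sample, as the study reports, would mean calling `auc` on patients who were not used to fit the models.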

Results: Our primary ensemble of models accurately predicted progression to T2D (AUC = 0.76), and was validated out of sample (AUC = 0.78). Models of progression to T2D consisted primarily of established risk factors (blood glucose, blood pressure, triglycerides, hypertension, lipid disorders, socioeconomic factors), whereas models of progression to prediabetes included novel factors (high-density lipoprotein, alanine aminotransferase, C-reactive protein, body temperature; AUC = 0.70).

Conclusions: We constructed accurate prediction models from EHR data using a hypothesis-free machine learning approach. Identification of established risk factors for T2D serves as proof of concept for this analytical approach, while novel factors selected by REFS represent emerging areas of T2D research. This methodology has potentially valuable downstream applications to personalized medicine and clinical research.

Keywords: diabetes mellitus, type 2; disease progression; electronic health records; medical informatics; prediabetic state.

Conflict of interest statement

Declaration of Conflicting Interests: The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: All authors were employed by either GNS Healthcare or Pfizer at the time this research was conducted.

Figures

Figure 1.
Flow diagram describing restriction criteria for analytical study population applied to Humedica electronic health records data sample, 2007-2012. BG, blood glucose; IDN, integrated delivery network; UACR, urinary albumin to creatinine ratio.
Figure 2.
Kaplan–Meier plots for time to T2D by selected (potentially modifiable) patient factors: baseline blood glucose measures, triglycerides, systolic blood pressure, and history of lipid disorders, Humedica electronic health records data sample, 2007-2012.
Figure 3.
Receiver operating characteristic curves for accuracy of the REFS ensemble in predicting progression to diabetes (from normoglycemia) in the training and testing data sets, Humedica electronic health records data sample, 2007-2012.
Figure 4.
Individual 3.5-year risk of diabetes for 2 selected patients, Humedica electronic health records data sample, 2007-2012.
Figure 5.
Kaplan–Meier plots for time to prediabetes by selected patient factors, Humedica electronic health records data sample, 2007-2012.
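
Figures 2 and 5 are Kaplan–Meier plots. For readers unfamiliar with the estimator, the following is a minimal sketch of how such a curve can be computed from follow-up times and event indicators. The function name and the toy data are hypothetical and are not taken from the study.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time for each patient
    events : 1 if the outcome was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # outcomes observed at time t
        leaving = 0   # patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy data: years of follow-up and whether the outcome occurred.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1])
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Stratified plots like those in the figures would apply the same estimator separately to each patient subgroup.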

