Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data: The Impact of Data Preparation Choices on Model Performance
- PMID: 38455328
- PMCID: PMC10910909
- DOI: 10.3389/fepid.2022.871630
Abstract
Objective: To quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR).
Study design and setting: Cox proportional hazards models were developed for predicting the first-ever main adverse cardiovascular events using Dutch primary care EHR data. In the reference model, the run-in period was 1 year, cardiovascular events were defined based on both EHR diagnosis and medication codes, and missing values were multiply imputed. We compared data preparation choices regarding (i) length of the run-in period (2- or 3-year run-in); (ii) outcome definition (EHR diagnosis codes or medication codes only); and (iii) methods for addressing missing values (mean imputation or complete case analysis), by varying each choice in the derivation set and testing its impact in a validation set.
Results: We included 89,491 patients in whom 6,736 first-ever main adverse cardiovascular events occurred during a median follow-up of 8 years. Outcome definition based only on diagnosis codes led to a systematic underestimation of risk (calibration curve intercept: 0.84; 95% CI: 0.83 to 0.84), while complete case analysis led to overestimation (calibration curve intercept: -0.52; 95% CI: -0.53 to -0.51). Differences in the length of the run-in period showed no relevant impact on calibration and discrimination.
Conclusion: Data preparation choices regarding outcome definition or methods to address missing values can have a substantial impact on the calibration of predictions, hampering reliable clinical decision support. This study further illustrates the urgency of transparent reporting of modeling choices in an EHR data setting.
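As a rough illustration of how the sign of a calibration intercept relates to under- or overestimation of risk, the sketch below estimates a calibration intercept on simulated data. It is not the authors' analysis: it assumes a binary outcome and a logistic recalibration model with the slope fixed at 1, whereas the study used Cox models on time-to-event data. The function name `calibration_intercept`, the simulated risks, and the shifts of 0.8 and 0.5 on the logit scale are all hypothetical choices made for illustration.

```python
import numpy as np


def calibration_intercept(y, p, n_iter=50, tol=1e-10):
    """Estimate a in logit(P(y=1)) = a + logit(p), with the slope fixed at 1.

    a > 0: observed risk exceeds predicted risk (underestimation).
    a < 0: predicted risk exceeds observed risk (overestimation).
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    offset = np.log(p / (1 - p))  # predicted risk on the logit scale
    a = 0.0
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(a + offset)))
        grad = np.sum(y - mu)          # score of the intercept-only model
        hess = np.sum(mu * (1 - mu))   # observed information
        step = grad / hess             # Newton-Raphson update
        a += step
        if abs(step) < tol:
            break
    return a


# Simulated data (hypothetical): true risks and outcomes drawn from them.
rng = np.random.default_rng(0)
true_logit = rng.normal(-2.5, 1.0, size=200_000)
true_risk = 1 / (1 + np.exp(-true_logit))
y = rng.binomial(1, true_risk)

# A model that predicts too low (shifted down by 0.8 on the logit scale)
# and one that predicts too high (shifted up by 0.5).
p_under = 1 / (1 + np.exp(-(true_logit - 0.8)))
p_over = 1 / (1 + np.exp(-(true_logit + 0.5)))

a_under = calibration_intercept(y, p_under)  # positive, close to +0.8
a_over = calibration_intercept(y, p_over)    # negative, close to -0.5
print(f"underestimating model: intercept = {a_under:.2f}")
print(f"overestimating model:  intercept = {a_over:.2f}")
```

Under these assumptions, a positive intercept corresponds to predictions that are systematically too low and a negative intercept to predictions that are too high, which matches the direction of the intercepts reported in the Results.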
Keywords: clinical prediction model; data preparation; electronic health records (EHRs); model performance; model transportability; prediction model.
Copyright © 2022 van Os, Kanning, Wermer, Chavannes, Numans, Ruigrok, van Zwet, Putter, Steyerberg and Groenwold.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.