Front Epidemiol. 2022 Jun 2:2:871630.
doi: 10.3389/fepid.2022.871630. eCollection 2022.

Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data: The Impact of Data Preparation Choices on Model Performance

Hendrikus J A van Os et al. Front Epidemiol. 2022.

Abstract

Objective: To quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR).

Study design and setting: Cox proportional hazards models were developed for predicting the first-ever main adverse cardiovascular events using Dutch primary care EHR data. The reference model was based on a 1-year run-in period, cardiovascular events were defined based on both EHR diagnosis and medication codes, and missing values were multiply imputed. We compared data preparation choices based on (i) length of the run-in period (2- or 3-year run-in); (ii) outcome definition (EHR diagnosis codes or medication codes only); and (iii) methods addressing missing values (mean imputation or complete case analysis) by making variations on the derivation set and testing their impact in a validation set.
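The two simpler missing-value strategies compared above can be illustrated with a minimal sketch. The records, predictor names (`age`, `sbp`, `chol`), and helper functions below are invented for illustration and are not taken from the study; multiple imputation, the reference method, is not shown.

```python
from statistics import mean

# Toy records; None marks a missing measurement. All values are invented.
records = [
    {"age": 54, "sbp": 132.0, "chol": 5.1},
    {"age": 61, "sbp": None, "chol": 6.0},
    {"age": 47, "sbp": 120.0, "chol": None},
    {"age": 70, "sbp": 150.0, "chol": 5.5},
]


def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows):
    """Mean imputation: replace each missing value by the observed column mean."""
    keys = list(rows[0].keys())
    means = {k: mean(r[k] for r in rows if r[k] is not None) for k in keys}
    return [
        {k: (r[k] if r[k] is not None else means[k]) for k in keys}
        for r in rows
    ]


cc = complete_cases(records)
mi = mean_impute(records)
print(len(cc), len(mi))  # complete cases drop rows; mean imputation keeps all
print(mi[1]["sbp"])      # imputed with the mean of the observed sbp values
```

Complete case analysis shrinks the derivation set to patients with every predictor recorded, whereas mean imputation keeps all patients but assigns them an average value.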

Results: We included 89,491 patients in whom 6,736 first-ever main adverse cardiovascular events occurred during a median follow-up of 8 years. Outcome definition based only on diagnosis codes led to a systematic underestimation of risk (calibration curve intercept: 0.84; 95% CI: 0.83 to 0.84), while complete case analysis led to overestimation (calibration curve intercept: -0.52; 95% CI: -0.53 to -0.51). Differences in the length of the run-in period showed no relevant impact on calibration and discrimination.
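To see why a positive intercept corresponds to underestimation and a negative one to overestimation, a simplified sketch may help. It assumes a binary outcome at a fixed time horizon and estimates the intercept a in logit(P(y = 1)) = a + logit(p), with the slope fixed at 1. This is a common way to define a calibration intercept, but the abstract does not state how the authors computed theirs for the Cox models, so the function and toy data below are illustrative assumptions only.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def calibration_intercept(y, p, iters=50, tol=1e-10):
    """Maximum-likelihood estimate of a in logit(P(y=1)) = a + logit(p).

    Uses Newton's method on the one-parameter logistic log-likelihood,
    with logit(p) entering as a fixed offset.
    """
    offsets = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(iters):
        fitted = [expit(a + o) for o in offsets]
        grad = sum(yi - fi for yi, fi in zip(y, fitted))
        hess = sum(fi * (1.0 - fi) for fi in fitted)
        step = grad / hess
        a += step
        if abs(step) < tol:
            break
    return a


# Toy validation data: every patient is predicted a 10% risk.
predicted = [0.10] * 100

# 20 observed events: the model underestimates risk, so a > 0.
under = calibration_intercept([1] * 20 + [0] * 80, predicted)

# 5 observed events: the model overestimates risk, so a < 0.
over = calibration_intercept([1] * 5 + [0] * 95, predicted)

print(round(under, 3), round(over, 3))
```

With a constant predicted risk, the intercept reduces to the difference between the logit of the observed event rate and the logit of the predicted risk, which makes the sign easy to interpret.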

Conclusion: Data preparation choices regarding outcome definition or methods to address missing values can have a substantial impact on the calibration of predictions, hampering reliable clinical decision support. This study further illustrates the urgency of transparent reporting of modeling choices in an EHR data setting.

Keywords: clinical prediction model; data preparation; electronic health records (EHRs); model performance; model transportability; prediction model.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Graphic display of the study design. *Derivation sets (nine in total: one reference and eight variations) were derived from our original data set, with data preparation steps based on the predefined data preparation challenges.
Figure 2
Visualization of data density in Dutch primary care EHR (n = 89,491). This figure shows the data density in the EHR for the first year of follow-up of all included patients. The x-axis is divided into three different predictor groups: diagnoses (any type of ICPC registration), medications (any type of ATC registration), and laboratory or vital parameter measurements (any type of registration), with each dot representing an EHR registration data point. The y-axis represents the entire research population ranked from patients with most data points and descending.
Figure 3
Venn diagram with three different operationalizations for the outcome definition. This Venn diagram shows the numbers of first-ever main adverse cardiovascular event cases resulting from the different outcome definitions: ICPC only (brown; 4,505 cases), ICPC and ATC codes for event-specific medication (clopidogrel, ticagrelor, and dipyridamole) including acetylsalicylic acid (red; 4,505 + 2,231 cases) and ICPC and ATC codes for event-specific medication, excluding acetylsalicylic acid (brown + green; 4,505 + 160 cases).
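The case counts in this caption can be checked against the Results with a few lines of arithmetic. The variable names below are illustrative; the numbers come from the caption and the abstract.

```python
# Counts taken from the Figure 3 caption and the Results paragraph.
icpc_only = 4505          # cases identified by ICPC diagnosis codes alone
extra_with_asa = 2231     # additional cases when acetylsalicylic acid is included
extra_without_asa = 160   # additional cases when acetylsalicylic acid is excluded
n_patients = 89491

broad = icpc_only + extra_with_asa      # ICPC + ATC, including acetylsalicylic acid
narrow = icpc_only + extra_without_asa  # ICPC + ATC, excluding acetylsalicylic acid

print(broad, narrow)
print(round(100 * broad / n_patients, 1), round(100 * icpc_only / n_patients, 1))
```

The broad definition gives 6,736 cases, the same number of events reported in the Results, while the diagnosis-only definition captures about two thirds of them.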
