BMJ Public Health. 2024;2(2):e001666.

doi: 10.1136/bmjph-2024-001666.

Addressing Selection Biases within Electronic Health Record Data for Estimation of Diabetes Prevalence among New York City Young Adults: A Cross-Sectional Study

Sarah Conderino¹, Lorna E Thorpe¹, Jasmin Divers¹,², Sandra S Albrecht³, Shannon M Farley⁴, David C Lee¹, Rebecca Anthopolos¹

Affiliations

¹ Department of Population Health, NYU Grossman School of Medicine, New York, NY, USA.
² Department of Foundations of Medicine, NYU Long Island School of Medicine, Mineola, NY, USA.
³ Department of Epidemiology, Mailman School of Public Health at Columbia University, New York, NY, USA.
⁴ ICAP at Columbia University, New York City, New York, USA.

PMID: 39568629
PMCID: PMC11578099
DOI: 10.1136/bmjph-2024-001666

Sarah Conderino et al. BMJ Public Health. 2024.

Abstract

Introduction: There is growing interest in using electronic health records (EHRs) for chronic disease surveillance. However, these data are convenience samples of in-care individuals and are not representative of the target populations for public health surveillance, which are generally defined as the resident population of a city, state, or other jurisdiction during the relevant period. We focus on using EHR data to estimate diabetes prevalence among young adults in New York City, as the rising diabetes burden at younger ages calls for better surveillance capacity.

Methods: This article applies common nonprobability sampling methods, including raking, post-stratification, and multilevel regression with post-stratification, to real and simulated data for the cross-sectional estimation of diabetes prevalence among those aged 18-44 years. In the real data analyses, we externally validate city- and neighborhood-level EHR-based estimates against gold-standard estimates from a local health survey. In the data simulations, we probe the extent to which residual biases remain when selection into the EHR sample is non-ignorable.
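To make the simplest of these adjustments concrete, the following is a minimal sketch of post-stratification, not the authors' implementation. It reweights stratum-specific prevalence in a convenience sample by known population counts. The strata, counts, and the function name `poststratify` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def poststratify(sample, population_counts):
    """Post-stratified prevalence estimate.

    sample: iterable of (stratum, has_diabetes) pairs from the convenience sample.
    population_counts: dict mapping stratum -> known population size.
    Strata with no sampled individuals are skipped (an assumption of this sketch).
    """
    cases = defaultdict(int)
    totals = defaultdict(int)
    for stratum, has_dm in sample:
        totals[stratum] += 1
        cases[stratum] += int(has_dm)

    covered = [s for s in population_counts if totals[s] > 0]
    pop_total = sum(population_counts[s] for s in covered)
    # Weight each stratum's sample prevalence by its population share.
    weighted = sum(population_counts[s] * cases[s] / totals[s] for s in covered)
    return weighted / pop_total


# Hypothetical toy data: older adults are over-represented in the sample.
sample = (
    [("18-29", True)] * 1 + [("18-29", False)] * 19
    + [("30-44", True)] * 8 + [("30-44", False)] * 72
)
population_counts = {"18-29": 500, "30-44": 500}  # assumed census-style totals

crude = sum(has_dm for _, has_dm in sample) / len(sample)
adjusted = poststratify(sample, population_counts)
print(f"crude={crude:.3f} post-stratified={adjusted:.3f}")
```

In this toy example the crude sample prevalence is 0.09, while the post-stratified estimate is 0.075, because the higher-prevalence age group makes up 80% of the sample but only half of the assumed population. Raking and multilevel regression with post-stratification build on the same idea but are not shown here.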

Results: In the real data analyses, these methods reduced the impact of selection biases on the citywide prevalence estimate compared with the gold standard. Residual biases remained at the neighborhood level, where prevalence tended to be overestimated, especially in neighborhoods where a higher proportion of residents were captured in the sample. Simulation results demonstrated that these methods may be sufficient, except when selection into the EHR is non-ignorable, that is, when it depends on unmeasured factors or on diabetes status itself.
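The last point can be illustrated with a small expected-value calculation. This is a sketch under stated assumptions, not the authors' simulation: it uses a closed-form calculation instead of repeated random draws, two hypothetical strata, and assumed baseline selection probabilities. Only the odds ratio levels for the association between diabetes and selection (0.33, 0.67, 1.0, 1.5, 3.0) are taken from the study's scenario 2.

```python
def selected_prevalence(prev, base_select_prob, or_dm):
    """Expected diabetes prevalence among those selected into the sample.

    prev: true prevalence in the stratum.
    base_select_prob: selection probability for people without diabetes.
    or_dm: odds ratio linking diabetes status to selection.
    """
    odds_no_dm = base_select_prob / (1 - base_select_prob)
    odds_dm = or_dm * odds_no_dm
    select_dm = odds_dm / (1 + odds_dm)
    selected_cases = prev * select_dm
    selected_noncases = (1 - prev) * base_select_prob
    return selected_cases / (selected_cases + selected_noncases)


def poststratified_estimate(strata, or_dm):
    """Post-stratify the within-sample prevalences to the true stratum sizes."""
    total = sum(s["pop"] for s in strata)
    return sum(
        s["pop"] / total * selected_prevalence(s["prev"], s["select"], or_dm)
        for s in strata
    )


# Hypothetical strata: population size, true prevalence, baseline selection.
strata = [
    {"pop": 600, "prev": 0.02, "select": 0.10},
    {"pop": 400, "prev": 0.06, "select": 0.20},
]
truth = sum(s["pop"] * s["prev"] for s in strata) / sum(s["pop"] for s in strata)

results = {}
for or_dm in (0.33, 0.67, 1.0, 1.5, 3.0):
    estimate = poststratified_estimate(strata, or_dm)
    results[or_dm] = 100 * (estimate - truth) / truth  # relative bias, per cent
    print(f"OR_DM={or_dm}: relative bias {results[or_dm]:+.1f}%")
```

When the odds ratio is 1.0, selection depends only on the stratum and post-stratification recovers the true prevalence. For any other value, the estimate stays biased even though the stratum weights are exactly right, which is the sense in which selection on diabetes status is non-ignorable.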

Conclusions: While EHRs offer potential to innovate on chronic disease surveillance, care is needed when estimating prevalence for small geographies or when selection is non-ignorable.

Keywords: diabetes mellitus; electronic health records; prevalence; selection bias; surveillance.

Conflict of interest statement

Competing Interests: The authors declare no competing interests.

Figures

Figure 1. Simulation study directed acyclic graph with baseline OR associations. Observed diabetes within those selected into the EHR sample; scenario 1 (orange): modified the level of misclassification of the auxiliary variable W compared with the unobserved variable U at levels equivalent to 10%, 30%, 50%, 70% and 90% misclassification; scenario 2 (purple): modified the association between diabetes and selection at OR levels of 0.33, 0.67, 1.0, 1.5 and 3.0. DM, diabetes mellitus; EHR, electronic health record; HIS, Hispanic; NHB, non-Hispanic Black; OTH, Other race.

Figure 2. Characterisation of the NYU Langone patient sample and comparison of NYU EHR-based to gold standard diabetes prevalence estimates for young adults aged 18–44 years by New York City PUMA neighbourhood. (A) Proportion of general population captured within the EHR sample by NYC PUMA, calculated by dividing NYU Langone patient counts by the total NYC PUMA population estimates from the American Community Survey 2019 5-year data, obtained through IPUMS USA. (B) Comparison of NYU EHR-based to gold standard diabetes prevalence estimates. Each point represents a PUMA neighbourhood. EHR estimates are defined using NYU Langone Health 2019 data. The gold standard estimate is defined using NYC CHS 2015–2020 data. (C) Comparison of relative bias in NYU EHR-based prevalence estimates versus proportion of the general population captured within the EHR sample. Relative bias is calculated as the per cent change between the gold standard and EHR-based prevalence estimate for each NYC PUMA neighbourhood. CHS, Community Health Survey; EHR, electronic health record; MLRP, multilevel regression with post-stratification; NYC, New York City; NYU, NYU Langone Health; PUMA, Public Use Microdata Area.
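The relative bias described for panel (C) can be written out explicitly. The caption does not state the sign convention, so the form below, in which positive values indicate overestimation by the EHR, is an assumption; the symbols $\hat{p}^{\mathrm{EHR}}_j$ and $\hat{p}^{\mathrm{GS}}_j$ for the EHR-based and gold standard estimates in PUMA $j$ are introduced here for illustration.

```latex
\text{relative bias}_j \;=\; 100 \times
\frac{\hat{p}^{\mathrm{EHR}}_j - \hat{p}^{\mathrm{GS}}_j}{\hat{p}^{\mathrm{GS}}_j}
```

For example, with hypothetical values of 4.0% for the gold standard and 5.0% for the EHR-based estimate, the relative bias would be $100 \times (5.0 - 4.0) / 4.0 = 25\%$.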

Figure 3. Mean relative bias in the EHR-based estimates versus true diabetes prevalence by simulation scenario. Error bars represent SD in mean relative bias across simulations. (A) Scenario 1 modified the level of misclassification of the auxiliary variable W compared with the unobserved variable U; (B) scenario 2 modified the association between diabetes and selection (OR_DM). EHR, electronic health record; MLRP, multilevel regression with post-stratification.

Cited by

A practical guide for nephrologist peer reviewers: evaluating artificial intelligence and machine learning research in nephrology.
Wang Y, Cheungpasitporn W, Ali H, Qing J, Thongprayoon C, Kaewput W, Soliman KM, Huang Z, Yang M, Zhang Z. Ren Fail. 2025 Dec;47(1):2513002. doi: 10.1080/0886022X.2025.2513002. Epub 2025 Jul 7. PMID: 40620096. Free PMC article.

Grants and funding

U18 DP006510/DP/NCCDPHP CDC HHS/United States
