Addressing Selection Biases within Electronic Health Record Data for Estimation of Diabetes Prevalence among New York City Young Adults: A Cross-Sectional Study
- PMID: 39568629
- PMCID: PMC11578099
- DOI: 10.1136/bmjph-2024-001666
Addressing Selection Biases within Electronic Health Record Data for Estimation of Diabetes Prevalence among New York City Young Adults: A Cross-Sectional Study
Abstract
Introduction: There is growing interest in using electronic health records (EHRs) for chronic disease surveillance. However, these data are convenience samples of in-care individuals, which are not representative of target populations for public health surveillance, generally defined, for the relevant period, as resident populations within city, state, or other jurisdictions. We focus on using EHR data for estimation of diabetes prevalence among young adults in New York City, as rising diabetes burden in younger ages call for better surveillance capacity.
Methods: This article applies common nonprobability sampling methods, including raking, post-stratification, and multilevel regression with post-stratification, to real and simulated data for the cross-sectional estimation of diabetes prevalence among those aged 18-44 years. Within real data analyses, we externally validate city- and neighborhood-level EHR-based estimates to gold-standard estimates from a local health survey. Within data simulations, we probe the extent to which residual biases remain when selection into the EHR sample is non-ignorable.
Results: Within the real data analyses, these methods reduced the impact of selection biases in the citywide prevalence estimate compared to gold standard. Residual biases remained at the neighborhood-level, where prevalence tended to be overestimated, especially in neighborhoods where a higher proportion of residents were captured in the sample. Simulation results demonstrated these methods may be sufficient, except when selection into the EHR is non-ignorable, depending on unmeasured factors or on diabetes status.
Conclusions: While EHRs offer potential to innovate on chronic disease surveillance, care is needed when estimating prevalence for small geographies or when selection is non-ignorable.
Keywords: diabetes mellitus; electronic health records; prevalence; selection bias; surveillance.
Conflict of interest statement
Competing Interests: The authors declare no competing interests.
Figures



Similar articles
-
The effect of sample site and collection procedure on identification of SARS-CoV-2 infection.Cochrane Database Syst Rev. 2024 Dec 16;12(12):CD014780. doi: 10.1002/14651858.CD014780. Cochrane Database Syst Rev. 2024. PMID: 39679851 Free PMC article.
-
Behavioral interventions to reduce risk for sexual transmission of HIV among men who have sex with men.Cochrane Database Syst Rev. 2008 Jul 16;(3):CD001230. doi: 10.1002/14651858.CD001230.pub2. Cochrane Database Syst Rev. 2008. PMID: 18646068
-
Interventions targeted at women to encourage the uptake of cervical screening.Cochrane Database Syst Rev. 2021 Sep 6;9(9):CD002834. doi: 10.1002/14651858.CD002834.pub3. Cochrane Database Syst Rev. 2021. PMID: 34694000 Free PMC article.
-
Drugs for preventing postoperative nausea and vomiting in adults after general anaesthesia: a network meta-analysis.Cochrane Database Syst Rev. 2020 Oct 19;10(10):CD012859. doi: 10.1002/14651858.CD012859.pub2. Cochrane Database Syst Rev. 2020. PMID: 33075160 Free PMC article.
-
Incentives for preventing smoking in children and adolescents.Cochrane Database Syst Rev. 2017 Jun 6;6(6):CD008645. doi: 10.1002/14651858.CD008645.pub3. Cochrane Database Syst Rev. 2017. PMID: 28585288 Free PMC article.
Cited by
-
A practical guide for nephrologist peer reviewers: evaluating artificial intelligence and machine learning research in nephrology.Ren Fail. 2025 Dec;47(1):2513002. doi: 10.1080/0886022X.2025.2513002. Epub 2025 Jul 7. Ren Fail. 2025. PMID: 40620096 Free PMC article.
References
Grants and funding
LinkOut - more resources
Full Text Sources
Miscellaneous