[Preprint]. 2025 Apr 6:2025.04.04.25325131.
doi: 10.1101/2025.04.04.25325131.

Inclusion bias affects common variant discovery and replication in a health-system linked biobank

Aditya Pimplaskar et al. medRxiv.

Abstract

Electronic health record (EHR)-linked biobanks have emerged as promising tools for precision medicine, enabling the integration of clinical and molecular data for individual risk assessment. Association studies performed in biobanks can connect common genetic variation to clinical phenotypes, for example through polygenic scores (PGS), which are starting to aid clinician decision-making. However, while biobanks aggregate large amounts of data effectively for such studies, most employ opt-in consent protocols and, as a result, are expected to be subject to participation and recruitment biases. The extent to which these biases affect genetic analyses in biobanks remains unstudied. In this study, we quantify bias and evaluate its impact on genetic analyses, using the UCLA ATLAS Community Health Initiative as a case study. Our analyses reveal that a wide array of factors, particularly socio-demographic characteristics and healthcare utilization patterns, influence participation, effectively differentiating biobank participants from the broader patient population (AUROC = 0.85, AUPRC = 0.82). By weighting the sample using inverse probability weights derived from probabilities of enrollment, we replicated 54% more known GWAS variants (e.g., associations between variants in the PPARG gene and type 2 diabetes) than models that did not take bias into account. We further show that PGS phenome-wide associations are affected by the weighting scheme, and suggest that associations corroborated by weighted analyses are more robust. Our results highlight that genetic analyses within biobanks should account for inclusion biases, and suggest inverse probability weighting as a potential approach.
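
The general idea of inverse probability weighting mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that each enrolled participant already has an estimated probability of enrollment (however obtained), and the function names, the clipping bounds, and the toy data are all hypothetical choices made for illustration only.

```python
from typing import List, Sequence, Tuple


def inverse_probability_weights(
    p_enroll: Sequence[float],
    clip: Tuple[float, float] = (0.01, 0.99),  # assumed bounds, not from the paper
) -> List[float]:
    """Return 1 / p for each enrolled individual, with p clipped to avoid huge weights."""
    lo, hi = clip
    if not 0.0 < lo <= hi <= 1.0:
        raise ValueError("clip bounds must satisfy 0 < lo <= hi <= 1")
    return [1.0 / min(max(p, lo), hi) for p in p_enroll]


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of values; weights need not sum to one."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy example (fabricated for illustration): a patient population with two
# equally sized groups of 100. Group A enrolls with probability 0.8 and always
# has the phenotype; group B enrolls with probability 0.2 and never has it.
# The population prevalence is therefore 0.5, but the enrolled sample
# (80 from A, 20 from B) over-represents group A.
p_enroll = [0.8] * 80 + [0.2] * 20
phenotype = [1.0] * 80 + [0.0] * 20

weights = inverse_probability_weights(p_enroll)
unweighted = sum(phenotype) / len(phenotype)
weighted = weighted_mean(phenotype, weights)

print(f"unweighted prevalence: {unweighted:.3f}")
print(f"weighted prevalence:   {weighted:.3f}")
```

In this toy setting the unweighted prevalence is about 0.8, while the weighted estimate is about 0.5, matching the population value, because under-enrolled individuals are up-weighted in proportion to how unlikely they were to enroll. The same weights could in principle be passed to a weighted regression in an association test, though how the study did so is not described in the abstract.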

Figures

Figure 1: Cohort characteristics of the UCLA Health sample and ATLAS subsample.
(Left) Feature-level distributions stratified by ATLAS enrollment (yellow = not enrolled in ATLAS, blue = enrolled in ATLAS). For quantitative variables, feature means are depicted; for categorical features, the proportion of individuals in each group belonging to the feature category is depicted. For diagnostic chapters, any diagnosis in the chapter is sufficient. (Middle/right) Results from univariate associations with ATLAS enrollment: odds ratio (middle) and Cox & Snell's pseudo-R² (right).
Figure 2: Probability distributions stratified by enrollment and feature characteristics from a multivariate random forest classifier of ATLAS enrollment.
(Left) Predicted probability distributions stratified by true ATLAS enrollment status (yellow = not enrolled in ATLAS, blue = enrolled in ATLAS) show strong separation between the classes. (Right) A beeswarm plot of Shapley values on a subset of 100 individuals reveals healthcare utilization patterns and select ICD-10 diagnoses as important predictors of ATLAS enrollment.
Figure 3: Comparison of weighting schemes in phenome-wide associations with PGS.
Miami plots for PheWAS on PGS for MDD and BMI (top: unweighted, bottom: weighted) show shared and unique associations under the unweighted and weighted models.
