This is a preprint.
Inclusion bias affects common variant discovery and replication in a health-system linked biobank
- PMID: 40236437
- PMCID: PMC11998835
- DOI: 10.1101/2025.04.04.25325131
Abstract
Electronic health record (EHR)-linked biobanks have emerged as promising tools for precision medicine, enabling the integration of clinical and molecular data for individual risk assessment. Association studies performed in biobanks can connect common genetic variation to clinical phenotypes, for example through polygenic scores (PGS), which are beginning to show utility in aiding clinical decision making. However, while biobanks aggregate large amounts of data effectively for such studies, most employ opt-in consent protocols and, as a result, are expected to be subject to participation and recruitment biases. The extent to which these biases affect genetic analyses in biobanks remains unstudied. In this study, we quantify bias and evaluate its impact on genetic analyses, using the UCLA ATLAS Community Health Initiative as a case study. Our analyses reveal that a wide array of factors, particularly socio-demographic characteristics and healthcare utilization patterns, influence participation and effectively differentiate biobank participants from the broader patient population (AUROC = 0.85, AUPRC = 0.82). By weighting the sample with inverse probability weights derived from probabilities of enrollment, we replicated 54% more known GWAS variants than with models that did not take bias into account (e.g., associations between variants in the PPARG gene and type 2 diabetes). We further show that PGS phenome-wide associations are affected by the weighting scheme, and suggest that associations corroborated by weighted analyses are more robust. Our results highlight that genetic analyses within biobanks should account for inclusion biases, and suggest inverse probability weighting as a potential approach.
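The idea behind inverse probability weighting can be illustrated with a minimal sketch. This is not the study's pipeline: the enrollment probabilities are assumed to be already known, whereas in practice they would come from a fitted participation model. The toy numbers and the helper names `inverse_probability_weights` and `weighted_mean` are hypothetical. Each enrolled participant is weighted by the reciprocal of their enrollment probability, so that under-enrolled groups count for more.

```python
def inverse_probability_weights(p_enroll, floor=0.01):
    """Weight each enrolled participant by 1 / P(enrolled).

    Probabilities are floored (an assumed safeguard, not from the source)
    so that very unlikely participants do not receive extreme weights.
    """
    return [1.0 / max(p, floor) for p in p_enroll]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical patient population of 1000, split evenly into two strata:
#   stratum A: enrollment probability 0.8, condition prevalence 0.40
#   stratum B: enrollment probability 0.2, condition prevalence 0.10
# True population prevalence is 0.5 * 0.40 + 0.5 * 0.10 = 0.25.
# The enrolled sample therefore holds 400 people from A (160 cases)
# and 100 people from B (10 cases).
p_enroll = [0.8] * 400 + [0.2] * 100
has_condition = [1] * 160 + [0] * 240 + [1] * 10 + [0] * 90

weights = inverse_probability_weights(p_enroll)
naive = sum(has_condition) / len(has_condition)
weighted = weighted_mean(has_condition, weights)

print(round(naive, 3), round(weighted, 3))  # prints: 0.34 0.25
```

In this toy case the unweighted estimate is pulled toward the over-enrolled stratum, while the weighted estimate recovers the population prevalence. The same weights could in principle be passed to a weighted regression for association testing.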