Review
Front Genet. 2021 Jun 15:12:682638.
doi: 10.3389/fgene.2021.682638. eCollection 2021.

Scalable and Robust Regression Methods for Phenome-Wide Association Analysis on Large-Scale Biobank Data

Wenjian Bi et al. Front Genet.

Abstract

With advances in genotyping technologies and electronic health records (EHRs), large biobanks have become valuable resources for identifying novel genetic associations and gene-environment interactions on a genome-wide and even a phenome-wide scale. To date, several phenome-wide association studies (PheWAS) have been performed on biobank data, providing comprehensive insights into many aspects of human genetics and biology. Although these results are inspiring, PheWAS on large-scale biobank data encounters new challenges, including computational burden, unbalanced phenotypic distributions, and genetic relatedness. In this paper, we first discuss these challenges and their potential impact on data analysis. We then summarize approaches that are scalable and robust in genome-wide association studies (GWAS) and PheWAS. This review can serve as a practical guide for geneticists, epidemiologists, and other medical researchers seeking to identify genetic variants associated with health-related phenotypes in large-scale biobank data. It can also help statisticians gain a comprehensive, up-to-date understanding of current technical tool development.

Keywords: biobank data analysis; electronic health records (EHR); genetic relatedness; mixed model approaches; phenome-wide association studies; saddlepoint approximation; unbalanced phenotypic distribution.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
PheWAS computation time. Computation time was evaluated on a CPU core of an Intel i7-7700T (2.90 GHz) and then projected to a phenome-wide association study of 100 balanced binary phenotypes and 10 million variants.
FIGURE 2
Flowchart of score test association analysis.
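
The projection described in the Figure 1 caption can be illustrated with a short calculation. The sketch below assumes the simplest possible model, in which total time scales linearly with the number of phenotype-variant tests; the caption does not state how the projection was actually done, so this is an illustration and not the authors' procedure. The function name `project_phewas_cpu_hours` and the per-test timing used in the example are hypothetical.

```python
def project_phewas_cpu_hours(seconds_per_test, n_phenotypes=100, n_variants=10_000_000):
    """Scale a measured per-test time to a full PheWAS.

    Assumption (not from the source): one association test per
    phenotype-variant pair, with cost growing linearly in the number of tests.
    The defaults match the scale quoted in the Figure 1 caption.
    """
    if seconds_per_test < 0:
        raise ValueError("seconds_per_test must be non-negative")
    total_tests = n_phenotypes * n_variants
    return seconds_per_test * total_tests / 3600.0


if __name__ == "__main__":
    # Hypothetical timing: 0.1 ms per test on one CPU core.
    hours = project_phewas_cpu_hours(1e-4)
    print(f"Projected single-core time: {hours:.1f} CPU hours")
```

At this scale the analysis involves one billion tests, so even small per-test costs add up to many CPU hours, which is why the abstract lists computational burden among the main challenges.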
