Meta-Analysis
Genet Epidemiol. 2019 Jul;43(5):462-476.
doi: 10.1002/gepi.22197. Epub 2019 Feb 22.

Robust meta-analysis of biobank-based genome-wide association studies with unbalanced binary phenotypes

Rounak Dey et al. Genet Epidemiol. 2019 Jul.

Abstract

With the availability of large-scale biobanks, genome-wide, phenome-wide association studies have become instrumental in discovering novel genetic variants associated with clinical phenotypes. As an increasing number of such association results from different biobanks become available, methods to meta-analyse those results are of great interest. Because the binary phenotypes in biobank-based studies are mostly unbalanced in their case-control ratios, very few methods can provide well-calibrated tests for associations. For example, traditional Z-score-based meta-analysis often results in conservative or anticonservative Type I error rates in such unbalanced scenarios. We propose two meta-analysis strategies that can efficiently combine association results from biobank-based studies with such unbalanced phenotypes, using the saddlepoint approximation-based score test method. Our first method involves sharing the overall genotype counts from each study, and the second method involves sharing an approximation of the distribution of the score test statistic from each study using cubic Hermite splines. We compare our proposed methods with a traditional Z-score-based meta-analysis strategy using numerical simulations and real data applications, and demonstrate the superior performance of our proposed methods in terms of Type I error control.
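For orientation, the traditional Z-score-based meta-analysis that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy study tuples, and the choice of weights (square root of the effective sample size 4 / (1/cases + 1/controls)) are assumptions, since the abstract does not state which weighting was used.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def effective_n(n_cases, n_controls):
    """Effective sample size of a case-control study (assumed weighting basis)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)

def signed_z(p_value, effect_sign):
    """Turn a two-sided p-value and an effect direction into a signed Z-score."""
    z = -STD_NORMAL.inv_cdf(p_value / 2.0)
    return z if effect_sign >= 0 else -z

def zscore_meta(z_scores, weights):
    """Weighted (Stouffer-type) combination; returns (Z_meta, two-sided p-value)."""
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    z_meta = numerator / denominator
    p_meta = 2.0 * STD_NORMAL.cdf(-abs(z_meta))
    return z_meta, p_meta

# Hypothetical per-study summaries: (p-value, effect sign, cases, controls).
studies = [(1e-3, +1, 500, 4500), (2e-2, +1, 200, 9800), (4e-1, -1, 1000, 1000)]
z_scores = [signed_z(p, sign) for p, sign, _, _ in studies]
weights = [sqrt(effective_n(cases, controls)) for _, _, cases, controls in studies]
z_meta, p_meta = zscore_meta(z_scores, weights)
print(f"Z_meta = {z_meta:.3f}, p_meta = {p_meta:.3g}")
```

The combination relies on each study's Z-score being approximately standard normal under the null, which is the assumption that can fail when case-control ratios are highly unbalanced.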

Keywords: GWAS; biobank; case-control studies; meta-analysis; saddlepoint approximation.
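The second strategy in the abstract, sharing a cubic Hermite spline approximation from each study, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the score statistic is taken to be S = Σ g_i (Y_i − μ_i) with independent Bernoulli(μ_i) phenotypes, the spline interpolates the cumulant generating function (CGF) K(t) itself from its values and slopes on an evenly spaced grid, and the toy genotypes, fitted probabilities, and node grid are made up. The abstract does not specify which function is splined or how the nodes are chosen.

```python
from bisect import bisect_right
from math import exp, log

def cgf_and_slope(t, g, mu):
    """Exact K(t) and K'(t) for S = sum_i g_i * (Y_i - mu_i), Y_i ~ Bernoulli(mu_i)."""
    k = dk = 0.0
    for gi, mi in zip(g, mu):
        e = mi * exp(gi * t)
        denom = 1.0 - mi + e
        k += log(denom) - t * gi * mi
        dk += gi * e / denom - gi * mi
    return k, dk

def summarize_study(g, mu, nodes):
    """What a study would share: CGF values and slopes at the spline nodes."""
    pairs = [cgf_and_slope(t, g, mu) for t in nodes]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def hermite_eval(nodes, vals, slopes, t):
    """Evaluate the piecewise cubic Hermite interpolant at t within the node range."""
    j = min(max(bisect_right(nodes, t) - 1, 0), len(nodes) - 2)
    h = nodes[j + 1] - nodes[j]
    s = (t - nodes[j]) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return (h00 * vals[j] + h10 * h * slopes[j]
            + h01 * vals[j + 1] + h11 * h * slopes[j + 1])

# Toy data for two hypothetical studies (genotypes g, fitted case probabilities mu).
g1, mu1 = [0, 1, 2, 0, 1, 0], [0.02] * 6
g2, mu2 = [1, 0, 0, 2], [0.10, 0.05, 0.10, 0.05]
nodes = [-2.0 + 0.25 * i for i in range(17)]  # grid on [-2, 2]
summaries = [summarize_study(g1, mu1, nodes), summarize_study(g2, mu2, nodes)]

def meta_cgf(t):
    """CGF of the summed score statistic: CGFs of independent studies add."""
    return sum(hermite_eval(nodes, vals, slopes, t) for vals, slopes in summaries)

exact = cgf_and_slope(0.6, g1, mu1)[0] + cgf_and_slope(0.6, g2, mu2)[0]
print(f"spline meta-CGF at t=0.6: {meta_cgf(0.6):.6f}, exact: {exact:.6f}")
```

Because only node values and slopes leave each study, individual-level data stay local, and the combined CGF can then be fed into a saddlepoint approximation of the meta-analysis p-value.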

Figures

Figure 1:
Type I error comparison between the Z-score-based meta-analysis and our proposed CGF-Spline and Genotype Count (GC) methods where the phenotypes, non-genetic covariates and the genotypes are simulated as described in simulation study 1. Joint represents the joint analysis with the pooled data. The top and the bottom panels show empirical type I error rates at genome-wide significance levels α = 5 × 10⁻⁵ and α = 5 × 10⁻⁸, respectively. From left to right, the plots consider the within-study case-control ratios 1:1, 1:9 and 1:49, respectively. In each plot, the X-axis represents MAFs with expected MACs per study in parentheses, and the Y-axis (in logarithmic scale) represents the empirical type I error rates. 95% confidence intervals at different MAFs are also presented.
Figure 2:
Type I error comparison between the Z-score-based meta-analysis and our proposed CGF-Spline and Genotype Count (GC) methods where the phenotypes, non-genetic covariates and the genotypes are simulated as described in simulation study 2. Joint represents the joint analysis with the pooled data. The top and the bottom panels show empirical type I error rates at genome-wide significance levels α = 5 × 10⁻⁵ and α = 5 × 10⁻⁸, respectively. From left to right, the plots consider the within-study case-control ratios 1:1, 1:9 and 1:49, respectively. In each plot, the X-axis represents different MAF groups: Rare (variant is rare in all studies), Low frequency (variant is low frequency in all studies), Common (variant is common in all studies) and Different AF (variant is in a different allele frequency group in at least two different studies). The Y-axis (in logarithmic scale) represents the empirical type I error rates. 95% confidence intervals at different MAFs are also presented.
Figure 3:
Type I error comparison between the Z-score-based meta-analysis and our proposed CGF-Spline and Genotype Count (GC) methods where the phenotypes, non-genetic covariates and the genotypes are simulated as described in simulation study 3. Joint represents the joint analysis with the pooled data. The top and the bottom panels show empirical type I error rates at genome-wide significance levels α = 5 × 10⁻⁵ and α = 5 × 10⁻⁸, respectively. The left and right panels consider the within-study case-control ratios 1:9 and 1:49, respectively, for the unbalanced studies. In each plot, the X-axis represents MAFs with expected MACs in parentheses, and the Y-axis (in logarithmic scale) represents the empirical type I error rates. 95% confidence intervals at different MAFs are also presented. The empirical type I error rates were almost identical between ZScore – fastSPA – 2 and ZScore – fastSPA – 0.1, and between GC – fastSPA – 2 and GC – fastSPA – 0.1, and hence the lines sometimes overlap in this plot.
Figure 4:
Power curves for the Z-score, CGF-Spline and Genotype Count (GC) methods. The top panel considers MAF = 0.01 and the bottom panel considers MAF = 0.05. From left to right, the plots consider case-control ratios 1:1, 1:9 and 1:49, respectively. In each plot, the X-axis represents genotype odds ratios and the Y-axis represents the empirical power. Empirical power was estimated from 5000 simulated datasets at type I error adjusted empirical α levels, chosen so that each method's empirical type I error equals 5 × 10⁻⁸.
Figure 5:
QQ plots for Ulcerative Colitis based on the UK Biobank interim release data. QQ plots using the Z-score method are provided in the left panel, and the QQ plots using our proposed methods are provided in the right panel. The plots are color-coded based on different MAF categories.
Figure 6:
QQ plots for Psoriasis based on the UK Biobank interim release data. QQ plots using the Z-score method are provided in the left panel, and the QQ plots using our proposed methods are provided in the right panel. The plots are color-coded based on different MAF categories.
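The empirical type I error rates and 95% confidence intervals reported in the simulation figures can be estimated from null-simulation p-values roughly as below. This is a sketch under assumptions: the function name is hypothetical, and the Wilson score interval is used only as one reasonable choice, since the captions do not say how the intervals were computed.

```python
from math import sqrt
from statistics import NormalDist

def empirical_type1_error(p_values, alpha, level=0.95):
    """Fraction of null p-values below alpha, with a Wilson score interval.

    Returns (rate, lower, upper). The Wilson interval is an assumed choice.
    """
    n = len(p_values)
    rejections = sum(1 for p in p_values if p < alpha)
    rate = rejections / n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    scale = 1.0 + z * z / n
    centre = (rate + z * z / (2.0 * n)) / scale
    half = z * sqrt(rate * (1.0 - rate) / n + z * z / (4.0 * n * n)) / scale
    return rate, max(0.0, centre - half), min(1.0, centre + half)

# Toy input: 5 rejections out of 100 simulated null tests at alpha = 0.05.
toy_p_values = [0.001] * 5 + [0.5] * 95
rate, lower, upper = empirical_type1_error(toy_p_values, alpha=0.05)
print(f"empirical type I error = {rate:.3f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

A well-calibrated test has an interval that covers the nominal α; estimating rates near 5 × 10⁻⁸ in this way requires a very large number of null simulations.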
