Robust meta-analysis of biobank-based genome-wide association studies with unbalanced binary phenotypes

doi:10.1002/gepi.22197

Meta-Analysis

Genet Epidemiol. 2019 Jul;43(5):462-476.

doi: 10.1002/gepi.22197. Epub 2019 Feb 22.

Rounak Dey¹, Jonas B Nielsen², Lars G Fritsche¹, Wei Zhou³, Huanhuan Zhu⁴, Cristen J Willer³,⁵,⁶, Seunggeun Lee¹

Affiliations

¹ Department of Biostatistics and Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, Michigan.
² Department of Epidemiology Research, Statens Serum Institut, Copenhagen, Denmark.
³ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan.
⁴ Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan.
⁵ Department of Internal Medicine, Division of Cardiovascular Medicine, University of Michigan, Ann Arbor, Michigan.
⁶ Department of Human Genetics, University of Michigan, Ann Arbor, Michigan.

PMID: 30793809
PMCID: PMC6559837
DOI: 10.1002/gepi.22197

Rounak Dey et al. Genet Epidemiol. 2019 Jul.

Abstract

With the availability of large-scale biobanks, genome-wide scale phenome-wide association studies have become instrumental in discovering novel genetic variants associated with clinical phenotypes. As an increasing number of such association results from different biobanks become available, methods to meta-analyse those results are of great interest. Because the binary phenotypes in biobank-based studies are mostly unbalanced in their case-control ratios, very few methods can provide well-calibrated tests for association. For example, traditional Z-score-based meta-analysis often yields conservative or anticonservative Type I error rates in such unbalanced scenarios. We propose two meta-analysis strategies that can efficiently combine association results from biobank-based studies with such unbalanced phenotypes, using the saddlepoint approximation-based score test method. Our first method involves sharing the overall genotype counts from each study, and the second method involves sharing an approximation of the distribution of the score test statistic from each study using cubic Hermite splines. We compare our proposed methods with a traditional Z-score-based meta-analysis strategy using numerical simulations and real data applications, and demonstrate the superior performance of our proposed methods in terms of Type I error control.

Keywords: GWAS; biobank; case-control studies; meta-analysis; saddlepoint approximation.
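
The contrast the abstract draws between a normal (Z-score) tail and a saddlepoint tail can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an intercept-only null model with no covariates, so that each study's null case probability is its case fraction and its genotype counts are enough to write down the cumulant generating function (CGF) of its score statistic. It also assumes the combined statistic is the unweighted sum of the study score statistics, so that the CGFs of independent studies simply add. The function names, the bisection bounds and the example counts are all hypothetical.

```python
import math

def study_cgf(t, counts, mu):
    """Return K(t), K'(t), K''(t) for one study's centered score statistic.

    Assumption: S = sum_i (g_i - gbar) * (Y_i - mu) with Y_i ~ Bernoulli(mu),
    where counts = (n0, n1, n2) are the genotype counts and mu the case fraction.
    """
    n = sum(counts)
    gbar = (counts[1] + 2 * counts[2]) / n
    k0 = k1 = k2 = 0.0
    for g, n_g in enumerate(counts):
        c = g - gbar                          # centered genotype value
        e = math.exp(c * t)
        d = 1.0 - mu + mu * e
        k0 += n_g * (math.log(d) - t * c * mu)
        k1 += n_g * (mu * c * e / d - c * mu)
        k2 += n_g * (1.0 - mu) * mu * c * c * e / (d * d)
    return k0, k1, k2

def meta_cgf(t, studies):
    """CGF (and two derivatives) of the summed statistic over independent studies."""
    parts = [study_cgf(t, counts, mu) for counts, mu in studies]
    return tuple(sum(p[j] for p in parts) for j in range(3))

def normal_sf(x):
    """Upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def spa_upper_tail(s, studies, lo=-20.0, hi=20.0):
    """Approximate P(S >= s) with a Barndorff-Nielsen-type saddlepoint formula."""
    for _ in range(200):                      # bisection on K'(t) = s (K' is increasing)
        mid = 0.5 * (lo + hi)
        if meta_cgf(mid, studies)[1] < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)
    if abs(t_hat) < 1e-10:                    # s sits at the null mean
        return 0.5
    k0, _, k2 = meta_cgf(t_hat, studies)
    w = math.copysign(math.sqrt(2.0 * (t_hat * s - k0)), t_hat)
    v = t_hat * math.sqrt(k2)
    return normal_sf(w + math.log(v / w) / w)

def normal_upper_tail(s, studies):
    """Normal approximation using the null variance K''(0)."""
    return normal_sf(s / math.sqrt(meta_cgf(0.0, studies)[2]))

# Two hypothetical studies: a rare variant with a 1:49 case-control ratio in each.
studies = [((9800, 200, 0), 0.02), ((4900, 100, 0), 0.02)]
s_obs = 12.0
print(spa_upper_tail(s_obs, studies), normal_upper_tail(s_obs, studies))
```

With rare variants and few cases the score statistic is right-skewed, so in this sketch the normal tail comes out much smaller than the saddlepoint tail. That is the anticonservative behaviour the abstract attributes to Z-score-based meta-analysis. For a common variant with balanced cases and controls the two tails nearly coincide.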

Figures

**Figure 1:**
Type I error comparison between the Z-score based meta-analysis and our proposed CGF-Spline and Genotype Count (GC) methods where the phenotypes, non-genetic covariates and the genotypes are simulated as described in simulation study 1. Joint represents the joint analysis with the pooled data. The top and the bottom panels show empirical type I error rates at genome-wide significance levels α = 5 × 10⁻⁵ and α = 5 × 10⁻⁸, respectively. From left to right, the plots consider the within-study case-control ratios 1:1, 1:9 and 1:49, respectively. In each plot, the X-axis represents MAFs with expected MACs per study in parentheses, and the Y-axis (in logarithmic scale) represents the empirical type I error rates. 95% confidence intervals at different MAFs are also presented.

**Figure 2:**
Type I error comparison between the Z-score based meta-analysis and our proposed CGF-Spline and Genotype Count (GC) methods where the phenotypes, non-genetic covariates and the genotypes are simulated as described in simulation study 2. Joint represents the joint analysis with the pooled data. The top and the bottom panels show empirical type I error rates at genome-wide significance levels α = 5 × 10⁻⁵ and α = 5 × 10⁻⁸, respectively. From left to right, the plots consider the within-study case-control ratios 1:1, 1:9 and 1:49, respectively. In each plot, the X-axis represents different MAF groups: Rare (variant is rare in all studies), Low frequency (variant is low frequency in all studies), Common (variant is common in all studies) and Different AF (variant is in different allele frequency group in at least two different studies). The Y-axis (in logarithmic scale) represents the empirical type I error rates. 95% confidence intervals at different MAFs are also presented.

**Figure 3:**
Type I error comparison between the Z-score based meta-analysis and our proposed CGF-Spline and Genotype Count (GC) methods where the phenotypes, non-genetic covariates and the genotypes are simulated as described in simulation study 3. Joint represents the joint analysis with the pooled data. The top and the bottom panels show empirical type I error rates at genome-wide significance levels α = 5 × 10⁻⁵ and α = 5 × 10⁻⁸, respectively. The left and right panels consider the within-study case-control ratios 1:9 and 1:49, respectively, for the unbalanced studies. In each plot, the X-axis represents MAFs with expected MACs in parentheses, and the Y-axis (in logarithmic scale) represents the empirical type I error rates. 95% confidence intervals at different MAFs are also presented. The empirical type I error rates were almost identical between ZScore – fastSPA – 2 and ZScore – fastSPA – 0.1, and between GC – fastSPA – 2 and GC – fastSPA – 0.1, and hence the lines sometimes overlap in this plot.

**Figure 4:**
Power curves for the Z-score, CGF-Spline and Genotype Count (GC) methods. The top panel considers MAF = 0.01 and the bottom panel considers MAF = 0.05. From left to right, the plots consider case-control ratios 1:1, 1:9 and 1:49, respectively. In each plot, the X-axis represents genotype odds ratios and the Y-axis represents the empirical power. Empirical power was estimated from 5000 simulated datasets, with each method evaluated at the adjusted α level at which its empirical type I error rate equals 5 × 10⁻⁸.

**Figure 5:**
QQ plots for Ulcerative Colitis based on the UK Biobank interim release data. QQ plots using the Z-score method are provided in the left panel, and the QQ plots using our proposed methods are provided on the right panel. The plots are color-coded based on different MAF categories.

**Figure 6:**
QQ plots for Psoriasis based on the UK Biobank interim release data. QQ plots using the Z-score method are provided in the left panel, and the QQ plots using our proposed methods are provided on the right panel. The plots are color-coded based on different MAF categories.
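
The empirical type I error rates and 95% confidence intervals that the simulation figures report can be computed along the following lines. This is a sketch only: the captions do not say which interval was used, so the Wilson score interval here is an assumption, and the function names and the toy p-values are hypothetical.

```python
import math

def empirical_type1_error(p_values, alpha):
    """Count null-simulation p-values below alpha and return (hits, n, rate)."""
    n = len(p_values)
    hits = sum(1 for p in p_values if p < alpha)
    return hits, n, hits / n

def wilson_interval(hits, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    phat = hits / n
    denom = 1.0 + z * z / n
    center = (phat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(phat * (1.0 - phat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Toy input: evenly spaced p-values standing in for null simulation results.
p_values = [(i + 0.5) / 1000 for i in range(1000)]
hits, n, rate = empirical_type1_error(p_values, alpha=0.05)
low, high = wilson_interval(hits, n)
print(rate, low, high)
```

At thresholds as small as 5 × 10⁻⁸ very few null replicates fall below α, so an interval that behaves sensibly for zero or near-zero counts is preferable to the plain normal (Wald) interval.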

Cited by

Establishing a Structured Hypospadias Biobank Cohort for Integrated Research: Methodology, Comprehensive Database Integration, and Phenotyping.
Abbas TO, Al-Shafai K, Jamil A, Mancha M, Azzah A, Arar S, Kumar S, Al Massih A, Mackeh R, Tomei S, Saraiva LR. Diagnostics (Basel). 2025 Feb 26;15(5):561. doi: 10.3390/diagnostics15050561. PMID: 40075808.
A Fast and Accurate Method for Genome-wide Scale Phenome-wide G × E Analysis and Its Application to UK Biobank.
Bi W, Zhao Z, Dey R, Fritsche LG, Mukherjee B, Lee S. Am J Hum Genet. 2019 Dec 5;105(6):1182-1192. doi: 10.1016/j.ajhg.2019.10.008. Epub 2019 Nov 14. PMID: 31735295.
Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks.
Dey R, Zhou W, Kiiskinen T, Havulinna A, Elliott A, Karjalainen J, Kurki M, Qin A; FinnGen; Lee S, Palotie A, Neale B, Daly M, Lin X. Nat Commun. 2022 Sep 16;13(1):5437. doi: 10.1038/s41467-022-32885-x. PMID: 36114182.
A Fast and Accurate Method for Genome-Wide Time-to-Event Data Analysis and Its Application to UK Biobank.
Bi W, Fritsche LG, Mukherjee B, Kim S, Lee S. Am J Hum Genet. 2020 Aug 6;107(2):222-233. doi: 10.1016/j.ajhg.2020.06.003. Epub 2020 Jun 25. PMID: 32589924.
Cross-ancestry genome-wide meta-analysis of 61,047 cases and 947,237 controls identifies new susceptibility loci contributing to lung cancer.
Byun J, Han Y, Li Y, Xia J, Long E, Choi J, Xiao X, Zhu M, Zhou W, Sun R, Bossé Y, Song Z, Schwartz A, Lusk C, Rafnar T, Stefansson K, Zhang T, Zhao W, Pettit RW, Liu Y, Li X, Zhou H, Walsh KM, Gorlov I, Gorlova O, Zhu D, Rosenberg SM, Pinney S, Bailey-Wilson JE, Mandal D, de Andrade M, Gaba C, Willey JC, You M, Anderson M, Wiencke JK, Albanes D, Lam S, Tardon A, Chen C, Goodman G, Bojeson S, Brenner H, Landi MT, Chanock SJ, Johansson M, Muley T, Risch A, Wichmann HE, Bickeböller H, Christiani DC, Rennert G, Arnold S, Field JK, Shete S, Le Marchand L, Melander O, Brunnstrom H, Liu G, Andrew AS, Kiemeney LA, Shen H, Zienolddiny S, Grankvist K, Johansson M, Caporaso N, Cox A, Hong YC, Yuan JM, Lazarus P, Schabath MB, Aldrich MC, Patel A, Lan Q, Rothman N, Taylor F, Kachuri L, Witte JS, Sakoda LC, Spitz M, Brennan P, Lin X, McKay J, Hung RJ, Amos CI. Nat Genet. 2022 Aug;54(8):1167-1177. doi: 10.1038/s41588-022-01115-x. Epub 2022 Aug 1. PMID: 35915169.

Grants and funding

R01 HG008773/HG/NHGRI NIH HHS/United States
