Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies

Wei Zhou^{1,2}, Jonas B Nielsen³, Lars G Fritsche^{2,4,5}, Rounak Dey^{2,5}, Maiken E Gabrielsen⁴, Brooke N Wolford^{1,2}, Jonathon LeFaive^{2,5}, Peter VandeHaar^{2,5}, Sarah A Gagliano^{2,5}, Aliya Gifford⁶, Lisa A Bastarache⁶, Wei-Qi Wei⁶, Joshua C Denny^{6,7}, Maoxuan Lin³, Kristian Hveem^{4,8}, Hyun Min Kang^{2,5}, Goncalo R Abecasis^{2,5}, Cristen J Willer^{9,10,11}, Seunggeun Lee^{12,13}

Affiliations

¹ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
² Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA.
³ Department of Internal Medicine, Division of Cardiology, University of Michigan Medical School, Ann Arbor, MI, USA.
⁴ K. G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, Norwegian University of Science and Technology, Trondheim, Norway.
⁵ Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA.
⁶ Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA.
⁷ Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA.
⁸ HUNT Research Centre, Department of Public Health and General Practice, Norwegian University of Science and Technology, Levanger, Norway.
⁹ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA. cristen@umich.edu.
¹⁰ Department of Internal Medicine, Division of Cardiology, University of Michigan Medical School, Ann Arbor, MI, USA. cristen@umich.edu.
¹¹ Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI, USA. cristen@umich.edu.
¹² Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA. leeshawn@umich.edu.
¹³ Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA. leeshawn@umich.edu.

PMID: 30104761
PMCID: PMC6119127
DOI: 10.1038/s41588-018-0184-y

Wei Zhou et al. Nat Genet. 2018 Sep.

Nat Genet. 2018 Sep;50(9):1335-1341.

doi: 10.1038/s41588-018-0184-y. Epub 2018 Aug 13.

Abstract

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for >1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
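To illustrate why calibrating the score statistic with a saddlepoint approximation matters for unbalanced traits, the following is a minimal sketch, not the SAIGE implementation. It assumes independent samples (no random effect or genetic relationship matrix), takes the fitted null probabilities `mu` as given, and adjusts the genotype by simple centering, which is only appropriate for an intercept-only null model. The function names, the `cutoff` parameter, and the toy data are all hypothetical. The score statistic is S = Σ gᵢ(yᵢ − μᵢ), and its tail probability is approximated with the Barndorff-Nielsen form of the saddlepoint formula applied to the cumulant generating function of a sum of independent Bernoulli variables.

```python
import math

def norm_sf(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def expit(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def log1pexp(x):
    """Overflow-safe log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

def cgf(t, g, mu):
    """K(t), K'(t), K''(t) for S = sum_i g_i * (Y_i - mu_i), Y_i ~ Bernoulli(mu_i)."""
    k0 = k1 = k2 = 0.0
    for gi, mi in zip(g, mu):
        eta = gi * t + math.log(mi / (1.0 - mi))
        p = expit(eta)  # success probability under exponential tilting by t
        k0 += math.log(1.0 - mi) + log1pexp(eta) - t * gi * mi
        k1 += gi * (p - mi)
        k2 += gi * gi * p * (1.0 - p)
    return k0, k1, k2

def saddlepoint(s, g, mu, iters=60):
    """Solve K'(t) = s by bisection (K' is increasing); None if s is out of range."""
    sign = 1.0 if s > 0 else -1.0
    lo, hi = 0.0, sign
    for _ in range(40):  # expand the bracket away from t = 0
        if sign * (cgf(hi, g, mu)[1] - s) >= 0:
            break
        lo, hi = hi, 2.0 * hi
    else:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sign * (cgf(mid, g, mu)[1] - s) >= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def tail_prob(s, g, mu):
    """Saddlepoint estimate of P(S >= s) for s > 0, or P(S <= s) for s < 0."""
    t = saddlepoint(s, g, mu)
    if t is None:
        return 0.0
    k0, _, k2 = cgf(t, g, mu)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    v = t * math.sqrt(k2)
    return norm_sf(abs(w + math.log(v / w) / w))

def spa_pvalue(g, y, mu, cutoff=2.0):
    """Return (saddlepoint p-value, normal-approximation p-value), both two-sided.

    Assumption: genotypes are adjusted by centering only (intercept-only null).
    Within `cutoff` standard deviations the normal approximation is returned for
    both, because the saddlepoint formula is unstable near the mean.
    """
    gbar = sum(g) / len(g)
    gc = [gi - gbar for gi in g]
    s = sum(gi * (yi - mi) for gi, yi, mi in zip(gc, y, mu))
    var = sum(gi * gi * mi * (1.0 - mi) for gi, mi in zip(gc, mu))
    p_norm = 2.0 * norm_sf(abs(s) / math.sqrt(var))
    if abs(s) < cutoff * math.sqrt(var):
        return p_norm, p_norm
    return tail_prob(abs(s), gc, mu) + tail_prob(-abs(s), gc, mu), p_norm

# Toy data (hypothetical): 2,000 samples, 40 cases (1:49), 20 variant carriers,
# 4 of whom are cases.
g = [1] * 20 + [0] * 1980
y = [1] * 4 + [0] * 16 + [1] * 36 + [0] * 1944
p_spa, p_norm = spa_pvalue(g, y, [0.02] * 2000)
print(f"saddlepoint p = {p_spa:.3g}, normal p = {p_norm:.3g}")
```

In this toy setting the score statistic is strongly right-skewed because carriers who are cases are rare, so the normal approximation understates the upper-tail probability by orders of magnitude. That is the kind of type I error inflation the abstract attributes to methods that lack this calibration. With a balanced trait the two p-values are close.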

Conflict of interest statement

COMPETING FINANCIAL INTERESTS STATEMENT

The authors declare no competing financial interests.

Figures

**Figure 1**
Manhattan plots of GWAS results for four binary phenotypes with various case-control ratios in the UK Biobank. GWAS results from SAIGE, SAIGE-NoSPA (asymptotically equivalent to GMMAT) and BOLT-LMM are shown for A. coronary artery disease (PheCode 411, case:control = 1:12, N = 408,458), B. colorectal cancer (PheCode 153, case:control = 1:84, N = 387,318), C. glaucoma (PheCode 365, case:control = 1:89, N = 402,223), and D. thyroid cancer (PheCode 193, case:control = 1:1138, N = 407,757). N: sample size. Blue: loci with association p-value < 5×10⁻⁸ that have been previously reported. Green: loci with association p-value < 5×10⁻⁸ that have not been reported before. Since results from SAIGE-NoSPA and BOLT-LMM contain many false-positive signals for colorectal cancer, glaucoma, and thyroid cancer, the significant loci are not highlighted. The upper dashed line marks the break point for the different scales of the y axis and the lower dashed line marks the genome-wide significance threshold (p-value = 5×10⁻⁸).

**Figure 2**
Quantile-quantile plots of GWAS results for four binary phenotypes with various case-control ratios in the UK Biobank. GWAS results from SAIGE, SAIGE-NoSPA (asymptotically equivalent to GMMAT) and BOLT-LMM are shown for A. coronary artery disease (PheCode 411, case:control = 1:12, N = 408,458), B. colorectal cancer (PheCode 153, case:control = 1:84, N = 387,318), C. glaucoma (PheCode 365, case:control = 1:89, N = 402,223), and D. thyroid cancer (PheCode 193, case:control = 1:1138, N = 407,757). N: sample size.
