Front Genet. 2016 Feb 16;7:15.
doi: 10.3389/fgene.2016.00015. eCollection 2016.

Estimating Effect Sizes and Expected Replication Probabilities from GWAS Summary Statistics

Dominic Holland et al. Front Genet. 2016.

Abstract

Genome-wide Association Studies (GWAS) result in millions of summary statistics ("z-scores") for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to recent GWAS of schizophrenia (N = 82,315) and putamen volume (N = 12,596), with approximately 9.3 million SNP z-scores in both cases. We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We assess the degree to which effect sizes are over-estimated when based on linear-regression association coefficients. 
We estimate the polygenicity of schizophrenia to be 0.037 and that of putamen volume to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10⁶ and 10⁵. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.
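The mixture-of-Gaussians idea described in the abstract can be illustrated with a minimal sketch. It assumes z = δ + ε with unit-variance noise and a prior on δ that is a weighted sum of zero-mean Gaussians. The function names, the two-component setup, and the numeric variances below are illustrative assumptions, not the paper's fitted four-parameter model; using 0.037 as a mixture weight is likewise only a stand-in for the reported polygenicity.

```python
import math


def normal_pdf(x, var):
    """Density at x of a zero-mean normal distribution with variance `var`."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_effect(z, weights, effect_vars, noise_var=1.0):
    """Posterior mean and variance of delta given an observed z-score.

    Assumed model (illustrative): z = delta + eps, eps ~ N(0, noise_var),
    delta ~ sum_k weights[k] * N(0, effect_vars[k]).
    """
    # Marginally, z under component k is N(0, effect_vars[k] + noise_var).
    likes = [w * normal_pdf(z, v + noise_var)
             for w, v in zip(weights, effect_vars)]
    total = sum(likes)
    resp = [like / total for like in likes]  # posterior component weights

    # Within component k: E[delta | z] = s_k * z and Var[delta | z] = s_k * noise_var,
    # where s_k = effect_vars[k] / (effect_vars[k] + noise_var) is the shrinkage factor.
    shrink = [v / (v + noise_var) for v in effect_vars]
    mean = sum(r * s * z for r, s in zip(resp, shrink))
    second_moment = sum(r * (s * noise_var + (s * z) ** 2)
                        for r, s in zip(resp, shrink))
    return mean, second_moment - mean ** 2


# Hypothetical parameters: a common low-variance component and a rare
# high-variance ("sparse") component.
WEIGHTS = [0.963, 0.037]
EFFECT_VARS = [0.2, 6.0]

for z in (1.0, 3.0, 6.0):
    m, v = posterior_effect(z, WEIGHTS, EFFECT_VARS)
    print(f"z = {z:.1f}: posterior mean = {m:.3f}, posterior variance = {v:.3f}")
```

Under these assumptions small z-scores are shrunk strongly toward zero, because they are most plausibly explained by the low-variance component, while large z-scores are shrunk much less.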

Keywords: GWAS; Gaussian mixture model; SNP discovery; effect size; heritability; putamen; schizophrenia.

Figures

Figure 1
For schizophrenia, posterior estimate of (A) effect size and (B) variance; (C) estimate of replication probability for zt = −1.64 (i.e., pt = 0.05): empirical (black solid lines), current model (red solid lines), model with no ubiquitous effects (green dashed lines), and model with no sparse effects (blue dashed lines), for split-half discovery and replication data.
Figure 2
Posterior effect size and variance, calculated for effective sample size Nd = Nr = N ≃ 34,000; see Equations 12, 13, 20 and Supplementary Material. Note that sparse effects have a component that arises from ubiquitous effects. z = δ + ϵ, where δ = δu + δs and E(ϵ) = 0; δu are ubiquitous effects, while δs are additional contributions to total sparse effects.
Figure 3
Randomly culling SNPs with LD r² ≥ 0.8 and further restricting to SNPs with total LD (TLD) less than 15 (approximately 1 million SNPs remaining) shows a diminution in the extent of ubiquitous effects (decreased slope near the origin for the red curve), consistent with an interpretation that the ubiquitous effects arise due to LD with causal SNPs. The black plot is for light random pruning at r² ≥ 0.8, shown in Figure 1A.
Figure 4
(A) Empirical and model QQ plots for putamen volume and schizophrenia. (B) Proportion of total additive genetic variance or chip heritability explained by sparse effects for all “tagged” SNPs with p-value less than the GWAS p-value threshold (pt = 5×10⁻⁸), as a function of effective sample size, for putamen volume and schizophrenia (the asterisks correspond to the current effective sample sizes for ENIGMA and PGC2). Of the total variance that is explained by sparse effects for all SNPs, the proportion explained by SNPs currently reaching the usual GWAS significance level is approximately 15% for both phenotypes.
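As a rough illustration of how a model QQ curve like the one in panel (A) could be traced, the sketch below assumes the z-scores follow a mixture of zero-mean Gaussians with unit-variance noise, so that the model tail proportion is a weighted sum of normal tail probabilities. The function names and parameter values are hypothetical and are not the fitted values from the paper.

```python
import math


def two_sided_p(z, var=1.0):
    """P(|Z| > |z|) for Z ~ N(0, var)."""
    return math.erfc(abs(z) / math.sqrt(2.0 * var))


def model_qq_point(z, weights, effect_vars, noise_var=1.0):
    """One point of a model QQ curve at z-score threshold z.

    Returns (x, y), where
      x = -log10 of the proportion of SNPs the mixture predicts beyond |z|,
      y = -log10 of the nominal (null) two-sided p-value at |z|.
    Assumed model (illustrative): z ~ sum_k weights[k] * N(0, effect_vars[k] + noise_var).
    """
    model_tail = sum(w * two_sided_p(z, v + noise_var)
                     for w, v in zip(weights, effect_vars))
    nominal = two_sided_p(z)
    return -math.log10(model_tail), -math.log10(nominal)


# Hypothetical mixture parameters, for illustration only.
WEIGHTS = [0.963, 0.037]
EFFECT_VARS = [0.2, 6.0]

for z in (1.0, 2.0, 4.0, 5.45):  # |z| of about 5.45 corresponds to p of about 5e-8
    x, y = model_qq_point(z, WEIGHTS, EFFECT_VARS)
    print(f"|z| = {z:.2f}: model -log10(q) = {x:.2f}, nominal -log10(p) = {y:.2f}")
```

With all effect variances set to zero the two coordinates coincide, which is the null diagonal of a QQ plot; nonzero effect variances move the curve off that diagonal. Very large |z| values would underflow `math.erfc` in this simple form.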
Figure 5
For schizophrenia, (A) posterior estimates of effect-size-squared, as given by Equation 25, vs. z² for three total sample sizes. When assuming that the phenotypic variance explained by a SNP is given by β̂²H ≃ z²/N, the degree to which this is an over-estimate is indicated by the ratio of the height of the black dashed line (the assumption δ² = z²) to the height of the corresponding point on the curve for a given sample size. The asterisks correspond to the threshold significant z-score. (B) For a multistage GWAS, where discovery is from a subset (20%, 50%, 90%) of the total PGC2 sample, the curves give the probability of a SNP with p-value p in the discovery sample passing genome-wide significance (pdr < pt = 5×10⁻⁸) in the combined (total) data set, Equation 29. The vertical gray line is at p = pt.
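The multistage calculation in panel (B) can be sketched under simplifying assumptions. Here the combined z-score is taken to be the sample-size-weighted sum z_tot = √f·z_d + √(1−f)·z_r for discovery fraction f, the true effect at the total sample size has a zero-mean Gaussian mixture prior, and "passing" means exceeding the two-sided threshold in the same direction as the discovery z-score. These choices, the function name, and the parameter values are assumptions made for illustration; this is not a reproduction of the paper's Equation 29.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def replication_probability(z_d, frac, weights, total_vars, p_t=5e-8):
    """P(combined z passes threshold p_t in the direction of z_d | discovery z_d).

    frac       : fraction of the total sample used for discovery (0 < frac < 1)
    total_vars : prior variances of the true effect at the TOTAL sample size
                 (hypothetical mixture components with prior `weights`)
    """
    z_t = -STD_NORMAL.inv_cdf(p_t / 2.0)  # two-sided threshold, about 5.45
    a = abs(z_d)
    likes, probs = [], []
    for w, v in zip(weights, total_vars):
        # Discovery z-score under this component: N(0, frac * v + 1).
        marg_var = frac * v + 1.0
        likes.append(w * NormalDist(0.0, math.sqrt(marg_var)).pdf(a))
        # Posterior of the total-sample effect given z_d, within the component.
        post_mean = math.sqrt(frac) * v * a / marg_var
        post_var = v / marg_var
        # Predictive distribution of the combined z-score given z_d.
        mu = math.sqrt(frac) * a + (1.0 - frac) * post_mean
        sd = math.sqrt((1.0 - frac) ** 2 * post_var + (1.0 - frac))
        probs.append(STD_NORMAL.cdf((mu - z_t) / sd))
    total = sum(likes)
    return sum(like / total * p for like, p in zip(likes, probs))


# Hypothetical mixture parameters, for illustration only.
WEIGHTS = [0.96, 0.04]
TOTAL_VARS = [0.5, 20.0]

for frac in (0.2, 0.5, 0.9):
    for z_d in (3.0, 5.0, 7.0):
        p = replication_probability(z_d, frac, WEIGHTS, TOTAL_VARS)
        print(f"discovery fraction {frac:.1f}, z_d = {z_d:.1f}: P(pass) = {p:.3f}")
```

The predictive mean combines the observed discovery signal with the posterior expectation of the as-yet-unseen replication signal, which is why shrinkage of the discovery effect feeds directly into the replication probability.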
