Searching for missing heritability: designing rare variant association studies

Or Zuk¹, Stephen F Schaffner, Kaitlin Samocha, Ron Do, Eliana Hechter, Sekar Kathiresan, Mark J Daly, Benjamin M Neale, Shamil R Sunyaev, Eric S Lander

PMID: 24443550
PMCID: PMC3910587
DOI: 10.1073/pnas.1322563111

Or Zuk et al. Proc Natl Acad Sci U S A. 2014 Jan 28;111(4):E455-64. doi: 10.1073/pnas.1322563111. Epub 2014 Jan 17.

Affiliation

¹ Broad Institute of Harvard and MIT, Cambridge, MA 02142.

Abstract

Genetic studies have identified thousands of loci predisposing to hundreds of human diseases and traits, revealing important biological pathways and defining novel therapeutic hypotheses. However, the genes discovered to date typically explain less than half of the apparent heritability. Because efforts have largely focused on common genetic variants, one hypothesis is that much of the missing heritability is due to rare genetic variants. Studies of common variants are typically referred to as genomewide association studies, whereas studies of rare variants are often simply called sequencing studies. Because they are actually closely related, we use the terms common variant association study (CVAS) and rare variant association study (RVAS). In this paper, we outline the similarities and differences between RVAS and CVAS and describe a conceptual framework for the design of RVAS. We apply the framework to address key questions about the sample sizes needed to detect association, the relative merits of testing disruptive alleles vs. missense alleles, frequency thresholds for filtering alleles, the value of predictors of the functional impact of missense alleles, the potential utility of isolated populations, the value of gene-set analysis, and the utility of de novo mutations. The optimal design depends critically on the selection coefficient against deleterious alleles and thus varies across genes. The analysis shows that common variant and rare variant studies require similarly large sample collections. In particular, a well-powered RVAS should involve discovery sets with at least 25,000 cases, together with a substantial replication set.

Keywords: mapping disease genes; power analysis; statistical genetics.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
Allele frequencies from simulations for various demographic models. (A) Combined allele frequency (CAF; averaged over 50,000 simulations) as a function of selection coefficient s. CAF is not sensitive to demographic history, as noted in the text. (B) Cumulative distribution of individual allele frequency (IAF) as a function of s, for the ancestral population at equilibrium. (C) Cumulative distribution of IAF as a function of s, for the European population. Compared with the ancestral population, the IAF distribution is skewed toward rare alleles for intermediate to strong selection s (*SI Appendix, Sections 2.2–2.3*).

**Fig. 2.**
Power to detect association for a typical gene. (A) Number of cases needed to detect association based on excess of disruptive disease-predisposing variants in cases vs. general population, as a function of selection coefficient s (or frequency f_D of disruptive alleles). Curves represent various effect sizes 1 + λ. (Values for 90% power and 5% false-positive rate after Bonferroni correction for testing 20,000 genes.) (*SI Appendix, Sections 3.2 and 4*). (B) Number of cases needed to detect association based on deficit of disruptive protective variants. Curves represent various values of (1 + λ)⁻¹. The curve for (1 + λ)⁻¹ = 2 corresponds to twofold protection (50% lower disease risk), whereas the curve for (1 + λ)⁻¹ = ∞ corresponds to complete protection. (C) Relative contribution of missense vs. disruptive variants for European model. Curves for various selection coefficients show ratio of expected LOD scores for testing an excess of rare missense variants with frequency below threshold T divided by expected LOD score for disruptive variants. Values are calculated for effect size for null alleles of 1 + λ = 4. For each s, there is an optimal threshold T* at which missense alleles provide maximal relative contribution (typically ∼1.0- to 1.3-fold) (*SI Appendix, Section 5*). (D) Relative contribution of missense (vs. disruptive) variants at optimal threshold T* as a function of 1 + λ for various populations. Solid lines show values when filtering missense alleles by frequency threshold T*. Dashed lines show values when also filtering to include only missense alleles predicted to be deleterious (by a high-quality predictor with false-positive and false-negative rates of 20%). The functional predictor increases the contribution of missense alleles (e.g., from 1.5-fold to 2.1-fold for genes with 1 + λ = 10 in the European population) (*SI Appendix, Sections 5 and 6*).
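The significance settings quoted for panel A (90% power, 5% false-positive rate, Bonferroni correction over 20,000 genes) can be illustrated with a rough calculation. The sketch below is not the paper's LOD-based method from the SI Appendix; it swaps in a textbook one-sample proportion test with a normal approximation, and it assumes that the allele frequency in cases is roughly the population frequency multiplied by 1 + λ (reasonable only for rare alleles and rare disease). The function names `bonferroni_alpha` and `cases_needed` are hypothetical.

```python
import math
from statistics import NormalDist


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return family_alpha / n_tests


def cases_needed(f_pop: float, relative_risk: float,
                 family_alpha: float = 0.05, n_genes: int = 20_000,
                 power: float = 0.90) -> int:
    """Approximate number of cases for a one-sided, one-sample test.

    f_pop         -- combined frequency of disruptive alleles in the
                     general population (treated as known)
    relative_risk -- effect size 1 + lambda

    Assumption (not from the source): frequency in cases is
    f_pop * relative_risk, and each case contributes two chromosomes.
    """
    p0 = f_pop
    p1 = f_pop * relative_risk
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("need 0 < f_pop < f_pop * relative_risk < 1")

    z_alpha = NormalDist().inv_cdf(1.0 - bonferroni_alpha(family_alpha, n_genes))
    z_beta = NormalDist().inv_cdf(power)

    # Standard sample-size formula for a one-sample proportion test.
    numerator = (z_alpha * math.sqrt(p0 * (1.0 - p0))
                 + z_beta * math.sqrt(p1 * (1.0 - p1))) ** 2
    chromosomes = numerator / (p1 - p0) ** 2
    return math.ceil(chromosomes / 2.0)


if __name__ == "__main__":
    print("per-gene alpha:", bonferroni_alpha(0.05, 20_000))
    for rr in (2.0, 4.0, 10.0):
        print(f"1 + lambda = {rr}: ~{cases_needed(1e-3, rr)} cases")
```

Under this approximation the required number of cases grows as the allele frequency or the effect size shrinks, which matches the qualitative shape described for the curves; the absolute numbers should not be read as the paper's results.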

**Fig. 3.**
Chances of being lucky. Figures show the right tail of the CAF (f_null) distribution for four selection coefficients s (10⁻³, 10^−2.5, 10⁻², 10^−1.5) and four demographic models. Curves show the probability that the realized value of the CAF (f_null) for all null alleles (in absolute terms and normalized to the expected value given the selection coefficient) exceeds the value on the x-axis, with results obtained from 50,000 simulations of gene histories for each value of s and demography. Finland and Iceland show heavy right tails (genes with CAF much larger than the expected value), because population bottlenecks scatter allele frequencies. For s = 10⁻³ in Finland, 3.5% of genes have CAF that is 10-fold higher than expected, making it possible to discover the genes with a 10-fold lower sample size than expected. The distributions depend on bottleneck size, number of generations since expansion, mutation rate, and selection coefficient (*SI Appendix, Section 8*). Tighter bottlenecks, as in Finland vs. Iceland, allow fewer alleles to pass, but result in greater proportional increase in allele frequency. (Calculations assume μ_null = 5 × 10⁻⁶, corresponding to α = 25%.)
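For orientation, the "expected value" against which the CAF is normalized can be approximated with the classical mutation-selection balance result, in which the equilibrium combined frequency of deleterious alleles is roughly μ/s. The sketch below applies that textbook approximation to the four selection coefficients and the μ_null value quoted in the caption; it is an assumption made for illustration, not the simulation procedure of the SI Appendix, and `expected_caf` is a hypothetical helper name.

```python
def expected_caf(mu: float, s: float) -> float:
    """Approximate equilibrium combined allele frequency.

    Uses the classical mutation-selection balance approximation
    f ~ mu / s (an illustrative assumption, valid when s >> mu).
    """
    if mu <= 0.0 or s <= 0.0:
        raise ValueError("mu and s must be positive")
    return mu / s


MU_NULL = 5e-6  # per-gene rate of null mutations quoted in the caption

if __name__ == "__main__":
    for exponent in (-3.0, -2.5, -2.0, -1.5):
        s = 10.0 ** exponent
        print(f"s = 10^{exponent}: expected CAF ~ {expected_caf(MU_NULL, s):.2e}")
```

A gene whose realized CAF is 10-fold above this expectation carries proportionally more null alleles in the sample, which is why such "lucky" genes can be detected with far fewer cases.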

Grants and funding

U54 HG003067/HG/NHGRI NIH HHS/United States
