BMC Bioinformatics. 2019 Jan 22;20(1):46.

doi: 10.1186/s12859-018-2591-6.

Real world scenarios in rare variant association analysis: the impact of imbalance and sample size on the power in silico

Xinyuan Zhang¹, Anna O Basile², Sarah A Pendergrass³, Marylyn D Ritchie⁴,⁵

Affiliations

¹ Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
² Department of Biomedical Informatics, Columbia University, New York, NY, USA.
³ Biomedical and Translational Informatics Institute, Geisinger, Danville, PA, USA.
⁴ Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA. marylyn@pennmedicine.upenn.edu.
⁵ Department of Genetics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA. marylyn@pennmedicine.upenn.edu.

PMID: 30669967
PMCID: PMC6343276
DOI: 10.1186/s12859-018-2591-6

Xinyuan Zhang et al. BMC Bioinformatics. 2019.

Abstract

Background: The development of sequencing techniques and statistical methods provides great opportunities for identifying the impact of rare genetic variation on complex traits. However, little is known about how sample size, case numbers, and the balance of cases vs controls affect both burden-based and dispersion-based rare variant association methods. For example, Phenome-Wide Association Studies may have a wide range of case and control sample sizes across hundreds of diagnoses and traits, and when statistical methods are applied to rare variants, it is important to understand the strengths and limitations of the analyses.

Results: We conducted a large-scale simulation of randomly selected low-frequency protein-coding regions using twelve different balanced samples with an equal number of cases and controls as well as twenty-one unbalanced sample scenarios. We further explored statistical performance of different minor allele frequency thresholds and a range of genetic effect sizes. Our simulation results demonstrate that using an unbalanced study design has an overall higher type I error rate for both burden and dispersion tests compared with a balanced study design. Regression has an overall higher type I error with balanced cases and controls, while SKAT has higher type I error for unbalanced case-control scenarios. We also found that both type I error and power were driven by the number of cases in addition to the case to control ratio under large control group scenarios. Based on our power simulations, we observed that a SKAT analysis with case numbers larger than 200 for unbalanced case-control models yielded over 90% power with relatively well controlled type I error. To achieve similar power in regression, over 500 cases are needed. Moreover, SKAT showed higher power to detect associations in unbalanced case-control scenarios than regression.

Conclusions: Our results provide important insights into rare variant association study designs by providing a landscape of type I error and statistical power for a wide range of sample sizes. These results can serve as a benchmark for making decisions about study design for rare variant analyses.

Keywords: Power analysis; Rare variant association analysis; Sample size imbalance; Simulation study.
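
The kind of simulation the abstract describes (estimating type I error and power across balanced and unbalanced designs) can be illustrated with a small Monte Carlo sketch. This is not the authors' pipeline: the study evaluated regression and SKAT, whereas the sketch below swaps in a much simpler collapsing test (carrier vs non-carrier of any rare variant in a region) compared between cases and controls with a pooled two-proportion z-test. All function names, the carrier frequency, the odds ratio, and the number of replicates are hypothetical choices made for illustration only.

```python
import math
import random


def carrier_prob(p_control, odds_ratio):
    """Case carrier probability implied by a control carrier probability and an odds ratio."""
    odds = odds_ratio * p_control / (1.0 - p_control)
    return odds / (1.0 + odds)


def two_proportion_pvalue(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 1.0  # no carriers (or all carriers): no evidence either way
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_rate(n_cases, n_controls, p_control, odds_ratio,
                   n_reps=200, alpha=0.05, seed=1):
    """Fraction of simulated datasets in which the test rejects at level alpha.

    With odds_ratio == 1.0 this estimates the type I error rate;
    with odds_ratio != 1.0 it estimates power. (Assumed, illustrative setup.)
    """
    rng = random.Random(seed)
    p_case = carrier_prob(p_control, odds_ratio)
    rejections = 0
    for _ in range(n_reps):
        case_carriers = sum(rng.random() < p_case for _ in range(n_cases))
        control_carriers = sum(rng.random() < p_control for _ in range(n_controls))
        p = two_proportion_pvalue(case_carriers, n_cases, control_carriers, n_controls)
        if p < alpha:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    # Hypothetical unbalanced design: 200 cases, 2000 controls,
    # 5% of controls carrying at least one rare variant in the region.
    print("type I error:", rejection_rate(200, 2000, 0.05, odds_ratio=1.0))
    print("power:       ", rejection_rate(200, 2000, 0.05, odds_ratio=3.0))
```

Rerunning `rejection_rate` over a grid of case and control counts gives a rough picture of how error rates and power move with imbalance, although the normal approximation used here becomes unreliable when the expected number of carriers among cases is small.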

Conflict of interest statement

Ethics approval and consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Competing interests

The authors declare that they have no conflict of interest.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Type I error simulation results with a minor allele frequency upper bound (MAF UB) of 0.01. For visualization and comparison purposes, blue and red horizontal lines indicate type I error at 0.05 and 0.1, respectively. Fig. (a) shows the type I error for an equal number of cases and controls at differing sample sizes. Note that the y-axis only goes to a type I error rate of 0.1. Fig. (b) shows the type I error rate for different unbalanced cases and controls, arranged by case-to-control ratio. The axis is labeled by the number of cases, then the number of controls, for each simulation. The percentage of cases to controls is also listed below the number of cases and controls. Figs. (c and d) show the results ordered by the number of cases: Fig. (c) has 10,000 controls and Fig. (d) has 30,000 controls.

**Fig. 2**
Power simulation results with an MAF cutoff of 0.01 for the variation evaluated. Fig. (a) shows the results when cases and controls are equal in number. Fig. (b) shows the impact of unbalanced cases and controls on power, ranked by the case/control ratio. The percent case-to-control ratio is listed below the x-axis. Figs. (c and d) show the results for power with unbalanced cases and controls, ordered by case number, with 10,000 controls (c) and 30,000 controls (d).

**Fig. 3**
Power comparison of three models with differing contributions from protective and risk rare genetic variation. The results are shown for variants contributing low, moderate, or high impact on outcome risk or protection. The Methods describe the range of odds ratios corresponding to the different categories. (a) Total sample size of 4000 for balanced cases and controls with MAF UB 0.05. (b) Total sample size of 4000 for balanced cases and controls with MAF UB 0.01. (c) 200 cases and 10,000 controls with MAF UB 0.05. (d) 200 cases and 10,000 controls with MAF UB 0.01.
