Am J Hum Genet. 2017 Jul 6;101(1):37-49.
doi: 10.1016/j.ajhg.2017.05.014. Epub 2017 Jun 8.

A Fast and Accurate Algorithm to Test for Binary Phenotypes and Its Application to PheWAS

Rounak Dey et al. Am J Hum Genet. 2017.

Abstract

The availability of electronic health record (EHR)-based phenotypes allows for genome-wide association analyses in thousands of traits and has great potential to enable identification of genetic variants associated with clinical phenotypes. We can interpret the phenome-wide association study (PheWAS) result for a single genetic variant by observing its association across a landscape of phenotypes. Because a PheWAS can test thousands of binary phenotypes, and most of them have unbalanced or often extremely unbalanced case-control ratios (1:10 or 1:600, respectively), existing methods cannot provide an accurate and scalable way to test for associations. Here, we propose a computationally fast score-test-based method that estimates the distribution of the test statistic by using the saddlepoint approximation. Our method is much (∼100 times) faster than the state-of-the-art Firth's test. It can also adjust for covariates and control type I error rates even when the case-control ratio is extremely unbalanced. Through application to PheWAS data from the Michigan Genomics Initiative, we show that the proposed method can control type I error rates while replicating previously known association signals even for traits with a very small number of cases and a large number of controls.

Keywords: GWAS; PheWAS; rare variants; saddlepoint approximation; single-variant test; unbalanced case-control.
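
The abstract describes a score test whose null distribution is estimated with a saddlepoint approximation rather than a normal approximation. The following is a minimal sketch of that general idea, not the authors' fastSPA implementation: it assumes the fitted null case probabilities `mu` are already available (so covariate adjustment is left out), uses the Barndorff-Nielsen form of the tail formula, and falls back to the normal approximation within 2 standard deviations of the mean. All function names and the `cutoff` parameter are hypothetical and chosen for illustration.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z > z) of a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tilted_prob(gi, m, t):
    """P(Y_i = 1) after exponential tilting by t, computed without overflow."""
    x = gi * t + math.log(m / (1.0 - m))  # requires 0 < m < 1
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cgf(g, mu, t):
    """K(t) = log E[exp(t * S)] for S = sum_i g_i * (Y_i - mu_i)."""
    total = 0.0
    for gi, m in zip(g, mu):
        x = gi * t
        if x > 0:  # rewrite log(1 - m + m * e^x) to avoid overflow
            total += x + math.log(m + (1.0 - m) * math.exp(-x))
        else:
            total += math.log1p(m * math.expm1(x))
        total -= x * m
    return total


def spa_tail(q, g, mu):
    """Saddlepoint approximation to P(S > q); this sketch expects q >= 0."""
    upper = sum(max(gi * (1.0 - m), -gi * m) for gi, m in zip(g, mu))
    if q >= upper:  # q is at or beyond the largest attainable score
        return 0.0

    def k1(t):  # K'(t), increasing in t, with K'(0) = 0
        return sum(gi * (tilted_prob(gi, m, t) - m) for gi, m in zip(g, mu))

    # Bracket, then bisect, the saddlepoint equation K'(t) = q.
    lo, hi = 0.0, 1.0
    while k1(hi) < q:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if k1(mid) < q:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)

    # K''(t): variance of S under the tilted distribution.
    k2 = 0.0
    for gi, m in zip(g, mu):
        p = tilted_prob(gi, m, t)
        k2 += gi * gi * p * (1.0 - p)

    w = math.sqrt(max(0.0, 2.0 * (t * q - cgf(g, mu, t))))
    v = t * math.sqrt(k2)
    if w < 1e-8:   # q is essentially at the mean
        return 0.5
    if v <= 0.0:   # saddlepoint ran off toward the boundary
        return 0.0
    return normal_sf(w + math.log(v / w) / w)


def spa_score_pvalue(g, y, mu, cutoff=2.0):
    """Two-sided p-value for the score statistic S = sum_i g_i * (y_i - mu_i)."""
    s = sum(gi * (yi - m) for gi, yi, m in zip(g, y, mu))
    var = sum(gi * gi * m * (1.0 - m) for gi, m in zip(g, mu))
    if var == 0.0:
        return 1.0
    sd = math.sqrt(var)
    if abs(s) < cutoff * sd:  # assumed cutoff: normal approximation near the mean
        return 2.0 * normal_sf(abs(s) / sd)
    neg_g = [-gi for gi in g]  # P(S < -|s|) equals P(-S > |s|)
    return min(1.0, spa_tail(abs(s), g, mu) + spa_tail(abs(s), neg_g, mu))
```

With a rare outcome (for example `mu` near 0.01) the score statistic is strongly skewed, so a symmetric normal tail understates how often large positive scores occur by chance. The saddlepoint tail uses the full cumulant generating function and therefore tracks that skew, which is the property the abstract relies on for unbalanced case-control ratios.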

Figures

Figure 1
The Projected Computation Times for Testing 10 Million Variants across 1,500 Phenotypes by Various Tests with MAFs Sampled from the MAF Distribution of the MGI Data. The computation times are based on testing 10,000 simulated variants on an Intel i7 2.70 GHz processor and then projecting them onto a PheWAS with 10 million variants and 1,500 phenotypes.
Figure 2
Type I Error Comparison between Score, fastSPA-2, and Firth’s Test for Variants Simulated with MAFs Sampled from the MAF Distribution of the MGI Data. Type I error rates were estimated on the basis of 10^9 simulated datasets. From left to right on the x axis, the plots consider case-control ratios 10,000:10,000 (balanced), 2,000:18,000 (moderately unbalanced), and 40:19,960 (extremely unbalanced). The top and bottom panels show empirical type I error rates at α = 5 × 10^-5 and 5 × 10^-8, respectively.
Figure 3
Type I Error Comparison at Different MAFs between Score, fastSPA-2, and Firth’s Test. The top and bottom panels show empirical type I error rates at α = 5 × 10^-5 and 5 × 10^-8, respectively. From left to right, the plots consider case-control ratios 10,000:10,000 (balanced), 2,000:18,000 (moderately unbalanced), and 40:19,960 (extremely unbalanced). In each plot, the x axis represents MAF with the expected minor allele count (MAC) in parentheses, and the y axis represents empirical type I error rates. Empirical type I error rates were estimated on the basis of 10^9 simulated datasets. 95% confidence intervals at different MAFs are also presented.
Figure 4
Empirical Power Curves for Score, fastSPA-2, and Firth’s Test. The top and bottom panels consider MAF = 0.05 and 0.01, respectively. From left to right, the plots consider case-control ratios 10,000:10,000 (balanced), 2,000:18,000 (moderately unbalanced), and 40:19,960 (extremely unbalanced). In each plot, the x axis represents genotype odds ratios, and the y axis represents the empirical power. Empirical power was estimated from 5,000 simulated datasets at the test-specific α levels where their empirical type I errors were equal to 5 × 10^-8.
Figure 5
Q-Q Plots for Score, fastSPA-2, SPA-2, and Firth’s Test on 5 × 10^6 Simulated Variants with MAF Randomly Sampled from the MAF Distribution of the MGI Data. The top, middle, and bottom panels show Q-Q plots in the balanced (case-control ratio = 10,000:10,000), moderately unbalanced (case-control ratio = 2,000:18,000), and extremely unbalanced (case-control ratio = 40:19,960) case-control scenarios, respectively. In each plot, the x axis represents -log10 expected p values, and the y axis represents -log10 observed p values.
Figure 6
Manhattan Plots for Four Different Phenotypes from MGI Data. All imputed variants with MAF > 0.001 and all directly genotyped variants were included in this analysis. From left to right, the three panels show associations based on fastSPA-2, Firth’s test, and Score. The red line represents the genome-wide significance level α = 5 × 10^-8.
Figure 7
Q-Q Plots for Four Different Phenotypes from MGI Data. From left to right, the three panels show the Q-Q plots based on fastSPA-2, Firth’s test, and Score. The plots are color coded according to different MAF categories. 95% confidence bands are presented in gray to signify the deviance from the uniform distribution.
