Am J Hum Genet. 2020 Aug 6;107(2):222-233.
doi: 10.1016/j.ajhg.2020.06.003. Epub 2020 Jun 25.

A Fast and Accurate Method for Genome-Wide Time-to-Event Data Analysis and Its Application to UK Biobank

Wenjian Bi et al. Am J Hum Genet.

Abstract

With increasing biobanking efforts connecting electronic health records and national registries to germline genetics, time-to-event data analysis has attracted increasing attention in genetic studies of human diseases. In time-to-event data analysis, the Cox proportional hazards (PH) regression model is one of the most widely used approaches. However, existing methods and tools do not scale to a large biobank with hundreds of thousands of samples and endpoints, and they are not accurate when testing low-frequency and rare variants. Here, we propose a scalable and accurate method, SPACox (a saddlepoint approximation implementation based on the Cox PH regression model), that is applicable to genome-wide scale time-to-event data analysis. SPACox requires fitting a Cox PH regression model only once across the genome-wide analysis and then uses a saddlepoint approximation (SPA) to calibrate the test statistics. Simulation studies show that SPACox is 76-252 times faster than other existing alternatives, such as gwasurvivr, 185-511 times faster than the standard Wald test, and more than 6,000 times faster than the Firth correction, and that it can control type I error rates at the genome-wide significance level regardless of minor allele frequencies. Through the analysis of UK Biobank inpatient data of 282,871 white British European ancestry samples, we show that SPACox can efficiently analyze large sample sizes and accurately control type I error rates. We identified 611 loci associated with time-to-event phenotypes of 12 common diseases, of which 38 loci would be missed within a logistic regression framework with a binary phenotype defined as event occurrence status during the follow-up period.
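The calibration step can be illustrated with a minimal sketch of a saddlepoint tail approximation for a score statistic S = Σ g_i R_i. This is not the authors' implementation. It assumes that the residuals R_i from a null model fitted once can be treated as independent draws from their empirical distribution, that the genotypes g_i are held fixed, and that the Lugannani-Rice formula is used for the tail probability. All function names (`cgf_terms`, `score_cgf`, `spa_upper_tail`) are hypothetical.

```python
import math
from collections import Counter


def cgf_terms(residuals, t):
    """K0(t), K0'(t), K0''(t) for the empirical distribution of the residuals."""
    shift = max(t * r for r in residuals)  # subtract the largest exponent to avoid overflow
    weights = [math.exp(t * r - shift) for r in residuals]
    total = sum(weights)
    m1 = sum(w * r for w, r in zip(weights, residuals)) / total
    m2 = sum(w * r * r for w, r in zip(weights, residuals)) / total
    return shift + math.log(total / len(residuals)), m1, m2 - m1 * m1


def score_cgf(genotypes, residuals, t):
    """K(t), K'(t), K''(t) for S = sum_i g_i * R_i with the g_i held fixed."""
    k = k1 = k2 = 0.0
    for g, count in Counter(genotypes).items():  # equal genotypes share one evaluation
        if g == 0:
            continue
        a, b, c = cgf_terms(residuals, g * t)
        k += count * a
        k1 += count * g * b
        k2 += count * g * g * c
    return k, k1, k2


def normal_sf(x):
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def spa_upper_tail(genotypes, residuals, s):
    """Lugannani-Rice approximation to P(S >= s) for s above the mean of S."""
    _, mean, _ = score_cgf(genotypes, residuals, 0.0)
    if s <= mean:
        raise ValueError("this sketch only handles s above the mean of S")
    # Bracket the saddlepoint t that solves K'(t) = s, then bisect (K' is increasing).
    lo, hi = 0.0, 1.0
    for _ in range(50):
        if score_cgf(genotypes, residuals, hi)[1] >= s:
            break
        hi *= 2.0
    else:
        raise ValueError("s is outside the attainable range of S")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if score_cgf(genotypes, residuals, mid)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    k, _, k2 = score_cgf(genotypes, residuals, t)
    w = math.sqrt(2.0 * (t * s - k))
    v = t * math.sqrt(k2)
    phi = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)
    return normal_sf(w) + phi * (1.0 / v - 1.0 / w)
```

The formula becomes numerically unstable when s is very close to the mean of S, where a normal approximation is adequate anyway. For example, `spa_upper_tail([1] * 100, [-1.0, 1.0] * 50, 49.0)` approximates the far tail of a sum of 100 random signs, a case in which the normal approximation `normal_sf(4.9)` noticeably overstates the exact binomial tail probability.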

Keywords: Cox proportional hazards regression model; GWAS; PheWAS; UK Biobank; electronic health record; saddlepoint approximation; survival analysis; time-to-event data.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Quantile-Quantile (QQ) Plots for Standardized Statistics and −log₁₀(P) of Score and Wald Tests. (A and B) Normal QQ plots for standardized statistics (A) and QQ plots for −log₁₀(P) of score and Wald tests (B). Standardized statistics were calculated as raw statistics S divided by the square root of the estimated variance of S. p values were calculated from a normal approximation. We simulated 2×10⁵ replications under three event rates (ERs) of 1%, 10%, and 50%. The sample size was 4,000, and we considered common variants (MAF = 0.3, expected MAC = 2,400) and low-frequency variants (MAF = 0.01, expected MAC = 80). MAF, minor allele frequency; MAC, minor allele count.
Figure 2
Projected Computation Time for a Genome-wide Time-to-Event Data Analysis of 20 Million Variants. The projected time is based on computation time for 10,000 variants on an Intel Xeon Platinum 8176 CPU at 2.10 GHz. For example, if it takes α h to analyze 10,000 variants, then the projected time for 20 million variants is 2,000 × α h. Solid and dashed lines represent ERs of 1% and 50%, respectively. The MAFs are randomly generated from the MAF distribution of UK Biobank, and we considered 10 covariates.
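The projection in this caption is a linear extrapolation from a timed batch. A minimal sketch of that arithmetic follows; the function name and parameter names are hypothetical, and the defaults are the batch and genome-wide sizes stated in the caption.

```python
def projected_hours(hours_per_batch, batch_variants=10_000, total_variants=20_000_000):
    """Scale the time measured on one batch of variants to the full analysis."""
    return hours_per_batch * (total_variants / batch_variants)
```

With the defaults, the scaling factor is 20,000,000 / 10,000 = 2,000, so a batch that takes half an hour projects to 1,000 hours.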
Figure 3
Empirical Type I Error Rates of SPACox, SPACox-NoSPA, Wald, Firth, and Score Tests. From left to right, the plots considered five event rates (ERs) of 0.2%, 1%, 10%, 20%, and 50%. Top and bottom plots are for empirical type I error rates at α = 5×10⁻⁵ and 5×10⁻⁸, respectively. Sample size n = 100,000. For each pair of MAF and event rate, we simulated 10⁹ replications.
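An empirical type I error rate is the fraction of null replications whose p value falls below the threshold. The sketch below, with hypothetical function names, shows that calculation and why a very large number of replications is needed: with 10⁹ null replications, a well-calibrated test at α = 5×10⁻⁸ is expected to produce only about 50 false positives.

```python
def empirical_type_i_error(p_values, alpha):
    """Fraction of null-simulation p values that fall below alpha."""
    p_values = list(p_values)
    return sum(p < alpha for p in p_values) / len(p_values)


def expected_false_positives(n_replications, alpha):
    """Expected number of rejections under the null for a calibrated test."""
    return n_replications * alpha
```

An empirical rate well above α indicates an inflated test, and a rate well below α indicates a conservative one.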
Figure 4
Empirical Powers of SPACox, Firth, Wald, Score, and SPACC Tests When γ Is Positive. From left to right, the plots considered three MAFs of 0.01, 0.05, and 0.3. From top to bottom, the plots considered five ERs of 0.2%, 1%, 10%, 20%, and 50%. Empirical powers were evaluated at the significance level 5×10⁻⁸. Sample size n = 100,000. For each pair of MAF and event rate, we simulated 1,000 replications.
Figure 5
Manhattan Plots for 12 Phenotypes from UK Biobank. Manhattan plots were based on p values calculated from the SPACox method. The red line represents the genome-wide significance level α = 5×10⁻⁸.
Figure 6
p Values of SPACC and SPACox for 38 Highlighted SNPs from UK Biobank. We highlight 38 loci that are significant on the basis of SPACox but not significant on the basis of SPACC. The red lines represent the genome-wide significance level α = 5×10⁻⁸.
