Am J Hum Genet. 2019 Dec 5;105(6):1213-1221.
doi: 10.1016/j.ajhg.2019.11.001. Epub 2019 Nov 21.

Making the Most of Clumping and Thresholding for Polygenic Scores

Florian Privé et al. Am J Hum Genet.

Abstract

Polygenic prediction has the potential to contribute to precision medicine. Clumping and thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, several p value thresholds are tested to maximize predictive ability of the derived polygenic scores. Along with this p value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123K different C+T scores for 300K individuals and 1M variants using 16 physical cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p value threshold to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T. We further propose stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to eight different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy with an average AUC increase of 0.035 over standard C+T.
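To make the idea of a C+T grid concrete, the following is a minimal Python sketch on toy data, not the authors' implementation. It assumes a greedy clumping rule without a physical window, and it tunes only two hyper-parameters: the p value threshold and a clumping r² threshold (the abstract does not name the three additional hyper-parameters, so the r² threshold is an illustrative assumption). All function names and data here are hypothetical.

```python
import itertools


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def clump(genotypes, pvalues, r2_max):
    """Greedy clumping: visit variants from most to least significant and keep
    one only if its r^2 with every already-kept variant is at most r2_max."""
    order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
    kept = []
    for j in order:
        if all(pearson_r2(genotypes[j], genotypes[k]) <= r2_max for k in kept):
            kept.append(j)
    return kept


def ct_score(genotypes, betas, pvalues, r2_max, p_max):
    """Clump, then threshold on p value, then sum effect size x allele count."""
    kept = [j for j in clump(genotypes, pvalues, r2_max) if pvalues[j] <= p_max]
    n_individuals = len(genotypes[0])
    return [sum(betas[j] * genotypes[j][i] for j in kept)
            for i in range(n_individuals)]


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties = 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def tune_ct(genotypes, betas, pvalues, labels, r2_grid, p_grid):
    """Evaluate every (r2_max, p_max) pair and return the first best triple."""
    results = []
    for r2_max, p_max in itertools.product(r2_grid, p_grid):
        scores = ct_score(genotypes, betas, pvalues, r2_max, p_max)
        results.append((auc(scores, labels), r2_max, p_max))
    return max(results, key=lambda r: r[0])


# Toy data (made up): genotypes[j][i] is the allele count of variant j in
# individual i; variant 1 is a perfect duplicate of variant 0.
genotypes = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 0, 2, 1],
]
betas = [0.5, 0.4, -0.3]        # hypothetical GWAS effect sizes
pvalues = [1e-8, 1e-3, 0.04]    # hypothetical GWAS p values
labels = [0, 0, 1, 1, 0, 1]     # 1 = case, 0 = control

best = tune_ct(genotypes, betas, pvalues, labels,
               r2_grid=[0.2, 0.5, 1.0], p_grid=[0.01, 0.05])
print(best)
```

Standard C+T keeps only the single best grid point, as `tune_ct` does here. SCT, as described in the abstract, would instead pass all grid scores as features to a penalized regression; that step is not shown in this sketch.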

Keywords: C+T; PRS; UK Biobank; clumping and thresholding; complex traits; polygenic risk scores; stacking.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Results of the Six Simulation Scenarios with Well-Imputed Variants. Scenarios are: (100) 100 random causal variants; (10K) 10,000 random causal variants; (1M) all 1M variants are causal variants; (2chr) 100 variants of chromosome 1 and all variants of chromosome 2 are causal, with half of the heritability on each chromosome; (err) 10,000 random causal variants, but 10% of the GWAS effects are reported with an opposite effect; (HLA) 7,105 causal variants in a long-range LD region of chromosome 6. Mean and 95% CI of 10,000 non-parametric bootstrap replicates of the mean AUC of 10 simulations for each scenario. The blue dotted line represents the maximum achievable AUC for these simulations (87.5% for a prevalence of 10% and a heritability of 50%; see Equation 3 of Wray et al. (ref. 30)). See corresponding values in Table S1.
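The maximum achievable AUC quoted in the caption can be checked numerically. The sketch below assumes that the cited equation is the usual liability-threshold expression for the AUC of a score that captures all of the heritability on the liability scale. The formula as written here is our reading, not a quotation from the source, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def max_auc(prevalence, heritability):
    """Assumed upper bound on AUC under the liability-threshold model when the
    score explains all of the liability-scale heritability."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - prevalence)    # liability threshold T
    height = std_normal.pdf(threshold)                # normal density at T
    mean_cases = height / prevalence                  # i: mean liability, cases
    mean_controls = -mean_cases * prevalence / (1 - prevalence)  # v: controls
    numerator = (mean_cases - mean_controls) * heritability
    variance = heritability * (
        (1 - heritability * mean_cases * (mean_cases - threshold))
        + (1 - heritability * mean_controls * (mean_controls - threshold))
    )
    return std_normal.cdf(numerator / sqrt(variance))


# Prevalence 10% and heritability 50%, as in the simulations described above.
print(round(max_auc(0.10, 0.50), 3))
```

With a prevalence of 10% and a heritability of 50%, this expression should come out close to the 87.5% stated in the caption.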
Figure 2
Results of the Real Data Applications with Large Training Size. AUC values on the test set of UKBB (mean and 95% CI from 10,000 bootstrap samples). Training SCT and choosing optimal hyper-parameters for C+T and lassosum use 63%–90% of the individuals reported in Table 1. See corresponding values in Table S2.

References

    1. Wray N.R., Goddard M.E., Visscher P.M. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528.
    2. Purcell S.M., Wray N.R., Stone J.L., Visscher P.M., O’Donovan M.C., Sullivan P.F., Sklar P., International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752.
    3. Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9:e1003348.
    4. Wray N.R., Lee S.H., Mehta D., Vinkhuyzen A.A., Dudbridge F., Middeldorp C.M. Research review: Polygenic methods and their application to psychiatric traits. J. Child Psychol. Psychiatry. 2014;55:1068–1087.
    5. Euesden J., Lewis C.M., O’Reilly P.F. PRSice: polygenic risk score software. Bioinformatics. 2015;31:1466–1468.
