Efficient Implementation of Penalized Regression for Genetic Risk Prediction

Florian Privé¹, Hugues Aschard², Michael G B Blum¹

Affiliations

¹ Laboratoire TIMC-IMAG, UMR 5525, University of Grenoble Alpes, CNRS, 38700 La Tronche, France. florian.prive@univ-grenoble-alpes.fr, michael.blum@univ-grenoble-alpes.fr
² Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, 75015 Paris, France.

PMID: 30808621
PMCID: PMC6499521
DOI: 10.1534/genetics.119.302019

Florian Privé et al. Genetics. 2019 May;212(1):65-74. doi: 10.1534/genetics.119.302019. Epub 2019 Feb 26.

Abstract

Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify individuals at high genetic risk for a given disease. The "Clumping+Thresholding" (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association study (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T, and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that the improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, values of the area under the ROC curve (AUC) increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease, where PLR and the standard C+T method achieve AUC values of 89% and 82.5%, respectively. Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in only a few minutes. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
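To make the idea of jointly estimating SNP effects concrete, here is a minimal sketch of L1-penalized logistic regression on toy genotype data, scored by AUC on a held-out set. It is an illustration under stated assumptions, not the bigstatsr implementation: plain proximal gradient descent stands in for the package's solver, the penalty `lam` is fixed by hand rather than chosen automatically, and every name (`fit_l1_logistic`, `simulate`, `auc`, the effect sizes) is hypothetical.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step of an L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """Fit L1-penalized logistic regression by proximal gradient descent.

    X: list of rows of genotype counts (0/1/2); y: list of 0/1 labels.
    Returns (intercept, weights). The intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Predicted probability minus label, for the current parameters.
        resid = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - label
            for row, label in zip(X, y)
        ]
        intercept -= step * sum(resid) / n
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            weights[j] = soft_threshold(weights[j] - step * grad_j, step * lam)
    return intercept, weights


def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def simulate(n, p, causal_effects, rng):
    """Simulate toy genotypes (0/1/2) and a binary phenotype (logistic model)."""
    X, y = [], []
    for _ in range(n):
        row = [rng.choice((0, 1, 2)) for _ in range(p)]
        liability = sum(beta * (row[j] - 1) for j, beta in causal_effects.items())
        X.append(row)
        y.append(1 if rng.random() < sigmoid(liability) else 0)
    return X, y


rng = random.Random(0)
effects = {0: 1.2, 1: -1.0}  # two causal SNPs among p = 10 (toy values)
X_train, y_train = simulate(300, 10, effects, rng)
X_test, y_test = simulate(200, 10, effects, rng)

intercept, weights = fit_l1_logistic(X_train, y_train, lam=0.05)
# The polygenic score of an individual is the weighted sum of its genotypes.
prs_test = [sum(w * x for w, x in zip(weights, row)) for row in X_test]
test_auc = auc(prs_test, y_test)
print(f"test AUC: {test_auc:.3f}")
```

All SNP weights are updated from the same residuals, which is what distinguishes this joint fit from the univariate statistics used by C+T. The L1 penalty sets small weights exactly to zero, so the resulting score uses only a subset of SNPs.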

Keywords: GenPred; LASSO; SNP; genomic prediction; polygenic risk scores; shared data resources.

Figures

**Figure 1**
Main comparison of C+T and PLR when simulating phenotypes with additive effects (scenario N°1, model “ADD”). Mean AUC over 100 simulations for PLR and the maximum AUC reported with “C+T-max” (clumping threshold at $r^{2} > 0.2$). Upper (lower) panels present results for effects following a Gaussian (Laplace) distribution, and left (right) panels present results for a heritability of 0.5 (0.8). Error bars represent $\pm 2$ SD of $10^{5}$ nonparametric bootstrap replicates of the mean AUC. The blue dotted line represents the maximum achievable AUC.

**Figure 2**
Comparison of methods when simulating phenotypes with additive effects and using chromosome six only (scenario N°2). Thinner lines represent results in scenario N°1. Mean AUC over 100 simulations for PLR and the maximum values of C+T for three different $r^{2}$ thresholds (0.05, 0.2, and 0.8) as a function of the number and location of causal SNPs. Upper (lower) panels present results for effects following a Gaussian (Laplace) distribution, and left (right) panels present results for a heritability of 0.5 (0.8). Error bars represent $\pm 2$ SD of $10^{5}$ nonparametric bootstrap replicates of the mean AUC. The blue dotted line represents the maximum achievable AUC.

**Figure 3**
Comparison of methods when simulating 300 causal SNPs with additive effects and when varying sample size (scenario N°3). Mean AUC over 100 simulations for the maximum values of C+T for three different $r^{2}$ thresholds (0.05, 0.2, and 0.8) and PLR as a function of the training size. Upper (lower) panels present results for effects following a Gaussian (Laplace) distribution, and left (right) panels present results for a heritability of 0.5 (0.8). Error bars represent $\pm 2$ SD of $10^{5}$ nonparametric bootstrap replicates of the mean AUC. The blue dotted line represents the maximum achievable AUC.
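The error bars described in the captions above can be sketched as follows. This is a rough illustration only: the AUC values are hypothetical draws standing in for 100 simulation results, the helper name `bootstrap_mean_sd` is an assumption, and $10^{4}$ replicates are used instead of $10^{5}$ to keep the run short.

```python
import random
import statistics


def bootstrap_mean_sd(values, n_boot, rng):
    """SD of the mean across nonparametric bootstrap resamples of values."""
    n = len(values)
    boot_means = [
        # Resample n values with replacement and record the resample mean.
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_boot)
    ]
    return statistics.stdev(boot_means)


rng = random.Random(1)
# Hypothetical AUC values standing in for the results of 100 simulations.
aucs = [rng.gauss(0.85, 0.02) for _ in range(100)]

mean_auc = statistics.fmean(aucs)
sd_boot = bootstrap_mean_sd(aucs, n_boot=10_000, rng=rng)
# Error bar: mean AUC plus or minus 2 bootstrap SDs.
lower, upper = mean_auc - 2 * sd_boot, mean_auc + 2 * sd_boot
print(f"mean AUC = {mean_auc:.4f}, error bar = [{lower:.4f}, {upper:.4f}]")
```

For a simple mean, the bootstrap SD should come out close to the usual standard error (sample SD divided by the square root of the number of simulations), which gives a quick sanity check.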
