Efficient Implementation of Penalized Regression for Genetic Risk Prediction
- PMID: 30808621
- PMCID: PMC6499521
- DOI: 10.1534/genetics.119.302019
Abstract
Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify individuals at high genetic risk for a given disease. The "Clumping+Thresholding" (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that the improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease, where PLR and the standard C+T method achieve AUC values of 89% and 82.5%, respectively.
Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in only a few minutes. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
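As a minimal illustration of two quantities the abstract relies on, the sketch below computes a PRS as a weighted sum of allele counts and evaluates it with the AUC, taken here as the probability that a randomly chosen case scores higher than a randomly chosen control. It is written in plain Python and is not the bigstatsr implementation; the function names, effect sizes, genotypes and labels are all hypothetical toy values chosen for illustration.

```python
from typing import Sequence


def polygenic_score(genotypes: Sequence[int], effects: Sequence[float]) -> float:
    """Weighted sum of allele counts (0, 1 or 2) by per-SNP effect sizes."""
    if len(genotypes) != len(effects):
        raise ValueError("genotypes and effects must have the same length")
    return sum(g * b for g, b in zip(genotypes, effects))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (case, control) pairs where the case scores higher; ties count 1/2."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 3 SNPs, 4 individuals (1 = case, 0 = control).
effects = [0.5, -0.25, 1.0]
genotypes = [[2, 0, 1], [0, 2, 0], [1, 1, 1], [0, 0, 0]]
labels = [1, 0, 0, 1]

scores = [polygenic_score(g, effects) for g in genotypes]
print(scores)               # per-individual PRS
print(auc(scores, labels))  # discrimination between cases and controls
```

In this sketch the effect sizes are given; C+T and PLR differ in how such weights are obtained (univariate GWAS summary statistics versus joint estimation from individual-level data), not in how the final score is formed.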
Keywords: GenPred; LASSO; SNP; genomic prediction; polygenic risk scores; shared data resources.
Copyright © 2019 Privé et al.