Sci Rep. 2023 Jul 19;13(1):11662.

doi: 10.1038/s41598-023-37580-5.

Biobank-scale methods and projections for sparse polygenic prediction from machine learning

Timothy G Raben¹, Louis Lello²,³, Erik Widen²,³, Stephen D H Hsu²,³

Affiliations

¹ Department of Physics and Astronomy, Michigan State University, Michigan, USA. rabentim@msu.edu.
² Department of Physics and Astronomy, Michigan State University, Michigan, USA.
³ Genomic Prediction, Inc., North Brunswick, NJ, USA.

PMID: 37468507
PMCID: PMC10356957
DOI: 10.1038/s41598-023-37580-5

Timothy G Raben et al. Sci Rep. 2023.

Abstract

In this paper we characterize the performance of linear models trained via widely used sparse machine learning algorithms. We build polygenic scores and examine performance as a function of training set size, genetic ancestral background, and training method. We show that predictor performance depends most strongly on the size of the training data, with smaller gains from algorithmic improvements. We find that LASSO generally performs as well as the best methods, judged by a variety of metrics. We also investigate performance characteristics of predictors trained on one genetic ancestry group when applied to another. Using LASSO, we develop a novel method for projecting AUC and correlation as a function of data size (i.e., for new biobanks) and characterize the asymptotic limit of performance. Additionally, for LASSO (compressed sensing) we show that performance metrics and predictor sparsity are in agreement with theoretical predictions from the Donoho-Tanner phase transition. Specifically, a future predictor trained in the Taiwan Precision Medicine Initiative for asthma can achieve an AUC of [Formula: see text] and for height a correlation of [Formula: see text] for a Taiwanese population. This is above the measured values of [Formula: see text] and [Formula: see text], respectively, for UK Biobank trained predictors applied to a European population.

Conflict of interest statement

The authors declare the following competing interests: SDHH is a founder, shareholder, and serves on the Board of Directors of Genomic Prediction, Inc. (GP). EW and LL are employees and shareholders of GP. TGR declares no competing interests.

Figures

**Figure 1**
Comparison of sparse methods for asthma and height predictors, alongside prediction bands for more diverse biobanks. Left: asthma predictors trained on a UKB white population. Predictors are built with LASSO, $L_1$-penalized logistic regression, elastic nets, and PRScs with UKB and 1,000 Genomes LD matrices. The specific parameters for the elastic nets and PRScs are described in Section “Methods”. Similar results for the other phenotypes can be found in the Supplementary Information.

**Figure 2**
Left: affected sibling pair (ASP) selection rate for asthma. Sibling pairs in which one sibling is a case and the other a control are used, and the rate is the fraction of pairs in which the case sibling has the higher PGS. The rate of correct selection, and its uncertainty, increase if the siblings are also required to differ by at least 1.5, 2, or 2.5 standard deviations in PGS. Right: rank order selection rate for BMI. The rate is the frequency with which the sibling with the larger BMI also has the larger PGS. Again the selection rate, and its uncertainty (due to reduced statistics), increase if the sibling BMIs are required to differ by at least 0.5, 1, or 1.5 standard deviations. Similar results for the other phenotypes are found in the Supplementary Information. These tests were developed in earlier work. More detailed descriptions of these tests and how siblings are defined can be found in Section “Methods” and in the Supplementary Information.
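
The two sibling tests in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names, the tuple layout of the input pairs, and the toy numbers are all assumptions made for illustration, and scores and phenotypes are taken to be already expressed in standard-deviation units.

```python
def asp_selection_rate(pairs, min_pgs_gap=0.0):
    """Affected sibling pair test (hypothetical helper).

    pairs: list of (case_pgs, control_pgs), one tuple per sibling pair.
    Returns (fraction of kept pairs where the case has the higher PGS,
    number of kept pairs). Pairs closer than min_pgs_gap are dropped.
    """
    kept = [(c, k) for c, k in pairs if abs(c - k) >= min_pgs_gap]
    if not kept:
        return None, 0
    correct = sum(1 for c, k in kept if c > k)
    return correct / len(kept), len(kept)


def rank_order_selection_rate(pairs, min_pheno_gap=0.0):
    """Rank order test for a continuous trait (hypothetical helper).

    pairs: list of ((pheno_a, pgs_a), (pheno_b, pgs_b)).
    A pair counts as correct when the sibling with the larger phenotype
    also has the larger PGS. Pairs whose phenotypes differ by less than
    min_pheno_gap are dropped.
    """
    kept = [(a, b) for a, b in pairs if abs(a[0] - b[0]) >= min_pheno_gap]
    if not kept:
        return None, 0
    correct = sum(1 for a, b in kept if (a[0] - b[0]) * (a[1] - b[1]) > 0)
    return correct / len(kept), len(kept)


# Toy data, invented for illustration only.
asp_pairs = [(1.2, 0.1), (0.3, 0.5), (2.0, -0.6), (-0.2, -0.1)]
bmi_pairs = [((1.0, 0.5), (0.0, 0.1)),
             ((0.2, -0.3), (0.9, 0.4)),
             ((0.1, 0.8), (0.4, 0.2))]

print(asp_selection_rate(asp_pairs))                 # all pairs
print(asp_selection_rate(asp_pairs, 1.0))            # only well-separated pairs
print(rank_order_selection_rate(bmi_pairs, 0.5))
```

Raising the gap threshold leaves fewer pairs, which is why the caption notes that the uncertainty grows along with the selection rate.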

**Figure 3**
Inclusive odds ratio (OR) for asthma. The inclusive OR is the ratio of all cases to controls *at a given PGS or above* normalized to the ratio of the total number of cases to controls. At the highest PGS bins, data are omitted if there are no cases or controls. Similar plots for the other phenotypes and details about how uncertainties are computed are in the Supplementary Information.
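
The inclusive OR defined in this caption can be written out as a short sketch. The function name, argument layout, and toy data are assumptions for illustration; returning `None` when a bin has no cases or no controls mirrors the omission rule stated in the caption, and the paper's uncertainty estimates are not reproduced here.

```python
def inclusive_odds_ratio(scores, labels, threshold):
    """Inclusive odds ratio at a PGS threshold (hypothetical helper).

    scores: PGS value per individual.
    labels: 1 for case, 0 for control, aligned with scores.
    Returns (cases/controls at or above threshold) divided by
    (total cases/total controls), or None when the ratio is undefined.
    """
    total_cases = sum(labels)
    total_controls = len(labels) - total_cases
    cases_above = sum(1 for s, y in zip(scores, labels)
                      if s >= threshold and y == 1)
    controls_above = sum(1 for s, y in zip(scores, labels)
                         if s >= threshold and y == 0)
    if 0 in (total_cases, total_controls, cases_above, controls_above):
        return None  # bin omitted: no cases or no controls
    return (cases_above / controls_above) / (total_cases / total_controls)


# Toy data, invented for illustration only.
scores = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

print(inclusive_odds_ratio(scores, labels, 1.0))
print(inclusive_odds_ratio(scores, labels, 2.0))  # no controls above: omitted
```

At the lowest threshold every individual is included, so the inclusive OR is 1 by construction.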

**Figure 4**
Growth of AUC (left: asthma) and correlation (right: BMI) as a function of training size in the UKB. Colored, curved bands come from fitting the data with various four-parameter functions. The width of each band corresponds to a confidence interval on the predictions: on the left 2 standard deviations or $\sim 68\%$, and on the right 4 standard deviations or $\sim 95\%$. Vertical bars represent projections for de novo training in other biobanks using literature prevalences, summarized in the Supplementary Information. If one assumes that a phenotype is determined by the sum of a genetic component and another *uncorrelated* random component (i.e., P = G + E), then the heritability is simply the square of the correlation between P and G. On the right, this approximation is used to convert heritability predictions from GCTA and LDSR to horizontal correlation bands.
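
The heritability-to-correlation conversion in this caption follows from P = G + E with G and E uncorrelated: then Cov(P, G) = Var(G), so corr(P, G)$^2$ = Var(G)/Var(P), which is the heritability. The sketch below checks this numerically. It is an illustration only; the function names, the Gaussian components, the unit total variance, and the sample size are all assumptions, not details from the paper.

```python
import math
import random


def heritability_to_correlation(h2):
    """Correlation between P and G implied by heritability h2 under P = G + E."""
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("h2 must lie in [0, 1]")
    return math.sqrt(h2)


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulated_squared_correlation(h2, n=20000, seed=0):
    """Simulate P = G + E with Var(G) = h2 and Var(E) = 1 - h2 (0 < h2 < 1)."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, math.sqrt(h2)) for _ in range(n)]
    e = [rng.gauss(0.0, math.sqrt(1.0 - h2)) for _ in range(n)]
    p = [gi + ei for gi, ei in zip(g, e)]
    return pearson(p, g) ** 2


# The squared correlation should land close to the chosen heritability.
print(heritability_to_correlation(0.49))
print(simulated_squared_correlation(0.4))
```

The simulated value differs from the input heritability only by sampling noise, which shrinks as `n` grows.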

**Figure 5**
Asthma active SNPs (i.e., SNPs with non-zero $\beta$ weights) as training size is increased. The left axis shows the $\beta$ value, represented by colored dots. Different colors are used to differentiate chromosomes. The right axis represents the single SNP variance (SSV) normalized to the total SSV. The solid line shows the cumulative SSV. The “training” label represents the number of cases used in training. The first 10 (from the top) training sizes use equal numbers of cases and controls. The final training size uses all possible remaining controls.

**Figure 6**
Sparsity measurements, as a function of training size, for all 11 traits. Different markers correspond to different (arbitrary) estimated heritability groupings. Different colors correspond to different versions of sparsity. Heritability here for case-control phenotypes is *broad sense* heritability reported from twin/family study literature, whereas GCTA was used to estimate heritability for continuous phenotypes. Low heritability traits (circles) include: atrial fibrillation, breast cancer, and BMI. Medium heritability traits (squares) include: CAD, hypertension, direct bilirubin, height, and lipoprotein A. High heritability traits (triangles) include: asthma and type 1/2 diabetes.

**Figure 7**
Estimates of the fraction of variance explained from *purely genetic* contributions for asthma and BMI. There are several ways to estimate the variance explained, as described in Section “Methods”. Similar plots for the other traits are found in the Supplementary Information.

Cited by

Validation of GenProb-T1D and its clinical utility for differentiating types of diabetes in a biobank from a US healthcare system.
Billings LK, Shi Z, Mulford AJ, Wei J, Tran H, Ashworth A, Zheng SL, Dunnenberger HM, Hulick PJ, Sanders AR, Xu J. J Diabetes Investig. 2025 Jan;16(1):10-15. doi: 10.1111/jdi.14297. Epub 2024 Aug 22. PMID: 39171755. Free PMC article.
Efficient blockLASSO for polygenic scores with applications to All of Us and UK Biobank.
Raben TG, Lello L, Widen E, Hsu SDH. BMC Genomics. 2025 Mar 27;26(1):302. doi: 10.1186/s12864-025-11505-0. PMID: 40148775. Free PMC article.
EndoPRS: Incorporating endophenotype information to improve polygenic risk scores for clinical endpoints-A study in asthma.
Kharitonova EV, Sun Q, Ockerman F, Chen B, Zhou LY, Hysong MR, Tuftin B, Cao H, Mathias RA, Auer PL, Ober C, Raffield LM, Reiner AP, Cox NJ, Kelada SNP, Tao R, Li Y. Am J Hum Genet. 2025 May 1;112(5):1199-1214. doi: 10.1016/j.ajhg.2025.03.008. Epub 2025 Apr 8. PMID: 40203832.
EndoPRS: Incorporating Endophenotype Information to Improve Polygenic Risk Scores for Clinical Endpoints.
Kharitonova EV, Sun Q, Ockerman F, Chen B, Zhou LY, Cao H, Mathias RA, Auer PL, Ober C, Raffield LM, Reiner AP, Cox NJ, Kelada S, Tao R, Li Y. medRxiv [Preprint]. 2024 May 24:2024.05.23.24307839. doi: 10.1101/2024.05.23.24307839. Update in: Am J Hum Genet. 2025 May 1;112(5):1199-1214. doi: 10.1016/j.ajhg.2025.03.008. PMID: 38826253. Free PMC article.
Polygenic height prediction for the Han Chinese in Taiwan.
Chang CH, Chou CY, Raben TG, Chen SA, Jong YJ, Wu JY, Yang SF, Chen HC, Chen YL, Chen M, Ma GC, Huang CY, Wang TF, Lee SL, Hung CF, Pang ST, Widen E, Chang YM, Yeh EC, Wei CY, Chen CH, Hsu SDH, Kwok PY. NPJ Genom Med. 2025 Feb 5;10(1):7. doi: 10.1038/s41525-025-00468-6. PMID: 39910149. Free PMC article.

References

1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061 (2010).
2. TOPMed. https://www.nhlbiwgs.org/.
3. UK Biobank. Available online: http://www.ukbiobank.ac.uk/. Accessed 21 March 2021.
4. Taiwan Precision Medicine Initiative. https://tpmi.ibms.sinica.edu.tw/www/en/. Accessed 01 Feb 2023.
5. All of Us Research Program Investigators. The “All of Us” research program. N. Engl. J. Med. 381, 668–676 (2019).
