Sci Rep. 2023 Jul 19;13(1):11662.
doi: 10.1038/s41598-023-37580-5.

Biobank-scale methods and projections for sparse polygenic prediction from machine learning

Timothy G Raben et al. Sci Rep. 2023.

Abstract

In this paper we characterize the performance of linear models trained via widely-used sparse machine learning algorithms. We build polygenic scores and examine performance as a function of training set size, genetic ancestral background, and training method. We show that predictor performance is most strongly dependent on size of training data, with smaller gains from algorithmic improvements. We find that LASSO generally performs as well as the best methods, judged by a variety of metrics. We also investigate performance characteristics of predictors trained on one genetic ancestry group when applied to another. Using LASSO, we develop a novel method for projecting AUC and correlation as a function of data size (i.e., for new biobanks) and characterize the asymptotic limit of performance. Additionally, for LASSO (compressed sensing) we show that performance metrics and predictor sparsity are in agreement with theoretical predictions from the Donoho-Tanner phase transition. Specifically, a future predictor trained in the Taiwan Precision Medicine Initiative for asthma can achieve an AUC of [Formula: see text] and for height a correlation of [Formula: see text] for a Taiwanese population. This is above the measured values of [Formula: see text] and [Formula: see text], respectively, for UK Biobank trained predictors applied to a European population.

Conflict of interest statement

The authors declare the following competing interests: SDHH is a founder, shareholder, and serves on the Board of Directors of Genomic Prediction, Inc. (GP). EW and LL are employees and shareholders of GP. TGR declares no competing interests.

Figures

Figure 1
Comparison of sparse methods for asthma and height predictors, together with prediction bands for more diverse biobanks. On the left are asthma predictors trained on a UKB white population. Predictors are built with LASSO, L1-penalized logistic regression, elastic nets, and PRScs with UKB and 1000 Genomes LD matrices. The specific parameters for the elastic nets and PRScs are described in Section “Methods”. Similar results for the other phenotypes can be found in the Supplementary Information.
Figure 2
Left: affected sibling pair (ASP) selection rate for asthma. Pairs of siblings in which one person is a case and the other a control are used, and the rate is the fraction of pairs in which the case sibling has the higher PGS. The rate of correct selection, and its uncertainty, increase if the siblings are also required to be separated by at least 1.5, 2, or 2.5 standard deviations in PGS. Right: rank order selection rate for BMI. The rate is the frequency with which the sibling with the larger BMI also has the larger PGS. Again the selection rate and its uncertainty (due to reduced statistics) increase if the sibling BMI is required to differ by at least 0.5, 1, or 1.5 standard deviations. Similar results for the other phenotypes are found in the Supplementary Information. These tests were developed in prior work. More detailed descriptions of these tests and of how siblings are defined can be found in Section “Methods” and in the Supplementary Information.
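The ASP selection rate described in this caption can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `asp_selection_rate`, its arguments, and the use of NumPy are all hypothetical, and the PGS standard deviation is taken as a given input.

```python
import numpy as np

def asp_selection_rate(case_pgs, control_pgs, min_gap_sd=0.0, pgs_sd=1.0):
    """Fraction of case/control sibling pairs in which the case has the higher PGS.

    Hypothetical helper: case_pgs[i] and control_pgs[i] are the scores of the
    two siblings in pair i. Only pairs whose scores differ by at least
    min_gap_sd * pgs_sd are counted. Returns (rate, number_of_pairs_used).
    """
    case_pgs = np.asarray(case_pgs, dtype=float)
    control_pgs = np.asarray(control_pgs, dtype=float)
    diff = case_pgs - control_pgs

    # Keep only pairs separated by the required number of standard deviations.
    keep = np.abs(diff) >= min_gap_sd * pgs_sd
    n_pairs = int(keep.sum())
    if n_pairs == 0:
        return float("nan"), 0

    # A pair is "correctly selected" when the case sibling has the higher score.
    rate = float(np.mean(diff[keep] > 0))
    return rate, n_pairs


# Toy data (made up for illustration only).
case = [1.0, 0.2, -0.5, 2.0]
control = [0.0, 0.5, -0.4, -1.0]
rate_all, n_all = asp_selection_rate(case, control)
rate_gap, n_gap = asp_selection_rate(case, control, min_gap_sd=1.0)
```

Requiring a larger gap leaves fewer pairs, which is why the caption notes that the uncertainty grows along with the selection rate.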
Figure 3
Inclusive odds ratio (OR) for asthma. The inclusive OR is the ratio of cases to controls at or above a given PGS, normalized to the ratio of the total number of cases to controls. At the highest PGS bins, data are omitted if there are no cases or no controls. Similar plots for the other phenotypes and details about how uncertainties are computed are located in the Supplementary Information.
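The inclusive OR defined in this caption can be written out as a short sketch. The function name `inclusive_odds_ratio`, the threshold grid, and the toy data are assumptions made for illustration; the paper's binning and uncertainty estimates are not reproduced here.

```python
import numpy as np

def inclusive_odds_ratio(pgs, is_case, thresholds):
    """Inclusive OR at each threshold (hypothetical helper).

    For a threshold t: (cases with PGS >= t / controls with PGS >= t),
    divided by (total cases / total controls). Thresholds with no cases
    or no controls above them yield NaN, mirroring the omitted bins.
    """
    pgs = np.asarray(pgs, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)

    total_cases = int(is_case.sum())
    total_controls = int((~is_case).sum())
    if total_cases == 0 or total_controls == 0:
        raise ValueError("need at least one case and one control")
    baseline = total_cases / total_controls

    ratios = []
    for t in thresholds:
        above = pgs >= t
        cases = int((above & is_case).sum())
        controls = int((above & ~is_case).sum())
        if cases == 0 or controls == 0:
            ratios.append(float("nan"))
        else:
            ratios.append((cases / controls) / baseline)
    return np.array(ratios)


# Toy data (made up): three controls and three cases.
scores = [-1, 0, 1, 2, 3, 4]
labels = [0, 0, 1, 0, 1, 1]
ors = inclusive_odds_ratio(scores, labels, thresholds=[-1, 1, 3])
```

At the lowest threshold every individual is included, so the inclusive OR equals 1 by construction.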
Figure 4
Growth of AUC (left: asthma) and correlation (right: BMI) as a function of training size in the UKB. Colored, curved bands come from fitting the data with various four-parameter functions. The width of each band corresponds to a confidence interval on the predictions: a full width of 2 standard deviations (68%) on the left and 4 standard deviations (95%) on the right. Vertical bars represent projections for de novo training in other biobanks using literature prevalences, summarized in the Supplementary Information. If one assumes that a phenotype is determined by the sum of a genetic component and another uncorrelated random component (i.e., P = G + E), then the heritability is simply the square of the correlation between P and G. On the right, this approximation is used to convert heritability predictions from GCTA and LDSR to horizontal correlation bands.
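The step from heritability to correlation used in this caption follows from the additive model in a few lines. The symbols $h^2$ and $\rho$ are introduced here for illustration and are not notation taken from the caption; the only assumption is the one stated there, namely $P = G + E$ with $\operatorname{Cov}(G, E) = 0$.

```latex
\begin{align}
\operatorname{Cov}(P, G) &= \operatorname{Var}(G) + \operatorname{Cov}(E, G) = \operatorname{Var}(G), \\
\rho(P, G) &= \frac{\operatorname{Cov}(P, G)}{\sqrt{\operatorname{Var}(P)\,\operatorname{Var}(G)}}
            = \sqrt{\frac{\operatorname{Var}(G)}{\operatorname{Var}(P)}}, \\
h^2 &\equiv \frac{\operatorname{Var}(G)}{\operatorname{Var}(P)} = \rho(P, G)^2 .
\end{align}
```

Under this model a heritability estimate $h^2$ therefore corresponds to a correlation of $\sqrt{h^2}$ between phenotype and genetic component.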
Figure 5
Asthma active SNPs (i.e., SNPs with non-zero β weights) as training size is increased. The left axis shows the β value, represented by colored dots. Different colors are used to differentiate chromosomes. The right axis represents the single SNP variance (SSV) normalized to the total SSV. The solid line shows the cumulative SSV. The “training” label represents the number of cases used in training. The first 10 (from the top) training sizes use equal numbers of cases and controls. The final training size uses all possible remaining controls.
Figure 6
Sparsity measurements, as a function of training size, for all 11 traits. Different markers correspond to different (arbitrary) estimated heritability groupings. Different colors correspond to different versions of sparsity. Heritability here for case-control phenotypes is broad sense heritability reported from twin/family study literature, whereas GCTA was used to estimate heritability for continuous phenotypes. Low heritability traits (circles) include: atrial fibrillation, breast cancer, and BMI. Medium heritability traits (squares) include: CAD, hypertension, direct bilirubin, height, and lipoprotein A. High heritability traits (triangles) include: asthma and type 1/2 diabetes.
Figure 7
Estimates of the fraction of variance explained by purely genetic contributions for asthma and BMI. There are various ways to estimate the variance explained, as described in Section “Methods”. Similar plots for the other traits are found in the Supplementary Information.

References

    1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061 (2010).
    2. TOPMed. https://www.nhlbiwgs.org/.
    3. UK Biobank. Available online: http://www.ukbiobank.ac.uk/. Accessed 21 March 2021.
    4. Taiwan Precision Medicine Initiative. https://tpmi.ibms.sinica.edu.tw/www/en/. Accessed 01 Feb 2023.
    5. All of Us Research Program Investigators. The “All of Us” research program. N. Engl. J. Med. 381, 668–676 (2019).