Nat Commun. 2019 Nov 8;10(1):5086.
doi: 10.1038/s41467-019-12653-0.

Improved polygenic prediction by Bayesian multiple regression on summary statistics

Luke R Lloyd-Jones et al. Nat Commun.

Abstract

Accurate prediction of an individual's phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. We extend a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS), SBayesR. In simulation and cross-validation using 12 real traits and 1.1 million variants on 350,000 individuals from the UK Biobank, SBayesR improves prediction accuracy relative to commonly used state-of-the-art summary statistics methods at a fraction of the computational resources. Furthermore, using summary statistics for variants from the largest GWAS meta-analysis (n ≈ 700,000) on height and BMI, we show that, on average across traits and two independent data sets, SBayesR improves prediction R² by 5.2% relative to LDpred and by 26.5% relative to clumping and p value thresholding.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Prediction accuracy for the UKB genome-wide simulation. Each panel displays boxplot summaries of the prediction R² or area under the receiver-operating characteristic curve (AUC) (y-axis) in the 10,000-individual validation data set for each method (x-axis) across the 10 replicates. The simulation study contained eight scenarios that varied in the number of causal variants, 10,000 (10k) and 50,000 (50k), and the true simulated heritability, h² = (0.1, 0.2, 0.5). The two genetic architecture scenarios generated were 10,000 causal variants sampled under the SBayesR model, that is, 2500, 5000 and 2500 variants from each of N(0, 0.01), N(0, 0.1) and N(0, 1) distributions, respectively, and 50,000 causal variants sampled from a standard normal distribution. Case–control phenotypes were generated from the liability threshold model with a simulated disease prevalence of 0.05 and the 10,000 causal variant genetic architecture. In each panel, LDpred has two boxplot summaries: one for LDpred optimised for the polygenicity parameter and one for LDpred-inf, which is displayed for comparison with SBLUP. LDpred and SBLUP were initialised with the true heritability parameter. The mean prediction accuracy across the 10 replicates is displayed above the boxplot for each method. The centre line inside the box is the median, the bottom and top of the box are the first and third quartiles, respectively (Q1 and Q3), and the lower and upper whiskers are Q1 – 1.5 IQR and Q3 + 1.5 IQR, respectively, where IQR = Q3 – Q1. The points depict the prediction accuracy for each replicate.
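The sampling scheme in the Fig. 1 caption can be illustrated with a minimal Python sketch. It is not the authors' simulation code. It assumes that the second argument of N(0, v) is a variance, and it uses standard normal draws as a hypothetical stand-in for liabilities rather than values built from genotypes and the sampled effects. The names `MIXTURE`, `sample_causal_effects` and `liability_to_case` are invented for illustration.

```python
import random
from statistics import NormalDist

# Mixture from the caption: (number of causal variants, variance).
# Reading the second argument of N(0, v) as a variance is an assumption.
MIXTURE = [(2500, 0.01), (5000, 0.1), (2500, 1.0)]


def sample_causal_effects(mixture, rng):
    """Draw one effect per causal variant from its zero-mean normal component."""
    effects = []
    for count, variance in mixture:
        sd = variance ** 0.5  # random.gauss expects a standard deviation
        effects.extend(rng.gauss(0.0, sd) for _ in range(count))
    return effects


def liability_to_case(liabilities, prevalence):
    """Label the upper `prevalence` tail of a standard normal liability as cases."""
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    return [1 if x > threshold else 0 for x in liabilities]


rng = random.Random(1)
effects = sample_causal_effects(MIXTURE, rng)

# Hypothetical liabilities: plain standard normal draws, used only to show
# how the threshold turns a continuous liability into case-control status.
liabilities = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
cases = liability_to_case(liabilities, prevalence=0.05)
print(len(effects), sum(cases) / len(cases))
```

With a prevalence of 0.05 the threshold sits near the 95th percentile of the standard normal distribution, so roughly one in twenty simulated individuals is labelled a case.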
Fig. 2
Prediction accuracy in fivefold cross-validation for 12 traits in the UK Biobank. Panel headings give the abbreviations for the 12 traits: standing height (HEIGHT, n = 347,106), male-pattern baldness (MPB, n = 125,157), basal metabolic rate (BMR, n = 341,819), heel bone mineral density T-score (hBMD, n = 197,789), forced vital capacity (FVC, n = 317,502), type 2 diabetes (T2D, n = 274,271), body mass index (BMI, n = 346,738), body fat percentage (BFP, n = 341,633), forced expiratory volume in 1 s (FEV, n = 317,502), hip circumference (HC, n = 347,231), waist-to-hip ratio (WHR, n = 347,198) and birth weight (BW, n = 197,778). Each panel shows a boxplot summary of the prediction accuracy across the five folds, with the mean across the five folds displayed above each method's boxplot. The centre line inside the box is the median, the bottom and top of the box are the first and third quartiles, respectively (Q1 and Q3), and the lower and upper whiskers are Q1 – 1.5 IQR and Q3 + 1.5 IQR, respectively, where IQR = Q3 – Q1. The points depict the prediction accuracy for each replicate. Traits are ordered by mean estimated SNP-based heritability (h²SNP; see Supplementary Fig. 16) from highest to lowest.
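The boxplot elements defined in the captions (median, Q1, Q3 and whiskers at Q1 – 1.5 IQR and Q3 + 1.5 IQR) can be computed as in the sketch below. The quartile convention used here (`method="inclusive"`) is an assumption, since plotting software differs on this point, and `box_summary` is a hypothetical helper name.

```python
from statistics import quantiles


def box_summary(values):
    """Return the boxplot elements described in the figure captions."""
    # Quartiles by linear interpolation; the exact convention is an assumption.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": q1 - 1.5 * iqr,
        "upper_whisker": q3 + 1.5 * iqr,
    }


# Toy values, not prediction accuracies from the paper.
summary = box_summary([1, 2, 3, 4, 5])
print(summary)
```

Many plotting libraries draw each whisker only as far as the most extreme data point inside these limits, so a rendered whisker can be shorter than the value computed here.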
Fig. 3
Across-biobank prediction accuracy for height and BMI. Panels depict prediction R² (y-axis) generated from regression of the predicted phenotype on the observed phenotype for BMI and height for different methods in the independent HRS and ESTB data sets. P + T refers to the prediction R² generated from the summary statistics of Yengo et al. (n ≈ 700,000), which included 6781 SNPs for BMI and 11,816 SNPs for height from a GCTA–COJO analysis thresholded at Wald test p value < 0.001. The BayesR predictions were calculated by using 1,094,841 HM3 variants estimated from the full set of unrelated and related UKB European individuals (n = 453,458 and n = 454,047 for BMI and height, respectively). Summary statistics for the SBayesR 2.8 million variant (SBayesR 2.8M) analysis for the UKB European individuals were generated by using the BOLT-LMM software. All other prediction R² results were generated by using summary statistics methodology and were calculated from the analysis of summary statistics from Yengo et al. for 909,293 and 932,969 variants for BMI and height, respectively, that overlapped with the set of 1,094,841 HM3 variants used for the UKB analyses. The overlap of the sets of variants used in each of the analyses and those available in the imputed HRS and ESTB data sets for prediction had a minimum value of 98%. Supplementary Table 1 gives further details of the results in this figure.
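For a simple linear regression with one predictor, the prediction R² equals the squared Pearson correlation between predicted and observed phenotypes. The sketch below computes it that way and also shows a percentage improvement between two methods. Reading the abstract's percentages as relative differences in R² is an assumption, and all numbers in the example are invented, not taken from the paper.

```python
def prediction_r2(predicted, observed):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor linear regression."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    sxy = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    syy = sum((o - mean_o) ** 2 for o in observed)
    return sxy * sxy / (sxx * syy)


def relative_improvement(r2_new, r2_ref):
    """Percentage gain of r2_new over r2_ref (assumed meaning of 'relative to')."""
    return 100.0 * (r2_new - r2_ref) / r2_ref


# Hypothetical toy data: a perfectly linear relationship gives R^2 of 1.
r2 = prediction_r2([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
gain = relative_improvement(0.30, 0.25)
print(r2, gain)
```

Because R² from a one-predictor regression is symmetric in the two variables, regressing predicted on observed or observed on predicted gives the same value.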

