Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores

Bjarni J Vilhjálmsson¹, Jian Yang², Hilary K Finucane³, Alexander Gusev⁴, Sara Lindström⁵, Stephan Ripke⁶, Giulio Genovese⁷, Po-Ru Loh⁴, Gaurav Bhatia⁴, Ron Do⁸, Tristan Hayeck⁴, Hong-Hee Won⁹; Schizophrenia Working Group of the Psychiatric Genomics Consortium, Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) study; Sekar Kathiresan⁹, Michele Pato¹⁰, Carlos Pato¹⁰, Rulla Tamimi¹¹, Eli Stahl¹², Noah Zaitlen¹³, Bogdan Pasaniuc¹⁴, Gillian Belbin⁸, Eimear E Kenny¹⁵, Mikkel H Schierup¹⁶, Philip De Jager¹⁷, Nikolaos A Patsopoulos¹⁷, Steve McCarroll⁷, Mark Daly¹⁸, Shaun Purcell¹², Daniel Chasman¹⁹, Benjamin Neale¹⁸, Michael Goddard²⁰, Peter M Visscher², Peter Kraft²¹, Nick Patterson²², Alkes L Price²³

PMID: 26430803
PMCID: PMC4596916
DOI: 10.1016/j.ajhg.2015.09.001

Bjarni J Vilhjálmsson et al. Am J Hum Genet. 2015 Oct 1;97(4):576-92.

Abstract

Polygenic risk scores have shown great promise in predicting complex disease risk and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics, but this discards information and can reduce predictive accuracy. We introduce LDpred, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the approach of pruning followed by thresholding, particularly at large sample sizes. Accordingly, predicted R² increased from 20.1% to 25.3% in a large schizophrenia dataset and from 9.8% to 12.0% in a large multiple sclerosis dataset. A similar relative improvement in accuracy was observed for three additional large disease datasets and for non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.
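
To make the "relative improvement" wording concrete, the short sketch below computes the relative gain implied by the rounded R² values quoted in the abstract. The helper name `relative_improvement` is an illustrative assumption, and because the inputs are rounded, the results may differ slightly from figures computed on unrounded values.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a fraction of `baseline`."""
    return (improved - baseline) / baseline


# Prediction R² values (in percent) as quoted in the abstract.
reported = {
    "schizophrenia": (20.1, 25.3),
    "multiple sclerosis": (9.8, 12.0),
}

for disease, (before, after) in reported.items():
    gain = relative_improvement(before, after)
    print(f"{disease}: {before}% -> {after}% (relative gain {gain:.1%})")
```

Both datasets show a relative gain of roughly one quarter, which is what "similar relative improvement" refers to.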

Figures

**Figure 1**
Prediction Accuracy of P+T Applied to Simulated Genotypes with and without LD

Performance of P+T (PRSs based on LD-pruned SNPs, r² < 0.2, followed by p value thresholding with an optimized threshold) when applied to simulated genotypes with and without LD. The prediction accuracy, measured as the squared correlation between the true phenotypes and the PRSs (prediction R²), is plotted as a function of the training sample size. The results are averaged over 1,000 simulated traits with 200,000 simulated genotypes, where the fraction of causal variants p was allowed to vary. In (A), the simulated genotypes are unlinked. In (B), the simulated genotypes are linked; we simulated independent batches of 100 markers while fixing the squared correlation between adjacent variants in a batch at 0.9.
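
The caption describes two steps (pruning, then thresholding) and one metric (prediction R²). The following is a minimal sketch of that pipeline on toy data, not the authors' implementation. It assumes a greedy, significance-ordered pruning rule, a fixed p value threshold rather than an optimized one, and made-up genotypes, effect sizes, and phenotypes; all function names are hypothetical.

```python
from statistics import mean


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy * sxy / (sxx * syy)


def prune_and_threshold(genotypes, pvalues, r2_max=0.2, p_max=0.05):
    """Return sorted indices of SNPs kept by a simplified P+T.

    genotypes: one list of allele counts per SNP (SNP-major layout).
    SNPs are visited from most to least significant. A SNP is kept only if
    its p value is at most p_max and its r² with every kept SNP is below r2_max.
    """
    order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
    kept = []
    for j in order:
        if pvalues[j] > p_max:
            break  # all remaining SNPs are less significant
        if all(pearson_r2(genotypes[j], genotypes[k]) < r2_max for k in kept):
            kept.append(j)
    return sorted(kept)


def polygenic_score(genotypes, betas, kept):
    """Sum of effect size times allele count over the kept SNPs, per individual."""
    n_individuals = len(genotypes[0])
    return [sum(betas[j] * genotypes[j][i] for j in kept)
            for i in range(n_individuals)]


# Toy data (assumed for illustration): 3 SNPs, 6 individuals.
genotypes = [
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 1],  # SNP 1, in strong LD with SNP 0
    [2, 0, 1, 1, 0, 2],  # SNP 2, uncorrelated with SNP 0
]
betas = [0.5, 0.4, -0.3]
pvalues = [1e-6, 1e-4, 0.01]
phenotypes = [-0.5, 0.4, 1.0, -0.2, 0.6, 0.3]

kept = prune_and_threshold(genotypes, pvalues)
scores = polygenic_score(genotypes, betas, kept)
r2 = pearson_r2(phenotypes, scores)  # prediction R²
print(f"kept SNPs: {kept}, prediction R2: {r2:.3f}")
```

In this toy example SNP 1 is dropped because of its LD with the more significant SNP 0, which illustrates how pruning discards association information that a method modeling LD could retain.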

**Figure 2**
Comparison of Four Prediction Methods Applied to Simulated Traits

Prediction accuracy of the four different methods listed in Table S1 when applied to simulated traits with WTCCC genotypes. The four subfigures correspond to p = 1 (A), p = 0.1 (B), p = 0.01 (C), and p = 0.001 (D) for the fraction of simulated causal markers with (non-zero) effect sizes sampled from a Gaussian distribution. To aid interpretation of the results, we plot the accuracy against the effective sample size, defined as $N_{eff} = (N / M_{sim}) M$, where $N$ = 10,786 is the training sample size, $M$ = 376,901 is the total number of SNPs, and $M_{sim}$ is the actual number of SNPs used in each simulation: 376,901 (all chromosomes), 112,185 (chromosomes 1–4), 61,689 (chromosomes 1 and 2), and 30,004 (chromosome 1). The effective sample size is the sample size that would maintain the same N/M ratio if all M SNPs were used.
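
As a worked illustration of the effective sample size definition in this caption, the sketch below evaluates $N_{eff} = (N / M_{sim}) M$ for the four SNP sets listed. The constants come from the caption; the function and dictionary names are assumptions made for the example.

```python
N = 10_786   # training sample size (from the caption)
M = 376_901  # total number of SNPs (from the caption)

# Number of SNPs actually used in each simulation (from the caption).
M_SIM = {
    "all chromosomes": 376_901,
    "chromosomes 1-4": 112_185,
    "chromosomes 1 and 2": 61_689,
    "chromosome 1": 30_004,
}


def effective_sample_size(n: int, m: int, m_sim: int) -> float:
    """Sample size that keeps the ratio n / m_sim unchanged when all m SNPs are used."""
    return n * m / m_sim


for label, m_sim in M_SIM.items():
    n_eff = effective_sample_size(N, M, m_sim)
    print(f"{label}: N_eff = {n_eff:,.0f}")
```

Using fewer SNPs with the same N gives a larger effective sample size, which is how a fixed cohort of 10,786 individuals can stand in for much larger training sets in the simulations.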

**Figure 3**
Comparison of Methods Applied to Seven WTCCC Disease Datasets

The prediction accuracy of different methods as estimated from 5-fold cross-validation in seven WTCCC disease datasets: type 1 diabetes (T1D), rheumatoid arthritis (RA), Crohn disease (CD), bipolar disease (BD), type 2 diabetes (T2D), hypertension (HT), and coronary artery disease (CAD). The Nagelkerke prediction R² is shown on the y axis (see Table S2 for other metrics). LDpred significantly improved the prediction accuracy for the immune-related diseases T1D, RA, and CD (see main text).
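
For readers unfamiliar with the metric on the y axis, here is a minimal sketch of the standard Nagelkerke R² formula for case-control outcomes. It assumes predicted case probabilities are already available (for example from a logistic regression on the risk score, which is not shown) and uses the sample prevalence as the null model; the cross-validation loop is omitted, and the function names and toy values are illustrative only.

```python
import math


def log_likelihood(y, probs):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted case probabilities."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p)
               for yi, p in zip(y, probs))


def nagelkerke_r2(y, probs):
    """Cox-Snell R² rescaled by its maximum attainable value."""
    n = len(y)
    prevalence = sum(y) / n
    ll_null = log_likelihood(y, [prevalence] * n)   # intercept-only model
    ll_model = log_likelihood(y, probs)             # model using the risk score
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Toy example (assumed values): two controls, two cases.
y = [0, 0, 1, 1]
uninformative = [0.5, 0.5, 0.5, 0.5]
informative = [0.1, 0.2, 0.8, 0.9]

print(f"uninformative score: {nagelkerke_r2(y, uninformative):.3f}")
print(f"informative score:   {nagelkerke_r2(y, informative):.3f}")
```

The rescaling is what lets the statistic reach 1 for a perfect classifier, which the unscaled Cox-Snell R² cannot do for binary outcomes.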

**Figure 4**
Comparison of Methods Training on Large GWAS Summary Statistics for Five Different Diseases

The prediction accuracy is shown for five different diseases: schizophrenia (SCZ), multiple sclerosis (MS), breast cancer (BC), type 2 diabetes (T2D), and coronary artery disease (CAD). The risk scores were trained with large GWAS summary-statistics datasets and used for predicting disease risk in independent validation datasets. The Nagelkerke prediction R² is shown on the y axis (see Table S5 for other metrics). Compared to LD pruning + thresholding (P+T), LDpred improved the prediction R² by 11%–25%. SCZ results are shown for the SCZ-MGS validation cohort used in recent studies, but LDpred also produced a large improvement for the independent SCZ-ISC validation cohort (Table S5).
