Tutorial: a guide to performing polygenic risk score analyses

Shing Wan Choi^{1,2}, Timothy Shin-Heng Mak³, Paul F O'Reilly^{4,5}

Affiliations

¹ MRC Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
² Department of Genetics and Genomic Sciences, Icahn School of Medicine, Mount Sinai, New York, NY, USA.
³ Centre of Genomic Sciences, University of Hong Kong, Hong Kong, China.
⁴ MRC Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK. paul.oreilly@mssm.edu.
⁵ Department of Genetics and Genomic Sciences, Icahn School of Medicine, Mount Sinai, New York, NY, USA. paul.oreilly@mssm.edu.

PMID: 32709988
PMCID: PMC7612115
DOI: 10.1038/s41596-020-0353-1

Review

Shing Wan Choi et al. Nat Protoc. 2020 Sep;15(9):2759-2772. doi: 10.1038/s41596-020-0353-1. Epub 2020 Jul 24.

Abstract

A polygenic score (PGS) or polygenic risk score (PRS) is an estimate of an individual's genetic liability to a trait or disease, calculated according to their genotype profile and relevant genome-wide association study (GWAS) data. While present PRSs typically explain only a small fraction of trait variance, their correlation with the single largest contributor to phenotypic variation, genetic liability, has led to the routine application of PRSs across biomedical research. Among a range of applications, PRSs are exploited to assess shared etiology between phenotypes, to evaluate the clinical utility of genetic data for complex disease and as part of experimental studies that compare outcomes (e.g., gene expression and cellular response to treatment) between individuals with low and high PRS values. As GWAS sample sizes increase and PRSs become more powerful, PRSs are set to play a key role in research and stratified medicine. However, despite the importance and growing application of PRSs, there are limited guidelines for performing PRS analyses, which can lead to inconsistency between studies and misinterpretation of results. Here, we provide detailed guidelines for performing and interpreting PRS analyses. We outline standard quality control steps, discuss different methods for the calculation of PRSs, provide an introductory online tutorial, highlight common misconceptions relating to PRS results, offer recommendations for best practice and discuss future challenges.
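As a rough illustration of the definition in the abstract, a PRS in its simplest form can be sketched as a weighted sum of an individual's effect-allele counts, with GWAS effect size estimates as the weights. The SNP identifiers, effect sizes and the function name `polygenic_score` below are hypothetical and are not taken from the paper or its tutorial.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum effect-allele dosages (0, 1 or 2) weighted by GWAS effect sizes.

    Only SNPs present in both the genotype profile and the GWAS summary
    statistics contribute to the score.
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[snp] * weights[snp] for snp in shared)


# Hypothetical base data: per-SNP effect size estimates from a GWAS.
gwas_effects = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}
# Hypothetical target data: one individual's effect-allele counts.
person = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

score = polygenic_score(person, gwas_effects)
print(f"PRS = {score:.2f}")
```

Real analyses additionally need the quality control, allele matching and LD handling that the tutorial covers; this sketch assumes those steps have already been done.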

Figures

**Figure 1**
The Polygenic Risk Score (PRS) analysis process. PRS can be defined by their use of base and target data, as in Section 1. Quality control of both data sets is described in Section 2, while the different approaches to calculating PRS (e.g. LD adjustment via clumping, beta shrinkage using lasso regression, P-value thresholding) are summarised in Section 3. Issues relating to exploiting PRS for association analyses to test hypotheses, including interpretation of results and avoidance of overfitting to the data, are detailed in Section 4.
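The clumping and P-value thresholding named in the Figure 1 caption can be sketched as a greedy selection: SNPs above a P-value cut-off are discarded, and the rest are visited from most to least significant, keeping a SNP only if it is not in strong LD with one already kept. This is a simplified sketch under stated assumptions, not the procedure of any particular software. The inputs, the default thresholds and the function name `clump_and_threshold` are hypothetical, and the physical distance window used by real clumping tools is ignored.

```python
from typing import Dict, List, Tuple


def clump_and_threshold(
    p_values: Dict[str, float],
    ld_r2: Dict[Tuple[str, str], float],
    p_max: float = 0.05,
    r2_max: float = 0.1,
) -> List[str]:
    """Greedy clumping after P-value thresholding (toy version)."""
    # P-value thresholding: drop SNPs above the cut-off, then order the
    # survivors from most to least significant.
    candidates = sorted(
        (snp for snp, p in p_values.items() if p <= p_max), key=p_values.get
    )
    kept: List[str] = []
    for snp in candidates:
        # Pairs missing from ld_r2 are treated as uncorrelated (r2 = 0),
        # which is an assumption of this toy example.
        in_ld = any(
            ld_r2.get((snp, k), ld_r2.get((k, snp), 0.0)) > r2_max for k in kept
        )
        if not in_ld:
            kept.append(snp)
    return kept


# Hypothetical summary statistics and pairwise LD (r2) values.
p_values = {"rs1": 1e-8, "rs2": 1e-6, "rs3": 0.01, "rs4": 0.2}
ld_r2 = {("rs1", "rs2"): 0.8, ("rs2", "rs3"): 0.05}

selected = clump_and_threshold(p_values, ld_r2)
print(selected)
```

Here rs4 fails the P-value threshold and rs2 is removed because it is in high LD with the more significant rs1.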

**Figure 2**
Flow chart of suggested analytical steps for performing quality control and selecting software for PRS analyses. GenomicSEM [48] and MTAG [49] are software that allow joint analysis of summary statistics from GWAS of different complex traits and can help to boost power; common PRS software includes (but is not limited to) PRSice [13,14], LDpred [45], PRS-CS [20], JAMPred [46] and lassosum [19]; PLINK [33,34] and bigsnpr [50] can be used to implement custom pipelines; and MultiPRS [51] is a method for performing PRS analyses on admixed populations.

**Figure 3**
Illustration of major sources of inflation/deflation of PRS-trait associations. If the target data differ markedly from the base data in terms of allele frequencies, linkage disequilibrium, the environment, selection pressures, etc., then the PRS-trait association will likely be deflated relative to what it would have been had the target sample been well matched to the base data (note that relative *inflation* is possible here if the trait has greater heritability in the target sample than in the base sample [62]). Correlation between genetic population structure and the environment can inflate PRS-trait associations unless fully controlled for. This inflation can be exacerbated by a household effect in which parents produce an environment reflecting their genetic tendencies [56], known as *passive* gene-environment correlation [63]. This figure illustrates in simple form some of the broad major influences on PRS-trait associations and their typical effects; it is not intended to capture the many nuances and exceptions involved or other important effects such as *evocative* or *active* gene-environment correlations [63].

**Figure 4**
Results from a simulation study comparing Nagelkerke pseudo-R² with the pseudo-R² proposed by Lee et al. [75] that incorporates adjustment for the sample case/control ratio. In the simulation, 2,000,000 samples were simulated to have a normally distributed phenotype, generated by a normally distributed predictor (e.g. a PRS) explaining a varying fraction of phenotypic variance, with a residual error term to model all other effects. Case/control status was then simulated under the liability threshold model according to a specified prevalence. Then, 5,000 cases and 5,000 controls were randomly selected from the population, and the R² of the original continuous data (Empirical R²), estimated by linear regression, was compared to both the Nagelkerke R² (discs) and the Lee R² (triangles) based on the corresponding case/control data by logistic regression.
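The simulation design in the Figure 4 caption can be sketched on a smaller scale. The sketch below is not the authors' code and simplifies in two ways: it uses a population of 200,000 rather than 2,000,000, and it computes the observed-scale R² from a linear model on 0/1 case status rather than fitting a logistic regression. It then applies the liability-scale conversion as commonly stated for the Lee et al. approach, which is written here from memory and should be checked against the original publication. All function names, parameter values and the random seed are assumptions.

```python
import random
from statistics import NormalDist, mean
from typing import Tuple


def lee_r2(r2_observed: float, prevalence: float, case_fraction: float) -> float:
    """Convert an observed-scale R² from case/control data to the liability scale."""
    k, p = prevalence, case_fraction
    norm = NormalDist()
    t = norm.inv_cdf(1 - k)  # liability threshold for the given prevalence
    z = norm.pdf(t)          # normal density at the threshold
    m = z / k                # mean liability of cases
    c = (k * (1 - k)) ** 2 / (z ** 2 * p * (1 - p))
    shift = m * (p - k) / (1 - k)
    theta = shift * (shift - t)
    return r2_observed * c / (1 + r2_observed * theta * c)


def simulate(true_r2: float, prevalence: float, n_pop: int = 200_000,
             n_each: int = 5_000, seed: int = 1) -> Tuple[float, float]:
    """Simulate a liability threshold trait and sample equal cases and controls."""
    rng = random.Random(seed)
    t = NormalDist().inv_cdf(1 - prevalence)
    cases, controls = [], []
    for _ in range(n_pop):
        prs = rng.gauss(0, 1)  # standardised predictor, e.g. a PRS
        noise = rng.gauss(0, 1)  # residual term for all other effects
        liability = true_r2 ** 0.5 * prs + (1 - true_r2) ** 0.5 * noise
        (cases if liability > t else controls).append(prs)
    x = rng.sample(cases, n_each) + rng.sample(controls, n_each)
    y = [1.0] * n_each + [0.0] * n_each
    # Squared Pearson correlation equals R² of a one-predictor linear model.
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r2_obs = sxy ** 2 / (sxx * syy)
    return r2_obs, lee_r2(r2_obs, prevalence, case_fraction=0.5)


r2_obs, r2_liab = simulate(true_r2=0.1, prevalence=0.1)
print(f"observed-scale R2 = {r2_obs:.3f}, liability-scale R2 = {r2_liab:.3f}")
```

Under these settings the liability-scale estimate should land close to the simulated value of 0.1, whereas the observed-scale R² depends on the case/control ratio of the sample.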

**Figure 5**
Three different ways of representing the same data. The data correspond to Body Mass Index (BMI) PRS calculated in 388,155 individuals in the UK Biobank data, derived using GIANT BMI GWAS as base data. (a) is a quantile plot with 20 quantiles of increasing BMI PRS vs. mean BMI (Y-axis), (b) is a strata plot with unequal strata of increasing BMI PRS vs. prevalence (%) of severe obesity (BMI > 40), and (c) is a strata plot with the same strata as in (b), but here each individual's BMI value is shown on the Y-axis. Lateral spread within each stratum is used to make individual points visible, and red points correspond to individuals with severe obesity. Qualitatively similar patterns should be expected for PRS corresponding to all reasonably heritable continuous or binary traits, with strength of patterns dependent on the predictive power of the PRS (here the PRS explains ~5% of BMI). BMI here could be considered analogous to the liability underlying a disease in the liability threshold model, and in this way plot (c) may be helpful in imagining the uncertainty in the true liability that underlies a PRS value for a disease.
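The data behind a quantile plot such as panel (a) can be sketched as follows: individuals are ordered by PRS, split into equal-sized bins, and the mean phenotype is computed per bin. This is a minimal sketch on made-up numbers, not the UK Biobank analysis, and the function name `quantile_means` is hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def quantile_means(prs: Sequence[float], phenotype: Sequence[float],
                   n_quantiles: int = 20) -> List[float]:
    """Mean phenotype within equal-sized bins of increasing PRS."""
    if len(prs) != len(phenotype):
        raise ValueError("prs and phenotype must have the same length")
    if len(prs) < n_quantiles:
        raise ValueError("need at least one individual per quantile")
    ordered = sorted(zip(prs, phenotype))  # order individuals by PRS
    n = len(ordered)
    means = []
    for q in range(n_quantiles):
        lo, hi = q * n // n_quantiles, (q + 1) * n // n_quantiles
        means.append(mean([value for _, value in ordered[lo:hi]]))
    return means


# Toy data: the phenotype rises linearly with the PRS.
toy_prs = list(range(100))
toy_phenotype = [2 * score for score in toy_prs]

bins = quantile_means(toy_prs, toy_phenotype, n_quantiles=4)
print(bins)
```

Plotting these bin means against the quantile index gives the shape shown in a quantile plot; with real data the trend is much noisier at the individual level, as panel (c) illustrates.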

**Figure 6**
Examples of performance of PRS analyses on real data by validation sample size, according to (a) phenotypic variance explained (R²) and (b) association P-value. UK Biobank data on Height (estimated heritability h² = 0.49 [8]), Forced Vital Capacity (FVC) (estimated heritability h² = 0.23 [8]) and Hand Grip (estimated heritability h² = 0.11 [8]) were randomly split into two sets of 100,000 individuals and used as base and target data, while the remaining sample was used as validation data of varying sample sizes, from 50 to 3,000 individuals. Each analysis was repeated 40 times with independently selected validation samples. While these results correspond to performance in validation data, the association P-values should reflect empirical P-values estimated from target data (as described in Section 4.7).

References

1. Locke AE, Kahali B, Berndt SI, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518:197–206.
2. Kunkle BW, Grenier-Boley B, Sims R, et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat Genet. 2019;51:414.
3. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511:421–427. [One of the first papers to highlight striking differential risk across quantiles of PRS.]
4. Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of heritability for human height. Nat Genet. 2010;42:565–569.
5. Dudbridge F. Power and Predictive Accuracy of Polygenic Risk Scores. PLOS Genet. 2013;9:e1003348. [A key theoretical PRS paper, providing the first analytical predictions of the performance of PRS analyses on real data. Formulae for computing expectations were derived according to factors such as trait heritability, base and target sample size, and polygenicity, assuming a quantitative genetics model.]
