2022 Jan 6;109(1):12-23.
doi: 10.1016/j.ajhg.2021.11.008.

Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort

Florian Privé et al. Am J Hum Genet.

Erratum in

Abstract

The low portability of polygenic scores (PGSs) across global populations is a major concern that must be addressed before PGSs can be used for everyone in the clinic. Indeed, prediction accuracy has been shown to decay as a function of the genetic distance between the training and test cohorts. However, such cohorts differ not only in their genetic distance but also in their geographical distance and their data collection and assaying, conflating multiple factors. In this study, we examine the extent to which PGSs are transferable between ancestries by deriving polygenic scores for 245 curated traits from the UK Biobank data and applying them in nine ancestry groups from the same cohort. By restricting both training and testing to the UK Biobank data, we reduce the risk of environmental and genotyping confounding from using different cohorts. We define the nine ancestry groups at a sub-continental level, based on a simple, robust, and effective method that we introduce here. We then apply two different predictive methods to derive polygenic scores for all 245 phenotypes and show a systematic and dramatic reduction in portability of PGSs trained using Northwestern European individuals and applied to nine ancestry groups. These analyses demonstrate that prediction already drops off within European ancestries and reduces globally in proportion to genetic distance. Altogether, our study provides unique and robust insights into the PGS portability problem.

Keywords: ancestry; polygenic scores; portability.

Conflict of interest statement

Declaration of interests: S.C. is a paid consultant to MyHeritage. The other authors declare no competing interests.

Figures

Figure 1
The first eight PC scores of the UK Biobank (field 22009), colored by the homogeneous ancestry group we infer for these individuals. Only 50,000 randomly sampled individuals are shown. “NA” means that the corresponding individual is not categorized in any of the nine ancestry groups.
Figure 2
Partial correlation and 95% CI in the UK test set versus in a test set from another ancestry group. Each point represents a phenotype; training was performed with penalized regression on UK individuals (training 1 in Table 1) and HapMap3 variants. The slope (in blue) is computed using Deming regression, accounting for standard errors in both x and y and fixing the intercept at 0. The square of this slope, shown above each plot, is reported as the predictive performance relative to testing in the “United Kingdom” ancestry group.
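The slope described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes one common formulation of errors-in-variables regression through the origin: each phenotype contributes a squared residual divided by its combined variance, and the slope minimizing that sum is found by golden-section search. The function names, the search bracket [0, 2], and the choice of optimizer are assumptions made for illustration and are not necessarily what the authors' software does.

```python
import math

def deming_slope_through_origin(x, y, se_x, se_y, lo=0.0, hi=2.0, tol=1e-10):
    """Slope b of y = b * x (intercept fixed at 0), with errors in both x and y.

    Assumption: the loss is sum((y_i - b*x_i)^2 / (se_y_i^2 + b^2 * se_x_i^2)),
    and it has a single minimum inside the hypothetical bracket [lo, hi].
    All se_y values must be positive.
    """
    def loss(b):
        return sum(
            (yi - b * xi) ** 2 / (syi ** 2 + (b * sxi) ** 2)
            for xi, yi, sxi, syi in zip(x, y, se_x, se_y)
        )

    inv_phi = (math.sqrt(5) - 1) / 2  # golden-ratio step, about 0.618
    a, c = lo, hi
    while c - a > tol:
        m1 = c - inv_phi * (c - a)
        m2 = a + inv_phi * (c - a)
        if loss(m1) < loss(m2):
            c = m2  # minimum lies in [a, m2]
        else:
            a = m1  # minimum lies in [m1, c]
    return (a + c) / 2

def relative_performance(x, y, se_x, se_y):
    """Square of the slope, read as performance relative to the reference group."""
    return deming_slope_through_origin(x, y, se_x, se_y) ** 2
```

Here x would hold the partial correlations in the UK test set, y those in the other ancestry group, and se_x and se_y their standard errors, with one entry per phenotype.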
Figure 3
Relative variance explained compared to the UK versus PC distance from the UK. PC distances are computed as the Euclidean distance between the geometric medians of the first 16 reported PC scores (field 22009) of each ancestry group. Relative performance values are those reported in Figure 2. The slope and standard errors are computed internally by the function geom_smooth(method = "lm") of the R package ggplot2.
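The PC distance in the Figure 3 caption can be sketched as follows. The geometric median is approximated here with Weiszfeld's iterative algorithm, which is an assumption: the caption does not say which algorithm was used. The function names, iteration count, and epsilon guard are likewise illustrative.

```python
import math

def geometric_median(points, n_iter=200, eps=1e-12):
    """Approximate the geometric median of a list of equal-length coordinate tuples.

    Uses Weiszfeld's algorithm (an assumed choice), starting from the centroid.
    """
    dim = len(points[0])
    est = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(n_iter):
        # Weight each point by the inverse of its distance to the current estimate;
        # eps guards against division by zero when the estimate hits a data point.
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        est = [sum(w * p[k] for w, p in zip(weights, points)) / total
               for k in range(dim)]
    return est

def pc_distance(group_a, group_b):
    """Euclidean distance between the geometric medians of two groups of PC scores."""
    return math.dist(geometric_median(group_a), geometric_median(group_b))
```

In the setting of the caption, each point would be the vector of the first 16 PC scores for one individual, and group_a would be the UK group.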
Figure 4
Zoomed Manhattan plot for lipoprotein(a) concentration. The phenotypic variance explained per variant is computed as $r^2 = t^2 / (n + t^2)$, where t is the t-score from GWAS and n is the degrees of freedom (the sample size minus the number of variables in the model, i.e., the covariates used in the GWAS, the intercept, and the variant). The GWAS includes all variants with an imputation INFO score larger than 0.3 and within a 500 kb radius around the top hit (represented by a vertical dotted line) from the GWAS performed in the UK training set and on the HapMap3 variants.
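The per-variant formula in the Figure 4 caption is simple enough to work through directly. The sketch below follows the caption's definitions; the function and parameter names are hypothetical.

```python
def degrees_of_freedom(sample_size, n_covariates):
    """Sample size minus the number of model variables.

    The model variables are the covariates plus the intercept and the variant,
    as described in the caption.
    """
    return sample_size - (n_covariates + 2)

def variance_explained(t, n):
    """Phenotypic variance explained by one variant: r^2 = t^2 / (n + t^2)."""
    return t ** 2 / (n + t ** 2)
```

For example, a t-score of 2 with 96 degrees of freedom gives 4 / (96 + 4) = 0.04, that is, 4% of phenotypic variance.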
Figure 5
Predictive performance with LDpred2-auto for eight phenotypes, when using either HapMap3 variants or the 1M most significant variants. One phenotype is shown in each panel. Bars represent the 95% confidence intervals. Phecode 174.1: breast cancer; 185: prostate cancer; 411.4: coronary artery disease. HM3, HapMap3; top1M, the 1M most significant variants out of more than 8M common variants (see Material and methods).
