BMC Genom Data. 2023 Sep 14;24(1):52.
doi: 10.1186/s12863-023-01151-4.

Overestimated prediction using polygenic prediction derived from summary statistics

David Keetae Park et al. BMC Genom Data. 2023.

Abstract

Background: When a polygenic risk score (PRS) is derived from summary statistics, independence between the discovery and test sets cannot be verified. We compared two types of PRS: one derived from raw genetic data (denoted as rPRS) and one derived from summary statistics, here those of IGAP (denoted as sPRS).

Results: Two highly heritable variables in UK Biobank, hypertension and height, are used to derive an exemplary scale effect of PRS. sPRS without APOE, derived from the International Genomics of Alzheimer's Project (IGAP), records ΔAUC and ΔR² of 0.051 ± 0.013 and 0.063 ± 0.015 for the Alzheimer's Disease Sequencing Project (ADSP) and 0.060 and 0.086 for the Accelerating Medicine Partnership - Alzheimer's Disease (AMP-AD). On UK Biobank, rPRS performance for hypertension, assuming a similar size of discovery and test sets, is 0.0036 ± 0.0027 (ΔAUC) and 0.0032 ± 0.0028 (ΔR²). For height, ΔR² is 0.029 ± 0.0037.
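
As a rough illustration of how a ΔAUC like those reported above could be computed, the sketch below takes the Fig. 2 caption's description (the additive gain from introducing the PRS term to a baseline model) and applies it to precomputed prediction scores. The function names, the toy labels, and the toy scores are all hypothetical; how the baseline model is fitted is not specified here, so both score lists are assumed to be given.

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen case
    scores higher than a randomly chosen control (ties count as 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def delta_auc(labels, baseline_scores, prs_scores):
    """Additive AUC gain of the model with the PRS term over the baseline."""
    return auc(labels, prs_scores) - auc(labels, baseline_scores)


# Hypothetical toy data: 1 = case, 0 = control.
labels = [1, 1, 0, 0]
baseline = [0.6, 0.4, 0.5, 0.3]   # scores from the model without PRS (assumed)
with_prs = [0.7, 0.6, 0.5, 0.3]   # scores from the model with PRS (assumed)
gain = delta_auc(labels, baseline, with_prs)
```

On real data the two score lists would come from models evaluated on the same held-out test subjects, so that the difference reflects only the added PRS term.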

Conclusion: Given the high heritability of hypertension and height and the large sample size of UK Biobank, the sPRS results from the AD databases are inflated. Independence between discovery and test sets is a well-known basic requirement for PRS studies. However, many PRS studies cannot verify this requirement, because subjects cannot be compared directly when only summary statistics are available. Thus, for sPRS, potential subject duplication within the same ethnic group should be carefully considered.
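
When raw data are available for both sets (the rPRS case), the independence requirement can be checked directly. The following is a minimal sketch, assuming both cohorts expose subject identifiers that are comparable across datasets; that assumption often fails in practice, and it is never available when only summary statistics are provided, which is the situation the conclusion warns about. The function name and identifiers are hypothetical.

```python
def overlap_report(discovery_ids, test_ids):
    """Summarize subject overlap between a discovery set and a test set.

    Assumes identifiers are directly comparable across the two cohorts.
    """
    discovery = set(discovery_ids)
    test = set(test_ids)
    shared = discovery & test
    return {
        "n_shared": len(shared),
        "fraction_of_test": len(shared) / len(test) if test else 0.0,
        # Test subjects that remain once overlapping subjects are dropped.
        "independent_test_ids": sorted(test - shared),
    }


# Hypothetical identifiers for illustration only.
report = overlap_report(["s1", "s2", "s3", "s4"], ["s3", "s4", "s5"])
```

Evaluating the PRS only on `independent_test_ids` is one way to restore independence between the two sets in this setting.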

Keywords: Alzheimer’s disease; Complex genetic disease; Overestimation bias; Polygenic risk score.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Overview of the study. (A) (i) Overlapping subjects are observed between AD genetic initiatives. (ii) No subjects overlap across ethnicities. Until now, trans-ethnic applications of PRS have been limited. We suspect that subject overlap within an ethnicity is one of the key factors explaining overestimated performance, which motivates this study. We divide PRS into two cases: rPRS, where the raw genetic information is provided and used as the discovery set, and sPRS, where the GWAS has already been conducted and only summary statistics are provided. (B) For rPRS, overlapping subjects (n = 432) between ADSP and AMP-AD are identified; this overlap breaks the independence assumption and causes the overestimation bias. For sPRS, the overlap ratio cannot be examined from summary statistics alone. However, the suspected inflation in AD prediction performance (denoted by sPRS - rPRS) motivates further analysis of the scale effect of the datasets, because IGAP has a larger number of samples. (C) (i) Two new variables, hypertension and height, from the UK Biobank database are introduced to compute the upper bounds of the scale effect. Hypertension and height have a higher heritability than AD. Thus, they act as upper bounds on PRS performance for AD (shown in the QQ plot). (ii) In AD, the gap between sPRS and rPRS (area shaded in green) is attributable to either the overestimation bias or the scale effect of the discovery set's sample size. Because UK Biobank consists of a larger number of samples (n = 342,318), the scale effect can be measured by computing the performance gain per sample unit. Cohort case counts and their percentages of the total were as follows: ADSP had 5687 (55.2%), AMP-AD had 696 (61.4%), IGAP had 17,008 (31.4%), and UK Biobank had 82,719 (24.2%).
Fig. 2
PRS performance comparisons for Alzheimer’s disease. ΔAUC and ΔR² denote the additive gain from introducing the PRS term to Model II (refer to Materials and Methods for details). For convenience, we abbreviate the discovery and test sets as D and T, respectively. (A) AD prediction performance with and without subject overlap (D: ADSP, T: AMP-AD). All metrics are overestimated when subjects overlap, and the overestimation grows with the number of SNPs. (B) sPRS (D: IGAP, T: ADSP) is compared to rPRS (D: ADSP, T: ADSP). (C) AMP-AD data is another T for rPRS (D: ADSP) and sPRS (D: IGAP). D and T of ADSP data are derived from tenfold cross-validation. In both (B) and (C), sPRS performance is significantly higher than rPRS performance, and we suspect that some participants of IGAP are identical to a subset of ADSP or AMP-AD. (D) A simulation study is conducted with rPRS (D: ADSP, T: AMP-AD), in which a subset of D replaces a growing number of subjects in T (see Results for details). The number of SNPs on the x-axis denotes the number of LD-pruned SNPs, selected in order from the lowest P-value threshold. That is, a lower number of SNPs on the left side corresponds to a stricter P-value threshold, and the right-most side corresponds to the most lenient P-value threshold (P < 0.5).
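
The x-axis described in this caption, where more SNPs enter the score as the P-value threshold is relaxed, can be illustrated with a generic P-value thresholding score. This is a minimal sketch of the standard weighted-sum form (effect size times allele count), not the authors' pipeline: the SNP identifiers, effect sizes, and P-values are hypothetical, the SNP list is assumed to be LD-pruned already, and missing genotypes are treated as zero for simplicity.

```python
def prs_at_threshold(snps, dosages, p_threshold):
    """Weighted-sum PRS over SNPs whose P-value is below the threshold.

    snps:     list of (snp_id, effect_size, p_value), assumed LD-pruned.
    dosages:  dict mapping snp_id to effect-allele count (0, 1, or 2).
    Returns (score, number_of_snps_used).
    """
    selected = [(sid, beta) for sid, beta, p in snps if p < p_threshold]
    # Missing genotypes contribute 0 here; a real pipeline would impute them.
    score = sum(beta * dosages.get(sid, 0) for sid, beta in selected)
    return score, len(selected)


# Hypothetical summary statistics and one subject's genotypes.
snps = [("snp_a", 0.2, 1e-8), ("snp_b", -0.1, 0.01), ("snp_c", 0.05, 0.4)]
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 1}

strict_score, strict_n = prs_at_threshold(snps, dosages, 0.05)
lenient_score, lenient_n = prs_at_threshold(snps, dosages, 0.5)
```

Moving from the strict to the lenient threshold adds `snp_c` to the score, which mirrors moving from left to right along the x-axis.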
Fig. 3
PRS performance comparisons via UK Biobank. In this study, UK Biobank’s primary purpose is to evaluate the scale effect, defined as the marginal gain in performance due to the size of the discovery set. To this end, two variables representative of high heritability, namely hypertension and height, are analyzed. For experimental purposes, we intentionally design three discovery sets of different sizes, 300k, 60k, and 9k, which approximately correspond to the discovery set sizes of the full UK Biobank dataset, IGAP, and ADSP, respectively. For convenience, we abbreviate the discovery and test sets as D and T. ΔAUC and ΔR² denote the additive gain from introducing the PRS term to Model II (refer to Materials and Methods for details). (A) A larger D results in higher prediction performance (ΔAUC and ΔR²), demonstrating the scale effect as hypothesized. However, across the three sample sizes, using a smaller subset of T rarely degrades ΔAUC or ΔR², although it does affect the significance level P, as might be expected. As its higher heritability (Fig. 1C) foretells, height showed a greater impact on the prediction model than hypertension, as indicated by higher ΔR² and –log(P). (B) When the number of SNPs varies with 100% of T used, most metrics improve until 50k SNPs are used and then plateau. The number of SNPs on the x-axis denotes the number of LD-pruned SNPs, selected in order from the lowest P-value threshold. That is, a lower number of SNPs on the left side corresponds to a stricter P-value threshold, and the right-most side corresponds to the most lenient P-value threshold (P < 0.5). (C) Although the size of D, with 100% of T used, shows a linear correlation with PRS performance, confirming the hypothesized scale effect, the improvements are not dramatic. For instance, ΔR² increases by approximately 0.0000125 and 0.0000083 per 3k of D.
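
To make the size of the scale effect concrete, the per-3k gains quoted in panel (C) can be extrapolated between the discovery-set sizes that the caption associates with ADSP (about 9k) and IGAP (about 60k). The sketch below assumes the gain stays linear across that range, as the caption's linear correlation suggests; the function name is hypothetical, and the result is only an order-of-magnitude illustration because the rates come from UK Biobank traits rather than AD.

```python
def expected_scale_gain(gain_per_3k, d_small, d_large):
    """Linear extrapolation of the performance gain attributable to
    enlarging the discovery set from d_small to d_large subjects."""
    units = (d_large - d_small) / 3000  # number of 3k-subject steps
    return gain_per_3k * units


# Per-3k gains in ΔR² quoted in the caption; sizes are the caption's
# approximate discovery-set sizes for ADSP (9k) and IGAP (60k).
gain_first = expected_scale_gain(0.0000125, 9_000, 60_000)
gain_second = expected_scale_gain(0.0000083, 9_000, 60_000)
```

Under these assumptions both extrapolated gains are on the order of 0.0001 to 0.0002, far below the sPRS ΔR² values of 0.063 and 0.086 reported in the abstract, which is the kind of gap the study attributes to overestimation rather than to discovery-set size.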
