[Preprint]. 2024 Jun 26:2023.03.12.532301.
doi: 10.1101/2023.03.12.532301.

Testing for differences in polygenic scores in the presence of confounding

Jennifer Blanc et al. bioRxiv.

Abstract

Polygenic scores have become an important tool in human genetics, enabling the prediction of individuals' phenotypes from their genotypes. Understanding how the pattern of differences in polygenic score predictions across individuals intersects with variation in ancestry can provide insights into the evolutionary forces acting on the trait in question, and is important for understanding health disparities. However, because most polygenic scores are computed using effect estimates from population samples, they are susceptible to confounding by both genetic and environmental effects that are correlated with ancestry. The extent to which this confounding drives patterns in the distribution of polygenic scores depends on patterns of population structure in both the original estimation panel and in the prediction/test panel. Here, we use theory from population and statistical genetics, together with simulations, to study the procedure of testing for an association between polygenic scores and axes of ancestry variation in the presence of confounding. We use a general model of genetic relatedness to describe how confounding in the estimation panel biases the distribution of polygenic scores in a way that depends on the degree of overlap in population structure between panels. We then show how this confounding can bias tests for associations between polygenic scores and important axes of ancestry variation in the test panel. Specifically, for any given test, there exists a single axis of population structure in the GWAS panel that needs to be controlled for in order to protect the test. Based on this result, we propose a new approach for directly estimating this axis of population structure in the GWAS panel. We then use simulations to compare the performance of this approach to the standard approach in which the principal components of the GWAS panel genotypes are used to control for stratification.
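The idea of a single GWAS-panel axis tied to a given test can be illustrated with a toy sketch. It assumes, and this is an assumption of the sketch rather than something stated in the abstract, that the axis can be approximated by computing per-SNP genotype contrasts along the test vector in the test panel and then projecting GWAS-panel genotypes onto those contrasts. The two-population model, the function name `project_test_axis`, and all parameter values are hypothetical; the exact estimator and its normalization are defined in the full paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_pop, n_snps = 200, 500  # hypothetical toy sizes

# Toy allele frequencies for two diverged populations (assumed model).
p_anc = rng.uniform(0.2, 0.8, n_snps)
shift = rng.normal(0.0, 0.05, n_snps)
p_one = np.clip(p_anc + shift, 0.05, 0.95)
p_two = np.clip(p_anc - shift, 0.05, 0.95)

def draw(p):
    """Draw diploid genotypes (0/1/2) for n_per_pop individuals."""
    return rng.binomial(2, p, size=(n_per_pop, n_snps)).astype(float)

G = np.vstack([draw(p_one), draw(p_two)])  # GWAS panel genotypes
X = np.vstack([draw(p_one), draw(p_two)])  # test panel genotypes
T = np.repeat([-0.5, 0.5], n_per_pop)      # centered test vector

def project_test_axis(G, X, T):
    """Project the test-panel contrast along T into the GWAS panel."""
    Xc = X - X.mean(axis=0)
    Gc = G - G.mean(axis=0)
    r = Xc.T @ T   # per-SNP genotype contrasts along the test vector
    f = Gc @ r     # position of each GWAS individual along that contrast
    return f / np.linalg.norm(f)

f_hat = project_test_axis(G, X, T)
gwas_label = np.repeat([0.0, 1.0], n_per_pop)
corr = np.corrcoef(f_hat, gwas_label)[0, 1]
print(f"correlation with GWAS population label: {corr:.2f}")
```

In this toy setting the two panels share the same population split, so the projected axis closely tracks population membership in the GWAS panel. With less overlap between panels the projection would be noisier.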

Figures

Figure 1: Schematic of two different panel configurations. The effect of stratification depends on the overlapping structure between the GWAS and test panels.
(A, C) Two different topologies used to create the GWAS and test panels. (B) Stratification was modeled in the GWAS panel by drawing an individual's phenotype $y \sim N(0,1)$ and adding $\Delta_{AB}$ if they originated from population B. (D) When there is overlapping structure between GWAS and test panels, there is an expected mean difference between polygenic scores in populations C and D. Additionally, the bias in $\hat{q}$ increases with the magnitude of stratification in the GWAS. (E) However, when there is no overlapping structure between panels, there is no expected difference in mean polygenic scores between C and D, and $\hat{q}$ remains unbiased regardless of the magnitude of stratification. (F, G) Including $\tilde{F}_{Gr}$ as a covariate in the GWAS controls for stratification, eliminating bias in $\hat{q}$ regardless of $\Delta_{AB}$ or the overlapping structure between GWAS and test panels.
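The overlapping-structure scenario in this caption can be reproduced in a small simulation. The following is a minimal sketch, not the authors' simulation code. It assumes that populations A and C share one set of allele frequencies and B and D share another, uses independent SNPs, and uses the population label as a stand-in for $\tilde{F}_{Gr}$, which is reasonable only because this toy model has a single split. The function names, sample sizes, and drift parameter are hypothetical, and the mean score difference between D and C is used as a simplified proxy for the test statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_snps, delta_ab = 200, 500, 1.0  # hypothetical toy values

# Overlapping structure (assumed): A and C share frequencies, as do B and D.
p_anc = rng.uniform(0.2, 0.8, n_snps)
shift = rng.normal(0.0, 0.05, n_snps)
p_ac = np.clip(p_anc + shift, 0.05, 0.95)
p_bd = np.clip(p_anc - shift, 0.05, 0.95)

def draw(p):
    """Draw diploid genotypes (0/1/2) for n_per_pop individuals."""
    return rng.binomial(2, p, size=(n_per_pop, n_snps)).astype(float)

G = np.vstack([draw(p_ac), draw(p_bd)])        # GWAS panel: A then B
X = np.vstack([draw(p_ac), draw(p_bd)])        # test panel: C then D
label = np.repeat([0.0, 1.0], n_per_pop)       # 0 = A or C, 1 = B or D

# Non-heritable phenotype: y ~ N(0, 1), plus delta_ab for population B.
y = rng.normal(0.0, 1.0, 2 * n_per_pop) + delta_ab * label

def gwas_effects(G, y, covariate=None):
    """Marginal per-SNP OLS slopes after regressing out the covariates."""
    cols = [np.ones(len(y))]
    if covariate is not None:
        cols.append(covariate)
    C = np.column_stack(cols)

    def residualize(A):
        return A - C @ np.linalg.lstsq(C, A, rcond=None)[0]

    G_res, y_res = residualize(G), residualize(y)
    return (G_res.T @ y_res) / np.sum(G_res ** 2, axis=0)

def mean_score_difference(X, beta, label):
    """Mean polygenic score in D minus mean polygenic score in C."""
    scores = (X - X.mean(axis=0)) @ beta
    return scores[label == 1].mean() - scores[label == 0].mean()

q_uncorrected = mean_score_difference(X, gwas_effects(G, y), label)
q_corrected = mean_score_difference(X, gwas_effects(G, y, covariate=label), label)
print(f"no correction: {q_uncorrected:.2f}, with covariate: {q_corrected:.2f}")
```

Because the phenotype has no genetic component, any mean difference in scores between C and D is stratification bias. In this sketch the uncorrected GWAS produces a large difference, while including the covariate leaves only sampling noise.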
Figure 2: Error in estimators of $\tilde{F}_{Gr}$ depends on the number of SNPs used to compute them.
(A) We simulated a population model with a single split and sampled an equal proportion of individuals from each population to make a GWAS and test panel. (D, G) Here we simulated population models with two splits and sampled individuals in the overlapping structure configuration. (B, E, H) As $\tilde{F}_{Gr}$ is known for these population models, we computed the error in $\hat{U}_1$ and $\hat{F}_{Gr}$ as estimators of $\tilde{F}_{Gr}$ using eq. 27. For both estimators, error decreased as the number of SNPs increased. We hold the number of GWAS panel individuals constant at M = 1,000, so as L increases the ratio M/L decreases. The error in $\hat{U}_1$ does not depend on the population model, as the depth of the deepest split is constant across models. Error in $\hat{F}_{Gr}$ increases as overlap between panels decreases. (C, F, I) Bias in $\hat{q}$ computed from using the estimators as covariates in the GWAS follows from the error in the estimators themselves.
Figure 3: Stratification bias in more complex demographic scenarios.
GWAS and test panel individuals were simulated using a stepping-stone model with continuous migration. In the GWAS panel, the phenotype is non-heritable and stratified along either latitude (A), the diagonal (B), or in a single deme (C). When effect sizes were estimated in a GWAS with no correction for stratification, polygenic scores constructed in the test panel recapitulate the spatial distribution of the confounder (second column). Including $\hat{F}_{Gr}$ (test vector is latitude for A and B, belonging to * deme for C) in the GWAS model eliminates bias in polygenic scores along the test axis (third column), which is also reflected in the association test bias (fifth column). We also compare our approach to including the top 10 PCs (fourth column), which successfully protects the test in A and B but remains biased for C.
Figure 4: Quantifying error in estimates of $\hat{F}_{Gr}$ and sample PCs for the six-by-six stepping stone demographic model.
(A) Given the stepping stone demographic model used in Figure 3, individuals within a deme are exchangeable and have the same $\tilde{F}_{Gr}$ and population PC value. Therefore, we used variation within demes to estimate the error in $\hat{F}_{Gr}$ and a lower bound for the error in sample PCs (see Section 5.6.1 and Section 5.6.2 for details) for different values of L (we hold M = 1,400). The dashed vertical line indicates PC 35, the last population PC we expect to capture real structure. (B) When latitude is the test vector, both sample PCs and $\hat{F}_{Gr}$ are well estimated and bias in $\hat{q}$ is reduced. (C) When a single deme indicator variable is the test vector, higher PCs are needed to capture $\tilde{F}_{Gr}$. These sample PCs are not well estimated, and residual bias remains when 35 PCs are used for most values of L.
Figure 5: Different patterns of confounding and $\hat{F}_{Gr}$ are captured by different GWAS panel sample PCs.
For the three possible combinations of confounding and polygenic score association tests in Figure 3, we plot the variance in either the confounder or $\hat{F}_{Gr}$ explained by cumulative GWAS panel sample PCs, with the top 10 PCs highlighted in a darker color. As $\tilde{F}_{Gr}$ is unknown for this model, we estimated the error in $\hat{F}_{Gr}$ as 0.011 and 0.04 for latitude and the single deme, respectively, and therefore assume it is a decent proxy for $\tilde{F}_{Gr}$. In (A), both the confounder and $\hat{F}_{Gr}$ (and therefore $\tilde{F}_{Gr}$) represent variation along latitude and are well captured by the first two PCs. For (B), the confounder varies along the diagonal, and these individual deme-level differences are not well captured by top sample PCs. In contrast, the test vector is still latitude and $\hat{F}_{Gr}$ is again well captured by PCs 1 and 2. Finally, in (C), both the confounder and the test vector represent membership in a single deme and are therefore not as well captured by top sample PCs.
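The quantity plotted in this figure, the variance in a vector explained by cumulative sample PCs, can be computed as sketched below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the two-population toy genotypes, the standardization of genotypes before the SVD, and the function name `cumulative_variance_explained` are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_pop, n_snps = 50, 300  # hypothetical toy sizes

# Toy GWAS panel drawn from two diverged populations (assumed model).
p_anc = rng.uniform(0.2, 0.8, n_snps)
shift = rng.normal(0.0, 0.1, n_snps)
freqs = [np.clip(p_anc + sign * shift, 0.05, 0.95) for sign in (1, -1)]
G = np.vstack(
    [rng.binomial(2, p, size=(n_per_pop, n_snps)) for p in freqs]
).astype(float)
label = np.repeat([0.0, 1.0], n_per_pop)  # vector whose variance we explain

def cumulative_variance_explained(G, v):
    """Fraction of variance in v explained by sample PCs 1..k, for every k."""
    G = G[:, G.std(axis=0) > 0]                # drop monomorphic SNPs
    Gs = (G - G.mean(axis=0)) / G.std(axis=0)  # standardize each SNP
    U, _, _ = np.linalg.svd(Gs, full_matrices=False)  # columns are sample PCs
    vc = v - v.mean()
    loadings = U.T @ vc                        # projection of v onto each PC
    return np.cumsum(loadings ** 2) / (vc @ vc)

r2 = cumulative_variance_explained(G, label)
print(f"PC1 alone: {r2[0]:.2f}, top 10 PCs: {r2[9]:.2f}")
```

Here the vector of interest is aligned with the deepest split, so the first PC captures most of its variance. A vector that singles out one deme among many would typically need many more PCs, which is the contrast the figure draws.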
