. 2023 Nov;55(11):1866-1875.

doi: 10.1038/s41588-023-01529-1. Epub 2023 Oct 19.

Systematic differences in discovery of genetic effects on gene expression and complex traits

Hakhamanesh Mostafavi¹, Jeffrey P Spence², Sahin Naqvi^{2

3}, Jonathan K Pritchard^{4

5}

Affiliations

¹ Department of Genetics, Stanford University, Stanford, CA, USA. hmostafavi@stanford.edu.
² Department of Genetics, Stanford University, Stanford, CA, USA.
³ Department of Chemical and Systems Biology, Stanford University, Stanford, CA, USA.
⁴ Department of Genetics, Stanford University, Stanford, CA, USA. pritch@stanford.edu.
⁵ Department of Biology, Stanford University, Stanford, CA, USA. pritch@stanford.edu.

PMID: 37857933
PMCID: PMC12270542
DOI: 10.1038/s41588-023-01529-1

Systematic differences in discovery of genetic effects on gene expression and complex traits

Hakhamanesh Mostafavi et al. Nat Genet. 2023 Nov.

. 2023 Nov;55(11):1866-1875.

doi: 10.1038/s41588-023-01529-1. Epub 2023 Oct 19.

Authors

Hakhamanesh Mostafavi¹, Jeffrey P Spence², Sahin Naqvi^{2

3}, Jonathan K Pritchard^{4

5}

Affiliations

¹ Department of Genetics, Stanford University, Stanford, CA, USA. hmostafavi@stanford.edu.
² Department of Genetics, Stanford University, Stanford, CA, USA.
³ Department of Chemical and Systems Biology, Stanford University, Stanford, CA, USA.
⁴ Department of Genetics, Stanford University, Stanford, CA, USA. pritch@stanford.edu.
⁵ Department of Biology, Stanford University, Stanford, CA, USA. pritch@stanford.edu.

PMID: 37857933
PMCID: PMC12270542
DOI: 10.1038/s41588-023-01529-1

Abstract

Most signals in genome-wide association studies (GWAS) of complex traits implicate noncoding genetic variants with putative gene regulatory effects. However, currently identified regulatory variants, notably expression quantitative trait loci (eQTLs), explain only a small fraction of GWAS signals. Here, we show that GWAS and cis-eQTL hits are systematically different: eQTLs cluster strongly near transcription start sites, whereas GWAS hits do not. Genes near GWAS hits are enriched in key functional annotations, are under strong selective constraint and have complex regulatory landscapes across different tissue/cell types, whereas genes near eQTLs are depleted of most functional annotations, show relaxed constraint, and have simpler regulatory landscapes. We describe a model to understand these observations, including how natural selection on complex traits hinders discovery of functionally relevant eQTLs. Our results imply that GWAS and eQTL studies are systematically biased toward different types of variant, and support the use of complementary functional approaches alongside the next generation of eQTL studies.

PubMed Disclaimer

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

**Extended Data Fig. 1 |. Genes closest to eQTLs versus eGenes.**
(A) Fraction of eQTLs for which the target eGene is also the gene with the closest TSS, as a function of eQTL association p-value. Error bars show ± 2 standard errors computed as $\sqrt{2 f (1 - f) / M}$ , where $f$ is the estimated fraction, and $M$ is the number of eQTLs per p-value group. In the p-value groups shown, from left to right, there are 50,859, 45,650, 11,246, 4,781, 2,575, and 3,885 eQTLs, respectively. The dashed line shows the mean value of 0.52 across all eQTLs. (B) Same as Fig. 2a, but with different gene assignments to eQTLs (N=118,996). Fraction of eGenes linked to eQTLs (green), or closest genes to eQTLs (red), or closest genes to control SNPs matched for MAF, LD score and gene density (light red) with high pLI (pLI > 0.9, a measure of selective constraint). Error bars corresponding to eQTL properties (red and green points) show 95% confidence intervals as determined by quantile bootstrapping. For matched SNPs (light red), points and error bars show mean values and 95% confidence intervals in 1000 sampling iterations.

**Extended Data Fig. 2 |. Basic variant-level differences between GWAS hits and eQTLs.**
Distribution of minor allele frequency (MAF), linkage disequilibrium (LD) score and gene density for 118,996 eQTLs (red), 22,119 GWAS hits (blue), and 100,000 randomly chosen variants. LD score values are cut at 1000 for clarity.

**Extended Data Fig. 3 |. GWAS and eQTL genes are under different selective constraints: robustness to gene-level measures of selective constraints.**
Logistic regression coefficients corresponding with different gene-level measures of selection for predicting GWAS hits (N=22,119) or eQTLs (N=118,996) versus random SNPs (N=100,000) after adjusting for confounders (see Methods). Results are plotted as regression coefficients on the original data with error bars showing the 2.5^th and 97.5^th percentile over 1000 bootstrap samples. The measures of selection are pLI and LOEUF from the gnomAD study^,, and $h s$ estimates from Agarwal et al.. Lower LOEUF values correspond to higher selective constraints, therefore we used -LOEUF values to match other measures, such that higher values mean higher constraint levels.

**Extended Data Fig. 4 |. GWAS and eQTL genes have different enhancer architectures.**
Same as Fig. 3b, but using enhancer-gene links predicted by the activity-by-contact (ABC) model from Nasser et al. (Methods). For a given gene, we computed (i) the number of biosamples in which a gene has an enhancer, and (ii) the average total enhancer length (in base pairs) across active biosamples. Shown are logistic regression coefficients corresponding with the two enhancer features for predicting 22,119 GWAS hits (blue) and 118,996 eQTLs (red) versus 100,000 random variants after adjusting for confounders (Methods). Results are plotted as regression coefficients on the original data with error bars showing the 2.5^th and 97.5^th percentile over 1000 bootstrap samples.

**Extended Data Fig. 5 |. Contribution of transcription factors (TFs) in Gene Ontology (GO) annotations and their enrichment in GWAS and eQTL genes.**
(A) Proportion of TFs in 41 GO biological processes shown in Fig. 4a. (B) Same as Fig. 4a, but now excluding TFs from all 41 gene categories before computing enrichment values among GWAS and eQTL genes. Traits and tissues (x-axis) are sorted by hit count (decreasing from left to right), and GO terms (y-axis) are sorted by the mean pLI value of associated genes (before removing TFs, replicating the ordering in Fig. 4a). For each trait- or tissue-GO term pair we computed enrichment z-scores based on 1000 sampling iterations of variants matched for MAF, LD score, and gene density (see Methods). The color map represents enrichment (green) or depletion (magenta) of a given gene set among GWAS or eQTL genes. See Fig. 4a for additional details.

**Extended Data Fig. 6 |. Multi-functionality of highly interacting genes in protein-protein interaction (PPI) networks and their enrichment in GWAS genes.**
(A) Proportion of genes in bins ranked by the number of interactions in the InWeb PPI network that are among the top multi-functional genes (defined as top 20% of genes ranked by the count of Gene Ontology (GO) terms they belong to, see Methods). Error bars show 2 standard errors. 16,510 genes with an assigned PPI degree are evenly split into the 5 gene bins shown. (B) Fraction of GWAS and eQTL genes in gene bins ranked by the number of interactions in the InWeb PPI network. For GWAS hits and eQTLs, error bars show 95% confidence intervals as determined by quantile bootstrapping over 1000 sampling iterations. For matched variants (for MAF, LD score and gene density, shown in light blue and red colors), points and error bars show mean values and 95% confidence intervals in 1000 sampling iterations. See Supplementary Table 5 for the counts of genes in each bin shown.

**Extended Data Fig. 7 |. Effect of selection on variants contribution to variance in phenotype and gene expression.**
(A,B) As described in the main text, we consider a model of phenotypic effects mediated by effects on gene expression intermediates: a genetic variant affects the expression of the target gene with effect $β$ , and the gene expression intermediate affects the downstream phenotype with effect size $γ$ . (A) Contribution to phenotypic variance. Under a neutral model, contribution to phenotypic variance, $E [2 p (1 - p)] β^{2} γ^{2}$ , is proportional to phenotypic effect, $β^{2} γ^{2}$ , as effect size and allele frequency are uncoupled. Selection keeps higher effect variants at lower frequencies (that is, lowering $E [2 p (1 - p)]$ ) and thus “flattens” the expected contribution to variance. The red line shows a flattened curve taking $E [2 p (1 - p) β^{2} γ^{2} | β, γ] ~ κ (1 - e^{- β^{2} γ^{2}} / κ)$ , with $κ = 2.986$ (Methods). (B) Contribution to variance in gene expression. Similar to the argument in (A), under neutrality, contribution to variance in gene expression, $E [2 p (1 - p)] β^{2}$ , is proportional to the effect on expression, $β^{2}$ . Under selection, flattening (that is, lowering of $E [2 p (1 - p)]$ ) is more pronounced for variants regulating high-effect (that is, high $γ^{2}$ ) genes. Red lines show trends for four quantiles of $γ^{2}$ , where $γ ~ N (0,1)$ ; darker colors show higher $γ^{2}$ values. See Methods for modeling details.

**Extended Data Fig. 8 |. Depletion of selectively constrained genes among non-GTEx eGenes.**
The factors we described against the discovery of trait-eQTLs likely bias eQTL assays in *any* context. As proof of concept, we show that similar to GTEx eGenes, eGenes identified in non-conventional eQTL assays are also depleted of strongly selected genes. (A) Enrichment of high pLI genes in eGenes identified (i) in fetal brain samples by Aygün et al., (ii) at multiple stages of iPS cells differentiation towards neuronal fate by Jerber et al. and (iii) in GTEx brain tissues. Sample labels for Jerber et al. refer to different ascertained cell types, at different days of differentiation, and in the presence or absence of stimulation by rotenone (ROT). Cell labels for Jerber et al.: Astro, astrocyte-like; DA, dopaminergic neuron; epen1, ependymal-like 1; FPP, floor plate progenitors; prolif. FPP, proliferating floor plate progenitors; sert, serotonergic-like neuron; D11, day 11 of differentiation; D30, day 30; D52, day 52. (B) Enrichment of high pLI genes in eGenes identified in (i) single-cell analyses of blood cell types by Yazar et al. and (ii) GTEx whole blood. Sample labels for Yazar et al. refer to different blood cell types: : B_IN, immature and naive B cell; B_Mem, memory B cell; CD4_ET, CD4+ effector memory and central memory T cell; CD4_NC, CD4+ naive and central memory T cell; CD4_SOX4, CD4+ SOX4 T cell; CD8_ET, CD8+ effector memory T cell; CD8_NC, CD8+ naive and central memory T cell; CD8_S100B, CD8+ S100B T cell; DC, dendritic cell; Mono_C, classical monocyte; Mono_NC, non-classical monocyte; NK, natural killer cell; NK_R, natural killer cell recruiting; Plasma, plasma cell. Enrichment values (on the x-axis) and z-scores (on the y-axis) were computed based on values observed in 10,000 sampling iterations of random genes (Methods).

**Extended Data Fig. 9 |. Effect of eQTL assay sample size on discovery.**
Same as Fig. 6B, but with three eQTL discovery thresholds corresponding to different sample sizes. The discovery thresholds are derived by setting the power rate to 15% for GWAS under the assumptions detailed in the Methods section, and to 10%, 15% and 20% for eQTLs.

**Fig. 1 |. Study workflow and key results.**
Overview of our pipeline for the analysis of GWAS hits and eQTLs. The list of traits and tissues can be found in Supplementary Table 1. We compare GWAS hits and eQTLs with respect to a number of genic and SNP features. A summary of our main results is presented. We describe models for variant discovery in GWAS and eQTL assays to conceptualize these results. See Methods for details.

**Fig. 2 |. GWAS and eQTL genes are under different selective constraints.**
a, Fraction of genes with high pLI (pLI > 0.9, a measure of selective constraint) among genes nearest to 22,119 GWAS hits (blue), 118,996 eQTLs (red) and control SNPs matched for MAF, LD score and gene density. b, Enrichment of high-pLI genes in GWAS genes for individual traits and eQTL genes for individual tissues. Enrichment values ( $x$ axis) and $z$ -scores ( $y$ axis) were derived from 1,000 control SNP sampling iterations. The $z$ -scores were calculated as the difference between the mean of control samples and the values for GWAS hits or eQTLs, divided by the standard deviation of control samples. The Bonferroni correction threshold for absolute $z$ -score values is 3.43 (Methods). c, Fraction of high-pLI genes among eQTL genes (closest gene, red), eGenes (green) and nearest genes to control SNPs (light red) versus eQTL tissue-level rank. For groupings on the $x$ axis, we first bin ranked eQTLs by association $P$ values in groups of 1,000 eQTLs, and then pooled eQTLs across tissues by the ranked bins. This procedure resulted in 33,123, 25,643, 18,256, 13,839, 10,875 and 8,064 eQTLs in the rank bins [1–1,000], [1,001–2,000], [2,001–3,000], [3,001–4,000], [4,001–5,000] and [5,001–6,000], respectively. In panels a and c, error bars show 95% confidence intervals as determined by quantile bootstrapping over 1,000 sampling iterations. For matched SNPs, points show mean values in sets of matched SNPs corresponding to bootstrapped samples (Methods).

**Fig. 3 |. GWAS and eQTL genes have different transcriptional regulatory landscapes.**
a, Schematic of enhancer activity across a number of tissue/cell types. Based on enhancer–gene links inferred from the Roadmap dataset, for a given gene, we computed: (1) the number of tissue/cell types in which a gene has an enhancer, and (2) the average total enhancer length (in base pairs) across tissue/cell types with enhancer activity. b, Logistic regression coefficients corresponding with the two enhancer features described in a for predicting GWAS hits (blue) and eQTLs (red) versus random SNPs ( $N = 100,000$ ) after adjusting for confounders (Methods). c, Mean count of TSSs per gene across cell types in the FANTOM project for GWAS and eQTL genes. d, Enrichment of GWAS and eQTL genes relative to genes linked to matched SNPs ( $y$ axis) in gene bins ranked by connectedness values computed based on coexpression networks from Saha et al. ( $x$ axis). In b–d, error bars show 95% confidence intervals as determined by quantile bootstrapping over 1,000 sampling iterations. For matched SNPs, points show mean values in sets of matched SNPs corresponding to bootstrapped samples. All statistics were computed for 118,996 eQTLs and 22,119 GWAS hits. See Supplementary Table 5 for the counts of genes in each bin shown in panel d.

**Fig. 4 |. Diverse categories of functional genes are enriched among GWAS genes but not among eQTL genes.**
a, Enrichment of genes associated with 41 GO terms among GWAS and eQTL genes (Methods). Traits and tissues ( $x$ axis) are sorted by hit count (decreasing from left to right), and GO terms ( $y$ axis) are sorted by the mean pLI value of associated genes. For each trait– or tissue–GO term pair we computed enrichment $z$ -scores based on 1,000 sampling iterations of SNPs matched for MAF, LD score and gene density (Methods). The color map represents enrichment (green) or depletion (magenta) of a given gene set among GWAS or eQTL genes. Absolute $z$ -scores less than 1.86, which is the threshold value corresponding to a 5% false discovery rate, are colored white. The maximum color intensity is capped at absolute $z$ -score of 4.88 corresponding to the Bonferroni correction threshold (Methods). See Supplementary Table 7 for enrichment and z-score values. DVT, deep vein thrombosis; EBV, Epstein-Barr virus. b, Fraction of GWAS genes (of 22,119) and eQTL genes (of 118,996) that are transcription factor (TFs). c, Enrichment of GWAS and eQTL genes relative to genes linked to matched SNPs ( $y$ axis) in different bins of genes ranked by the counts of GO terms they belong to ( $x$ axis). See Supplementary Table 5 for the counts of genes in each bin. Error bars show 95% confidence intervals as determined by quantile bootstrapping over 1,000 sampling iterations. For matched SNPs in b, points show mean values in sets of matched SNPs corresponding to bootstrapped samples.

**Fig. 5 |. GWAS hits are less enriched at TSSs than are eQTLs.**
a, Distance of eQTLs (left) and GWAS hits (right) to the nearest TSS. Points show fraction of SNPs in 10-kb bins. SNPs more than 100 kb away from their closest TSS are not shown for clarity. b, Enrichment of eQTLs and GWAS hits in promoter and enhancer elements annotated in the FANTOM project (left), and in promoter-like, TSS-proximal enhancer-like and TSS distal enhancer-like elements annotated in the ENCODE project (right). For each annotation, the enrichment value is computed as the fraction of SNPs in the annotation divided by the fraction of all SNPs in the annotation. See Supplementary Table 8 for the counts of SNPs in each annotation. c, Enrichment of eQTLs and GWAS hits relative to random SNPs ( $N = 6,971,256$ ) in ENCODE promoter elements (left) and Roadmap enhancer elements (right) as a function of the quintile of their target gene LOEUF score (a measure of selective constraint). For promoter elements, linking to genes was done taking the closest TSS within 1 kb. For enhancers, enhancer–gene links from Liu et al. were used. See Supplementary Table 8 for the counts of SNPs linked with each LOEUF bin. In b and c, the black dashed lines mark the value of 1 on the $y$ axis. In all panels, error bars show 95% confidence intervals as determined by quantile bootstrapping over 1,000 sampling iterations. For matched SNPs, points show mean values in sets of matched SNPs corresponding to bootstrapped samples. All statistics were computed for 118,996 eQTLs and 22,119 GWAS hits. LoF, loss-of-function.

**Fig. 6 |. A model for variant discovery in GWAS and eQTL assays.**
a, Case of a neutrally evolving phenotype. (Left) Variant discovery in the space defined by variant effect on expression, $β$ , and genic effect on phenotype, $γ$ . Shading colors represent parameter space for the discovery of GWAS hit only (blue), eQTL only (red) and both types (purple), determined conditional on $E [2 p (1 - p)]$ which is independent of the effect sizes in the neutral case. (Right) Schematic of variant discovery mapped to *cis*-regulatory domains as a function of genic contribution to the phenotype. b, Case of a phenotype under selection. Same as panel a, but now the discovery regions in the left panel are determined based on how $E [2 p (1 - p)]$ covaries with effect sizes under a model of selection. Note that the discovery lines shown represent qualitative (and not quantitative) trends, derived under simplifying assumptions for illustrative purposes (Methods). In the Supplementary Note we demonstrate the robustness of discovery trends to various modeling assumptions and choices of parameters.

See this image and copyright information in PMC

References

1. Claussnitzer M et al. A brief history of human disease genetics. Nature 577, 179–189 (2020). - PMC - PubMed
1. Maurano MT et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012). - PMC - PubMed
1. Gusev A et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am. J. Hum. Genet 95, 535–552 (2014). - PMC - PubMed
1. Kundaje A et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). - PMC - PubMed
1. Meuleman W et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature 584, 244–251 (2020). - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

R01HG011432/U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)

LinkOut - more resources

Full Text Sources
- Nature Publishing Group
- PubMed Central

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Systematic differences in discovery of genetic effects on gene expression and complex traits

Affiliations

Systematic differences in discovery of genetic effects on gene expression and complex traits

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources