[Preprint]. 2024 May 6:2024.05.05.592437.
doi: 10.1101/2024.05.05.592437.

Functional dissection of complex and molecular trait variants at single nucleotide resolution

Layla Siraj et al. bioRxiv.

Abstract

Identifying the causal variants and mechanisms that drive complex traits and diseases remains a core problem in human genetics. The majority of these variants have individually weak effects and lie in non-coding gene-regulatory elements where we lack a complete understanding of how single nucleotide alterations modulate transcriptional processes to affect human phenotypes. To address this, we measured the activity of 221,412 trait-associated variants that had been statistically fine-mapped using a Massively Parallel Reporter Assay (MPRA) in 5 diverse cell-types. We show that MPRA is able to discriminate between likely causal variants and controls, identifying 12,025 regulatory variants with high precision. Although the effects of these variants largely agree with orthogonal measures of function, only 69% can plausibly be explained by the disruption of a known transcription factor (TF) binding motif. We dissect the mechanisms of 136 variants using saturation mutagenesis and assign impacted TFs for 91% of variants without a clear canonical mechanism. Finally, we provide evidence that epistasis is prevalent for variants in close proximity and identify multiple functional variants on the same haplotype at a small, but important, subset of trait-associated loci. Overall, our study provides a systematic functional characterization of likely causal common variants underlying complex and molecular human traits, enabling new insights into the regulatory grammar underlying disease risk.

Conflict of interest statement

Competing Interests PCS is a co-founder of and consultant to Sherlock Biosciences and Board Member of Danaher Corporation. PCS and RT hold patents related to the application of MPRA. JCU and FA are employees of Illumina. QSW is an employee of Calico Life Sciences LLC. ZRM is an employee of insitro.

Figures

Figure 1. Identification of functional genetic variation underlying fine-mapped complex and molecular traits.
a. Fine-mapped complex trait variants from UKBB and BBJ, as well as fine-mapped eQTL variants from GTEx v8, were included in this study. b. 304,278 unique variants were tested by MPRAs, including 86,064 unique control variants. Nearly 25% of high PIP (> 0.5) eQTLs were associated with gene expression in multiple tissue systems (black), while high PIP complex trait variants were more domain specific. The numbers of variants per category include overlapping variants, causing totals to exceed the number of unique variants. c. Experimental overview of the massively parallel reporter assay (MPRA) experiment. d. Element activity and allelic activity results for variants. Each point represents one measurement per variant (selected by best log2(fold-change) p-value), with significant expression modulating variants (emVars) denoted in orange and non-significant variants in purple. The maximum activity of each variant (ref or alt allele) is shown on the x-axis, which has a bimodal activity distribution around 0. Variants with < 20 normalized RNA counts are omitted. e. The proportion of variants that are emVars across fine-mapping controls and PIP bins stratified by trait type. Error bars represent 95% CIs. f. Precision-recall plots evaluating different methods for discriminating between equally-sized sets of positive (PIP > 0.9) and negative (PIP < 0.01) variants. Error bars represent 95% CIs.
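As a reading aid for panel f, the following minimal sketch shows how precision and recall could be computed for a binary functional call (for example, emVar status) against equally-sized positive (PIP > 0.9) and negative (PIP < 0.01) sets. The function name and the toy data are hypothetical and are not taken from the study.

```python
def precision_recall(calls, labels):
    """Compute precision and recall for binary calls.

    calls  -- list of bool, True if the method calls the variant functional
    labels -- list of bool, True if the variant is in the positive set
              (e.g. PIP > 0.9), False if in the negative set (PIP < 0.01)
    """
    tp = sum(c and l for c, l in zip(calls, labels))
    fp = sum(c and not l for c, l in zip(calls, labels))
    fn = sum((not c) and l for c, l in zip(calls, labels))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical toy data: 4 positives followed by 4 negatives.
labels = [True] * 4 + [False] * 4
calls = [True, True, True, False, True, False, False, False]
precision, recall = precision_recall(calls, labels)
print(precision, recall)
```

Because the positive and negative sets are the same size in this setup, a method that calls variants at random would be expected to have a precision of about 0.5.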
Figure 2. Reporter assays recapitulate endogenous regulatory function.
a. Correlation (Pearson r) of element MPRA activity with chromatin accessibility at promoters (orange) and distal CREs (blue). Ribbons represent 95% confidence intervals. b. TF occupancy at emVars is compared to TF occupancy at high PIP variants for 1,233 TFs. Odds ratios are calculated as emVars vs non-emVars (y-axis) and high PIP (PIP > 0.5) vs low PIP (x-axis). Point size is proportional to the square root of the number of ChIP peaks overlapping variants in this analysis. A linear regression fit through the origin is shown in burgundy. TFs with significant differential enrichment are highlighted in burgundy (Bonferroni-adjusted P < 0.05). c. Chromatin accessibility QTL effect sizes are correlated (Pearson r) with MPRA allelic effects. emVars are shown in orange and non-emVars in purple. d. Correlation (Pearson r) of element activity at promoters and distal CREs separately across 4 tested cell-types. e. Results from a linear regression of normalized motif counts on MPRA activity from 120k CREs in K562 and HepG2 cells. Specific motif families are highlighted in different colors. f. Enrichment of high PIP emVars compared to low PIP non-emVars (odds ratio) for complex traits (dark blue) and eQTLs (light blue) in selected genomic annotations defined from SEI. Error bars represent 95% CIs. g. Proportion of variants in each category with the indicated predicted variant effect mechanism. Error bars represent 95% CIs. h. Correlation (Spearman ρ) between allelic effects in MPRA and TF binding motif scores for significant TFs (FDR < 0.05). The most significant cell-type is shown for each TF. Error bars represent 95% CIs.
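The odds ratios in panels b and f can be illustrated with a 2×2 table. The sketch below uses a standard Wald interval on the log odds ratio; the caption does not state how the 95% CIs were computed, so that choice is an assumption, and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table (all counts must be > 0).

    a -- emVars overlapping the annotation
    b -- emVars not overlapping the annotation
    c -- non-emVars overlapping the annotation
    d -- non-emVars not overlapping the annotation
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
odds_ratio, lower, upper = odds_ratio_ci(40, 60, 20, 80)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 indicates enrichment (or depletion) at roughly the 5% level under this approximation.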
Figure 3. Evidence for regulatory allelic heterogeneity and multiple causal variants.
a. All 4 diplotypes of fine-mapped variants on the same CRE (< 150 bp, top) were tested across six different windows (bottom) in MPRA. b. Comparison of the expected additive and observed double allele effects after uniformly re-coding diplotypes from largest to smallest effects on activity. Variant pairs with non-additive effects (11%, FDR < 0.05) are shown in red. Each point represents one measurement per variant per cell-type, with the exception of non-additive pairs identified only through a meta-analysis across windows and cell-types, for which one point represents one variant. Denser regions are shown in blue. c. Variant pairs with non-additive effects are physically closer than variant pairs with additive effects (p = 10^−8, Binomial test). Boxes show quartiles, with lines at medians and lower and upper hinges at first and third quartiles; lines extend 1.5 times the interquartile range. d. Non-additive pairs of variants are classified into pairs with activating or dampening effects. e. Example of an amplifying non-additive variant pair. rs1936950 and rs1936951 (shaded purple) are associated with changes in ESS2 expression, alter the CTCF binding motif, and fall within a CTCF ChIP-seq peak. Additive prediction is the sum of the re-coded allelic effects from variant 1 (AA vs TA) and variant 2 (TG vs TA). Double variant is the observed difference (AG vs TA, FDR = 5.5×10^−4). f. Comparison of the number of observed variants in each category (emVars in CREs, CREs, or emVars) and the expected number using three types of controls (location-matched, annotation-matched, or low PIP). Risk ratios are from a random effects meta-analysis across experiments (library and cell-type). CSs containing up to 5 variants with r^2 > 0.9 are included in the analysis. Error bars represent 95% CIs.
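The additive expectation described in panels b and e can be written out directly. In this minimal sketch, each argument is the log2 activity of one of the four diplotypes; the function name and the numeric values are hypothetical, and the re-coding of diplotypes from largest to smallest effect mentioned in the caption is omitted.

```python
def interaction_effect(base, single1, single2, double):
    """Compare the additive prediction with the observed double-variant effect.

    base    -- log2 activity of the baseline diplotype (e.g. TA)
    single1 -- log2 activity with only variant 1 changed (e.g. AA)
    single2 -- log2 activity with only variant 2 changed (e.g. TG)
    double  -- log2 activity with both variants changed (e.g. AG)
    """
    effect1 = single1 - base          # allelic effect of variant 1
    effect2 = single2 - base          # allelic effect of variant 2
    additive = effect1 + effect2      # expected effect if the variants act additively
    observed = double - base          # measured effect of the double variant
    return additive, observed, observed - additive


# Hypothetical log2 activities, for illustration only.
additive, observed, deviation = interaction_effect(1.0, 1.5, 1.25, 2.5)
print(additive, observed, deviation)
```

A deviation near zero is consistent with additivity; calling a pair non-additive would additionally require a significance test across replicates, which this sketch does not attempt.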
Figure 4. Saturation mutagenesis of 136 fine-mapped emVars.
a. Schema for two categories included in saturation mutagenesis (SatMut) experiments, emVars with canonical (left) or unknown (right) mechanisms of action. b. Schema of the saturation mutagenesis experiment and analysis. On both allelic backgrounds, mutation of all 200 bases to each of the other three bases is assayed. Short regions that impact transcriptional activity in the SatMut assay are identified as Activity Blocks (ABs) by a Gaussian filtering approach and subsequently matched to motif PWMs. c. Scatter plot of motif effects on activity from CRE sequences presented in Fig. 2e compared to the log2 enrichment of motifs in repressive or activating ABs in SatMut sequences by cell-type. The size of each observation corresponds to the p-value from the AB enrichment test. d.,f.,h.,k. MPRA results of transcriptional activity for the reference (darker shade) and alternative (lighter shade) alleles in either K562 (blues) or HepG2 (purples) for emVars rs536864738 (d.), rs11864973 (f.), rs7953706 (h.), and rs2529369 (k.). Error bars indicate SEs. e.,g.,i.,l. Nucleotide contribution scores across the 200 bp elements (or zoomed-in regions) containing the emVars from d., f., h., and k., which are highlighted by a dark yellow bar. Activity measurements for all positions when tested on the reference (top) or alternative (bottom) allele are depicted as lollipops indicating the change from baseline activity (Δ log2 activity). Activity blocks (ABs) are labeled with a gray bar and matching TF motifs are highlighted with a black bar. Shaded boxes overlap allele(s) of interest, with a callout of the SatMut constructed motif (MPRA) and canonical motif PWM (Canonical). j. Scatter plot of baseline log2 activity for all SatMut tested elements between K562 and HepG2. The correlation between single-nucleotide substitutions for each element is shown. m. Violin plot of CHDR3 expression in tibial nerve tissue from GTEx individuals stratified by both rs2529369 alleles and sex chromosome status (XX and XY). A significant genotype by sex chromosome interaction is observed (P = 0.026).
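The idea of calling Activity Blocks by Gaussian filtering (panel b) can be sketched as smoothing the per-position mutational effects and keeping contiguous runs that exceed a threshold. The caption does not give the kernel width, threshold, or minimum block length used in the study, so every parameter below is an assumption, and the input profile is hypothetical.

```python
import math


def gaussian_smooth(values, sigma=1.0, radius=3):
    """Smooth a per-position effect profile with a truncated Gaussian kernel."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    n = len(values)
    smoothed = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < n:  # renormalize at the edges of the element
                num += w * values[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def call_blocks(smoothed, threshold, min_len=3):
    """Return half-open (start, end) runs where |effect| >= threshold (> 0)."""
    blocks = []
    start = None
    for i, v in enumerate(smoothed + [0.0]):  # sentinel closes a trailing run
        if abs(v) >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


# Hypothetical profile: mean delta log2 activity per position, with one
# six-base stretch whose mutation reduces activity.
effects = [0.0] * 10 + [-1.0] * 6 + [0.0] * 10
blocks = call_blocks(gaussian_smooth(effects), threshold=0.5)
print(blocks)
```

Each block found this way could then be compared against motif PWMs, as the caption describes; that matching step is not shown here.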
Figure 5. Saturation mutagenesis uncovers canonical, non-canonical, and interacting variant mechanisms.
a. Proportion of emVars explained by canonical TF mechanisms (blue) or non-canonical mechanisms (orange) across different categories of annotation. Error bars represent 95% CIs. b. Cumulative distribution of the percent change in activity for every substitution in each saturation mutagenesis (SatMut) element (teal) or the largest (max) substitution at each nucleotide position (pink). Ribbons represent percent change in activity at the 10th percentile element and the 90th percentile element. c. Trait associated variant effects compared to the largest single nucleotide effect seen in a SatMut element. Colors represent the type of trait and size represents the baseline activity of the element. d. MPRA results of transcriptional activity for rs191148279, which is an emVar in K562. Error bars indicate SEs. e. Nucleotide contribution scores across the 200 bp element containing the fine-mapped variant rs191148279, which is highlighted by a dark yellow bar. Activity measurements for all positions when tested on the reference (top) or alternative (bottom) allele are depicted as lollipops indicating the change from baseline activity (Δ log2 activity). Activity blocks (ABs) are labeled with a gray bar and matching TF motifs are highlighted with a black bar. Shaded boxes overlap allele(s) of interest, with a callout of the SatMut constructed motif (MPRA) and canonical motif PWM (Canonical). Locations of large regulatory effects from rare alleles observed in the UKBB are indicated by pink arrows. f. The proportion of carriers of these rare alleles with decreased HbA1C (> 1 SD) is compared to controls. Error bars indicate 95% CIs. g.,h. MPRA and SatMut results of transcriptional activity for two adjacent emVars rs7282770 and rs7282886, similar to d. and e. except for all 4 diplotypes. Error bars indicate SEs. i.,j. MPRA and SatMut results of transcriptional activity for two interacting emVars rs35081008 and rs34003091, similar to f. and g. except for results in both K562 (blues) and HepG2 (purples).
