Nat Commun. 2021 Jun 7;12(1):3394.
doi: 10.1038/s41467-021-23134-8.

Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs

Qingbo S Wang et al. Nat Commun. 2021.

Abstract

The large majority of variants identified by GWAS are non-coding, motivating detailed characterization of the function of non-coding variants. Experimental methods that assess a variant's effect on gene expression in its native chromatin context via direct perturbation are low-throughput. Existing high-throughput computational predictors have thus lacked large gold-standard sets of regulatory variants for training and validation. Here, we leverage a set of 14,807 putative causal eQTLs in humans obtained through statistical fine-mapping, and we use 6121 features to directly train a predictor of whether a variant modifies nearby gene expression. We call the resulting prediction the expression modifier score (EMS). We validate EMS by comparing its ability to prioritize functional variants with that of other major scores. We then use EMS as a prior for statistical fine-mapping of eQTLs to identify an additional 20,913 putatively causal eQTLs, and we incorporate EMS into co-localization analysis to identify 310 additional candidate genes across UK Biobank phenotypes.
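
To illustrate how a functional score can act as a prior in fine-mapping, the following is a minimal sketch rather than the authors' pipeline. It assumes a single causal variant per locus, so that each posterior inclusion probability (PIP) is proportional to the prior weight times the Bayes factor; this is a simplification. The function names, the log Bayes factors, and the score values are all hypothetical.

```python
import math


def posterior_inclusion_probabilities(log_bayes_factors, prior_weights=None):
    """Single-causal-variant PIPs: PIP_i is proportional to prior_i * BF_i.

    prior_weights=None means a uniform prior. Weights are rescaled to sum to 1.
    """
    n = len(log_bayes_factors)
    if prior_weights is None:
        prior_weights = [1.0] * n
    if len(prior_weights) != n:
        raise ValueError("one prior weight is needed per variant")
    if any(w <= 0 for w in prior_weights):
        raise ValueError("prior weights must be positive")
    total = sum(prior_weights)
    # Work in log space and subtract the maximum for numerical stability.
    log_terms = [math.log(w / total) + lbf
                 for w, lbf in zip(prior_weights, log_bayes_factors)]
    shift = max(log_terms)
    unnormalized = [math.exp(t - shift) for t in log_terms]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]


def credible_set(pips, coverage=0.95):
    """Smallest set of variant indices whose PIPs sum to at least `coverage`."""
    order = sorted(range(len(pips)), key=lambda i: pips[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += pips[i]
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical locus: two variants with equal association evidence (for
# example, in tight linkage disequilibrium) and one weakly supported variant.
log_bfs = [2.0, 2.0, 0.0]
# Hypothetical functional scores standing in for EMS.
scores = [0.9, 0.1, 0.1]

pip_unif = posterior_inclusion_probabilities(log_bfs)
pip_ems = posterior_inclusion_probabilities(log_bfs, scores)
cs_unif = credible_set(pip_unif)
cs_ems = credible_set(pip_ems)
```

In this toy locus the uniform prior cannot separate the first two variants, whereas the functional prior shifts posterior mass toward the first one and shrinks the 95% credible set.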

Conflict of interest statement

D.G.M. is a founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme.

Figures

Fig. 1. Examples of the enrichment of variant–gene pairs in whole-blood eQTL PIP bins for functional genomics features.
Enrichments of variant–gene pairs in different posterior inclusion probability (PIP) bins are shown for binary functional features that are non-tissue-specific (a) or tissue-specific in peripheral blood mononuclear cells (b), for deep learning-derived regulatory activity (CAGE) prediction in neutrophils (c), and for distance to TSS (d). n is the number of variant–gene pairs.
Fig. 2. Schematic overview and feature importance of the expression modifier score (EMS).
a EMS is built by (1) defining the training data based on fine-mapping of GTEx v8 data, (2) annotating the variant–gene pairs with functional features, and (3) training a random forest classifier. We do this for each tissue. b, c Feature importance (mean decrease in impurity, MDI) for four different feature categories (b), and top features for each category (c). Baseline annotations are non-tissue-specific binary annotations from Finucane et al., and histone marks are tissue-specific binary histone mark annotations from Roadmap. In b, n is the number of features in the category.
Fig. 3. Performance evaluation of EMS.
Comparison of the different scoring methods, across score percentiles, in prioritizing putative causal whole-blood eQTLs in GTEx v8 (a), massively parallel reporter assay (MPRA) saturation mutagenesis hits (b), reporter assay QTLs (raQTLs) (c), and putative hematopoietic-trait causal variants in UKBB (d).
Fig. 4. Functionally informed fine-mapping with EMS as a prior.
a Number of variant–gene pairs in different PIP bins using a uniform prior vs EMS as a prior. b Number of variants in the 95% credible set (CS) identified by fine-mapping with uniform prior vs EMS as a prior. c Enrichment of reporter assay QTLs (raQTLs) in different PIP bins (gray: publicly available eQTL PIP using DAP-G, blue: uniform prior, orange: EMS as a prior).
Fig. 5. Functionally informed fine-mapping across 49 tissues.
a The number of additional putative causal eQTLs (defined by PIP_EMS > 0.9 and PIP_unif < 0.9) for each tissue is shown in descending order. b–d Mean Basenji score in different classes of tissue-specific putative causal eQTLs for tissue-specific TF-related Basenji features for liver (b), whole blood (c), and LCLs (d). In 39 out of all 42 features across all three tissues, the mean Basenji score in tissue-specific putative causal eQTLs identified by PIP_EMS is significantly higher in the corresponding tissue than in control tissues (t test p < 0.05/42). This changes to 36 in 42 when using PIP_unif instead of PIP_EMS. The enrichment of mean Basenji score in putative causal eQTLs in the corresponding tissue compared to control tissues is higher for PIP_EMS than PIP_unif for all 42 tissues (p < 10^-100 in aggregate), consistent with our understanding that functionally informed fine-mapping using EMS utilizes cell-type-specific functional enrichments, identified from putative causal eQTLs identified with a uniform prior, to identify additional putative causal eQTLs. Duplicated names are distinct features corresponding to biological replicates in the TF activity measurements. Out of 17,960 tissue-specific putative causal eQTLs, n = 222 were for liver (b), n = 1758 were for whole blood (c), and n = 140 were for LCL (d).
Fig. 6. An example of a putative causal eQTL prioritized by EMS.
rs35873233, an upstream variant of CITED4, was prioritized by functionally informed fine-mapping using EMS as a prior. From top to bottom: PIP with uniform prior (PIP_unif), EMS, PIP with EMS as a prior (PIP_EMS); Basenji score for CAGE activity in acute myeloid leukemia (AML), H3K27me3 narrow peak in K562 cell line (red if the variant is on the peak, blue otherwise), sequence context of the alternative allele aligned with the binding motif of SPI1, and PIP for neutrophil count in UKBB (https://www.finucanelab.org/data, ref. ) with uniform prior.
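
The two decision rules stated in the Fig. 5 caption can be written down directly. The sketch below is illustrative only: it encodes the definition of an additional putative causal eQTL (PIP with EMS as a prior above 0.9 and PIP with a uniform prior below 0.9) and the Bonferroni-corrected threshold of 0.05/42 for the per-feature t tests. The function names, the record layout, and all values are hypothetical and are not taken from the paper's data.

```python
def additional_putative_causal(pairs, threshold=0.9):
    """Keep variant-gene pairs that pass the threshold only with the EMS prior."""
    return [p for p in pairs
            if p["pip_ems"] > threshold and p["pip_unif"] < threshold]


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value is below alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p < cutoff for p in p_values]


# Hypothetical variant-gene pairs with made-up PIPs.
pairs = [
    {"variant": "variant_A", "gene": "gene_1", "pip_unif": 0.55, "pip_ems": 0.95},
    {"variant": "variant_B", "gene": "gene_1", "pip_unif": 0.97, "pip_ems": 0.99},
    {"variant": "variant_C", "gene": "gene_2", "pip_unif": 0.20, "pip_ems": 0.40},
]
newly_prioritized = additional_putative_causal(pairs)

# Hypothetical p-values for 42 features; only the first falls below 0.05/42.
p_values = [1e-5] + [0.01] * 41
flags = bonferroni_significant(p_values)
```

Here only variant_A counts as additional: variant_B already exceeds 0.9 under the uniform prior, and variant_C does not reach 0.9 under either prior.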

References

    1. Maurano MT, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
    2. Paul DS, Soranzo N, Beck S. Functional interpretation of non-coding sequence variation: concepts and challenges. Bioessays. 2014;36:191–199. doi: 10.1002/bies.201300126.
    3. Maller JB, et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat. Genet. 2012;44:1294–1301. doi: 10.1038/ng.2435.
    4. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
    5. Roadmap Epigenomics Consortium, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
