Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2023 Nov 13:2023.11.09.563812.
doi: 10.1101/2023.11.09.563812.

An encyclopedia of enhancer-gene regulatory interactions in the human genome

Affiliations

An encyclopedia of enhancer-gene regulatory interactions in the human genome

Andreas R Gschwind et al. bioRxiv. .

Abstract

Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease1-6. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and largescale genetic perturbations generated by the ENCODE Consortium7. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 elementgene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancerpromoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.

PubMed Disclaimer

Conflict of interest statement

Conflict of Interest Statement Z.A. is employed by Google DeepMind. J.C.U. is an employee of Illumina, Inc. D.R.K. is employed by Calico Life Sciences LLC. Z.W. co-founded Rgenta Therapeutics, and she serves as a scientific advisor for the company and is a member of its board. W.J.G. is an inventor on IP licensed by 10x Genomics. A.Kundaje is on the scientific advisory board of PatchBio, SerImmune and OpenTargets, was a consultant with Illumina, and owns shares in DeepGenomics, ImmunAI and Freenome. J.M.E. is a consultant and equity holder in Martingale Labs, Inc. and has received materials from 10x Genomics unrelated to this study.

Figures

Figure 1.
Figure 1.. The ENCODE resource of enhancer-gene regulatory interactions
a, Project overview: Using CRISPR data in K562 in combination with molecular features for enhancer activity, 3D physical interaction and genomic landscape, we trained new semi-supervised predictive models on the CRISPRi data, called ENCODE-rE2G, to predict enhancer-gene regulatory connections from molecular data. We benchmarked their performance using systematic benchmarking against ground truth datasets from CRISPRi, eQTL and GWAS data and compared ENCODE-rE2G’s performance against other published models. Using the ENCODE-rE2G model we built genome-wide maps of E-G regulatory connections across 352 ENCODE biosamples. b, Enhancer-gene regulatory connections for the PRKAR2B locus, showing detailed data and predictions in K562, DNase-seq and ENCODE-rE2G enhancer-gene connections in three related cell types, and ENCODE-rE2G scores for predicted PRKAR2B enhancers in 92 biosamples with at least three enhancers.
Figure 2.
Figure 2.. Benchmarking predictions of enhancer-gene regulatory interactions
a, We benchmarked the performance of predictive models for enhancer - gene regulatory connections on three different prediction tasks: 1) Linking enhancers to CRISPR-validated target genes, 2) Enrichment of putative regulatory variants from fine-mapped eQTL and GWAS datasets, and 3) Linking variants to putative causal target genes from fine-mapped GWAS datasets. b, Precision-Recall curves showing the performance of predictive models at predicting experimental results of CRISPRi data in K562 cells. Combined CRISPRi data was assembled by combining element-gene pairs from the re-analyzed Nasser et al., 20216, Gasperini et al., 201924 and Schraivogel et al., 202023 datasets (10,375 tested element-gene pairs, 472 positives). Curves represent continuous predictors with the dashed vertical line indicating performance at a threshold corresponding to 70% recall. Single dots represent performance of binary predictors. c, Performance (AUPRC) of quantitative predictors as a function of distance to TSS. Color legend from panel d applies. Error bars represent 95% range of AUPRC values inferred via bootstrap (1000 iterations). d, Enrichment – recall curves showing the ratio of fine-mapped distal noncoding eQTLs with a PIP>0.5 in EBV-transformed lymphocytes in predicted enhancers compared to distal noncoding common variants (enrichment, y-axis) versus fraction of variants overlapping enhancers linked to the correct gene in GM12878 cells (recall, x-axis) across different score thresholds for enhancers predicted by different predictive models. Numerical results are reported in Table S15. e, Average enrichment and recall of GWAS fine-mapped SNPs (PIP > 0.1) in predicted elements for each element-gene linking strategy for 5 “likely relevant” pairs cell lines and biomarker traits: 3 RBC-related (RBC count, Mean corpuscular volume, HbA1c) traits for K562 cell line and 2 Lymphocyte-related (Lymphocyte count, Autoimmune disease-combined) traits for the GM12878 cell line. Error bars denote 95% confidence intervals. f, For 197 non-coding credible sets corresponding to 11 blood-related traits that are linked to exactly one “putatively causal” gene with a coding fine-mapped variant within 2Mb on either side of the lead variant, we compute the precision and recall in linking it to this gene for each of the element-gene predictions in blood biosamples (see Methods). Error bars represent 95% confidence intervals. Numerical results are reported in Table S13.
Figure 3.
Figure 3.. Properties of predicted enhancer-gene regulatory interactions
a. The overall scale of the E–G maps is shown as all enhancers × all genes × 352 biosamples. The heatmaps on the left show all ENCODE-rE2G scores for every enhancer gene pair on chr21 of K562 and Testis. The right plot shows a portion of this plot, displaying a 9.67MB region on chr21. For all heatmaps the lightest yellow represents an ENCODE-rE2G score of 0 and the darkest brown represents a score of 0.994. b. Distance between an enhancer and the gene it regulates across each of the 352 biosamples. Each bar represents the median number of E-G pairs at that distance bin across all 352 biosamples. The locus plot displays chr5:40,650,000–40,695,000 shows 2 enhancers for the gene PTGER4 CD4-positive, alpha-beta T cells. c. Distribution of the average number of genes regulated by a given enhancer according to the ENCODErE2G model across each of the 352 biosamples. Summary statistics available in Table S18. The locus plot displays the region from chr11:85,800,000–86,150,000 and shows an enhancer that regulates two genes in pulmonary artery endothelial cells. d. Distribution of the average number of enhancers regulating a given gene according to the ENCODErE2G model across each of the 352 biosamples. Summary statistics available in Table S19. The locus plot shows the region on chrX:39,850,000 – 40,200,000 in which there are 7 enhancers for the BCOR gene in CD8-positive alpha-beta T cells. e. Distribution of the number of biosamples in which a given enhancer-gene regulatory interaction is detected in the ENCODE-rE2G model. The locus plot shows the region on chr21:44,300,000–44,556,000 in which an enhancer at chr21:44296889–44298071 regulates LRRC3-DT in brain microvascular endothelial cells (upper track) and TRPM2 in right ventricle myocardium superior cells (lower track). f. Top 5 functional enrichment of genes expressed across more than half of biosamples stratified by the average number of enhancers they have across biosamples. Biological processes that are enriched in genes with an average of >= 5 enhancers per biosamples are shown in blue and those enriched in genes with an average of <= 1.5 enhancers per biosample are shown in red. All enrichments reported have an adjusted p-value < 0.05. Genes listed are those with the top 5 most enhancers/biosample and top 5 fewest enhancers/biosample. Fold enrichment is calculated relative to the whole genome background set. The locus plots show enhancers for SMAD7 which averages 14.7 enhancers per biosample (plot range chr18:48780700–49036700) and ZNF317 which averages 0.25 enhancers/biosample (plot range chr19:9,075,000–9,200,000).
Figure 4.
Figure 4.. Linking noncoding GWAS variants to target genes and cell types
a, A schematic of how a risk variant for a disease is linked to a target gene by ENCODE-rE2G + PoPS, a combination of ENCODE-rE2G element-gene link and PoPS gene-level disease prioritization score. b, (left panel) Fraction of fine-mapped GWAS variants across 94 traits and the enrichment of this fraction with respect to control variants and (right panel) precision and recall in identifying the known causal gene for non-coding credible sets, as in Fig. 2f, for the ENCODE-rE2G + PoPS, and ENCODE-rE2G, PoPS and the previously published atlas of enhancer-gene links using ABC-Max. For ENCODE-rE2G + PoPs we intersected the top two ENCODE-rE2G linked genes for predicted enhancers overlapping PIP>0.1 variants in the credible set with the top two PoPs prioritized genes in a 1Mb window around the credible set. For ENCODE-E2G and PoPs alone, we only considered the top one gene for each method. c, Heatmap representing the enrichment of GWAS-finemapped variants (PIP > 0.1) that are linked to a gene by the ENCODE-rE2G + PoPS method for each of 76 GWAS traits (along the columns) and 172 biosamples that are mapped to distinct tissues of origin (grouped along the rows). Only enrichments that are FDR significant (FDR < 0.10) in significance are shown. d, An example fine-mapped causal variant for mean corpuscular hemoglobin, rs875741, linked to the gene CPEB4 (the top-scored PoPS gene in the locus) by ENCODE-rE2G only in hematopoietic progenitors and not in other closely related blood cell types like B-cells, T-cells and K562. Numerical results for each of the Figure panels are reported in Table S13.
Figure 5.
Figure 5.. Molecular features guiding enhancer-gene regulation
a. ENCODE-rE2G features for the PRKAR2B locus. Only features for PRKAR2B E-G pairs are displayed. b. Effect of different feature categories on performance of the ENCODE-rE2G model in classifying K562 enhancer-gene regulatory interactions. Each feature category is removed from the feature set used by ENCODE-rE2G to see the amount of performance reduction. The numbers inside the parentheses show the numbers of the deleted features. Bar plots of delta AUPRC. Bar plots and error bars show the mean of delta AUPRC and 95% confidence intervals when randomly subsampling the EG pairs without replacement (bootstrapping) 1000 times. c. Performance of individual ENCODE-rE2G features at predicting enhancer-gene regulatory connections from CRISPR data. Bars show AUPRC of non-binary features. Error bars represent 95% range of AUPRC values inferred via bootstrap (1000 iterations). d. ENCODE-rE2G’s leave one out performance. Each feature of ENCODE-rE2G is ablated individually and the difference in AUPRC (Delta AUPRC) is calculated. Error bars represent 95% range of AUPRC values inferred via bootstrap (1000 iterations). Features with significant AUPRC reduction (P<0.05) are marked by stars. e. Sequential feature selection performed on ENCODE-rE2G, in which features were selected one at a time based on which one yields the greatest AUPRC. Bars show the change in AUPRC from the previous model including features above it. Gray dots show the total AUPRC for the model including the feature labeled and all above it. See Methods, ‘CRISPRi benchmark’ section for description of statistical test. f. Performance (AUPRC) of Activity-By-Contact (ABC) models using different ENCODE chromatin assays to estimate enhancer activity at predicting experimental results of CRISPRi data in K562 cells. Solid lines correspond to AUPRC of the distance to target TSS baseline predictor. For NCOR1 data from two separate experiments (1, 2) were included. Error bars in barplot represent 95% range of AUPRC values inferred via bootstrap (1000 iterations). g. Performance (AUPRC) of different 3D features at classifying regulatory E-P interactions from CRISPRi perturbations in K562 cells (n = 10,292 pairs). Each feature, apart from ENCODE-rE2G and ENCODErE2Gextended, represents different 3D contact measurements incorporated into the ABC model as the contact component. Error bars represent 95% range of AUPRC values inferred via bootstrap (1000 iterations). h. Precision recall curves showing that the addition of whether a gene ubiquitously expressed or not to the ABC model (hot pink curve) significantly improves on the performance of ABC (blue) (AUPRC increased by 0.044, P < 1.0 × 10−4). Error bars in barplot represent 95% range of AUPRC values inferred via bootstrap (10,000 iterations). See Methods, ‘CRISPRi benchmark’ section for description of statistical test. i. Performance of ENCODE-rE2G models using different assays to measure enhancer activity and 3D contact between elements and gene promoters at predicting results of CRISPRi data in K562 cells. Error bars in barplot represent 95% range of AUPRC values inferred via bootstrap (1000 iterations). See Methods, ‘CRISPRi benchmark’ section for description of statistical test.
Figure 6.
Figure 6.. Super-additive functions of distal enhancers a.
Box plots showing the combined effect size on expression of a given gene from CRISPRi experiments targeting all enhancers of that gene,, by the number of enhancers. b. Acetylation and DNase-seq peaks at the MYC locus in K562. c. Experimental design panel. Individual guides targeting a specific element at the MYC locus are paired with every other guide and transduced into K562-KRAB-dCas9 cells. Effects are measured with CRISPRi-FlowFISH. d. Individual and combined effects on gene expression from perturbing e2 and e3 (left), and perturbing e1 and e7 (right), with the expected effect under additive and multiplicative models annotated. e. Heatmap: Fold change in observed pairwise effects on gene expression versus expected pairwise effects under an additive model from paired CRISPRi screen. f. Heatmap: Normalized Hi-C contact (5 kb resolution) between MYC enhancers in K562 cells. g. Box plots showing the effect of CRISPRi perturbations to an enhancer on H3K27ac at a nearby enhancer stratified by distance,. There is a significant difference in the distributions of perturbation/enhancer distances between 1 and 10 kb and distances greater than 100 kb distance bin (two-sided Wilcoxon rank sum exact test, P=1.79×10−9), between the between distributions for distances between 1 and 10 kb and 10 and 100 kb (P= 7.37×10−6), and between the distributes for distance between 10 kb and 100 kb and greater than 100 kb (P=0.031). See Methods, ‘Data visualization’ section for definition of box plot elements. All data points are shown in addition to box plots. h. Box plots showing effect of CRISPR-Cas9-mediated knockout of enhancers on nearby elements, stratified by distance. There is a significant difference between distributions in each distance group (twosided Wilson rank sum test with continuity correction, 1–10 kb vs 10–100 kb, P = 0.0077; 10–100 kb vs >100 kb, p=0.0026; 1–10 kb vs >100 kb, p=0.00025). See Methods, ‘Data visualization’ section for definition of box plot elements. All data points are shown in addition to box plots.

References

    1. Gasperini M., Tome J. M. & Shendure J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat. Rev. Genet. 21, 292–310 (2020). - PMC - PubMed
    1. van Arensbergen J., van Steensel B. & Bussemaker H. J. In search of the determinants of enhancer–promoter interaction specificity. Trends Cell Biol. 24, 695–702 (2014). - PMC - PubMed
    1. Zaugg J. B. et al. Current challenges in understanding the role of enhancers in disease. Nat. Struct. Mol. Biol. 29, 1148–1158 (2022). - PubMed
    1. Maurano M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012). - PMC - PubMed
    1. Farh K. K.-H. et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015). - PMC - PubMed

Publication types