[Preprint]. 2023 May 11:2023.05.11.540453.
doi: 10.1101/2023.05.11.540453.

Enhlink infers distal and context-specific enhancer-promoter linkages

Olivier B Poirion et al. bioRxiv. 2023.

Abstract

Enhancers play a crucial role in regulating gene expression, and their functional status can be queried with cell-type precision using single-cell (sc)ATAC-seq. To facilitate analysis of such data, we developed Enhlink, a novel computational approach that leverages single-cell signals to infer linkages between regulatory DNA sequences, such as enhancers and promoters. Enhlink uses an ensemble strategy that integrates cell-level technical covariates to control for batch effects and biological covariates to infer robust condition-specific links and their associated p-values. It can integrate simultaneous gene expression and chromatin accessibility measurements of individual cells profiled by multi-omic experiments for increased specificity. We evaluated Enhlink using simulated and real scATAC-seq data, including those paired with physical enhancer-promoter links enumerated by promoter capture Hi-C and with multi-omic scATAC-/RNA-seq data we generated from the mouse striatum. These examples demonstrated that our method outperforms popular alternative strategies. In conjunction with eQTL analysis, Enhlink revealed a putative super-enhancer regulating key cell type-specific markers of striatal neurons. Taken together, our analyses demonstrate that Enhlink is accurate, powerful, and provides features that can lead to novel biological insights.

Conflict of interest statement

Competing interests: The authors declare that they have no competing interests.

Figures

Figure 1
Enhlink infers linkage by modeling covariates, clusters and the surrounding enhancers. A Chromatin accessibility tracks with enhancer-promoter co-accessibility links inferred with Enhlink from human atrial (aCM) and ventricular (vCM) cardiomyocytes. The enhancer highlighted in blue was previously experimentally validated. B Accuracy scores computed from validated vCM enhancer/promoter pair for the promoter of KCNH2 using scATAC-seq data and compared to their random score distribution. C Enhlink models a target region as a function of its surrounding genomic regions (i.e., enhancers) and biological and technical covariates. Artificial regions are added to reach a sufficient number of variables for computing feature scores and p-values. Enhlink can optionally perform a second-order analysis to identify covariates associated with links. D Enhlink can leverage multi-omics datasets by modeling a target region by either its accessibility or its expression and by intersecting the two resulting sets to identify links shared across both modalities. E Processing time for detecting associations (scenario I) for 200 promoters and their cis (±250 kb) OCR features from the islet dataset using four processes and (scenario II) between one promoter and 260,344 cis and trans OCR features using one process. Processing time (left axis for I and right for II) as a function of the number of threads per process (bottom axis for I and top for II).
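The intersection step described in panel D can be illustrated with a minimal sketch. It assumes each inferred link is stored as an (enhancer, target gene) tuple; that representation, the function name shared_links, and the placeholder identifiers are hypothetical and are not taken from Enhlink's code.

```python
def shared_links(accessibility_links, expression_links):
    """Keep only links found when modeling BOTH modalities.

    Each link is assumed to be an (enhancer_id, target_gene) tuple.
    This tuple layout is an illustrative assumption, not Enhlink's format.
    """
    return set(accessibility_links) & set(expression_links)


# Placeholder identifiers, made up purely for illustration.
atac_links = [("enh_1", "gene_A"), ("enh_2", "gene_A"), ("enh_3", "gene_B")]
rna_links = [("enh_2", "gene_A"), ("enh_3", "gene_B"), ("enh_4", "gene_B")]

common = shared_links(atac_links, rna_links)
```

Requiring support from both modalities trades sensitivity for specificity, which matches the caption's goal of identifying links shared across accessibility and expression.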
Figure 2
Empirically-parameterized simulation demonstrates Enhlink’s high accuracy. A Workflow to simulate promoter-enhancer associations parameterized by experimental data. The accessibilities of a promoter and its associated enhancers across cells are simulated from a single promoter-enhancer pair having a validated association. The simulated promoter accessibilities are derived by randomly shuffling the binary, scATAC-seq-derived accessibilities of the validated promoter across cells. Each simulated enhancer accessibility for a given cell is generated from the simulated promoter accessibility for that cell via a process that probabilistically flips the cell’s chromatin state: from closed to open (parameterized by λopen) or from open to closed (λclose). λopen and λclose are determined from the validated promoter-enhancer pair. The simulated enhancers are then integrated with the surrounding regions used as background. B λopen and λclose distribution parameters inferred from chromatin accessibility of enhancer-promoter pairs previously validated in human scATAC-seq cardiomyocyte cells (Hocker et al. 2021). Pairs involve the promoter KCNH2 or MYL2 as determined in all cells or in the subset of aCM or vCM cells. C f1-score (y axis) of simulated promoter-enhancer pairs as a function of average promoter accessibility and number of cells. Error bars summarize 20 simulated promoters. Each simulated promoter has between two and seven associated simulated enhancers.
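The shuffle-and-flip workflow of panel A can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it treats λopen and λclose as per-cell flip probabilities (the caption only calls them distribution parameters), and the function name simulate_pair and its arguments are hypothetical.

```python
import random


def simulate_pair(promoter, lam_open, lam_close, rng):
    """Simulate one promoter and one linked enhancer as lists of 0/1 per cell.

    Assumption: lam_open and lam_close are Bernoulli flip probabilities.
    The source does not specify the exact distribution used.
    """
    # Shuffle the observed binary promoter accessibilities across cells.
    sim_promoter = list(promoter)
    rng.shuffle(sim_promoter)

    sim_enhancer = []
    for state in sim_promoter:
        # Closed cells may flip to open; open cells may flip to closed.
        flip_prob = lam_open if state == 0 else lam_close
        flipped = rng.random() < flip_prob
        sim_enhancer.append(1 - state if flipped else state)
    return sim_promoter, sim_enhancer


rng = random.Random(0)
observed = [1] * 30 + [0] * 70  # toy input, not real scATAC-seq data
sim_promoter, sim_enhancer = simulate_pair(observed, 0.05, 0.4, rng)
```

Smaller flip probabilities yield enhancers that track the promoter more closely, so in this sketch the two parameters control how strong the simulated association is.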
Figure 3
Enhlink outperforms other strategies for inferring linkage on simulated data. A Summary of existing enhancer-promoter method workflows. Some methods use scATAC-seq only as input (Cicero, Chi2 + FDR), others use scATAC-seq combined with scRNA-seq (Signac, SnapATAC). ArchR has a mechanism for both cases. B Enhlink outperforms ATAC-only methods on 400 simulated promoters and 1800 simulated enhancers generated from scATAC-seq data. The scores are computed from the average performance from each simulated promoter (see Methods). C Enhlink outperforms other ATAC-only methods independently of the promoter accessibility. Accuracy is dependent on the promoter accessibility (x axis) with more accessible promoters leading to better f1-scores. D Enhlink outperforms ATAC + RNA methods on 897 simulated genes and 4090 simulated enhancers inferred from the multiome snRNA-/snATAC-seq data. E Enhlink outperforms other ATAC + RNA methods across average gene expression values. Accuracy is dependent on the gene expression (x axis) with more expressed genes leading to better f1-scores (y axis).
Figure 4
Enhlink outperforms other approaches in retrieving PCHi-C links and mitigates batch effects. A UMAP embedding and cell types of the islet dataset. B Enhlink, Cicero, and Chi2 performance of promoter-enhancer inference in islet snATAC-seq relative to islet PCHi-C. C UMAP embedding and cell types of the adipose dataset. D Enhlink, Cicero, and Chi2 performance of promoter-enhancer inference in adipose snATAC-seq relative to adipose PCHi-C. E Comparison (Mann-Whitney test) of the Enhlink p-value distributions from links intersecting PCHi-C and those not intersecting (control). F Distribution of the batch × link entropy for Cicero, Chi2 and Enhlink from a subset of cells from the islet dataset. Low entropy close to zero indicates links that exist only in a few or a single batch while high entropy indicates links widespread amongst the batches.
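The interpretation of panel F can be made concrete with a short sketch of a per-link entropy over batches. It assumes each link comes with a non-negative support count per batch and uses Shannon entropy with the natural logarithm; how the paper actually derives per-batch support is not stated in this caption, and the function name batch_entropy is hypothetical.

```python
import math


def batch_entropy(support_per_batch):
    """Shannon entropy (in nats) of one link's support across batches.

    Assumption: support_per_batch holds non-negative counts, one per batch.
    """
    total = sum(support_per_batch)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in support_per_batch:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy


single_batch = batch_entropy([12, 0, 0, 0])  # support confined to one batch
widespread = batch_entropy([3, 3, 3, 3])     # support spread evenly
```

A link supported by a single batch scores zero, while a link supported evenly by all four batches reaches the maximum of log(4), which is the contrast the caption draws between batch-specific and widespread links.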
Figure 5
Enhlink reveals chromatin regulation mechanisms of striatum Drd1/Drd2 neurons. A Chromatin accessibility (y axis) with Enhlink-inferred links between the promoters and enhancers for Kcnb2, Gulp1, and Col25a1, three marker genes of Drd1 neurons. B Chromatin accessibility and gene expression profiles per genotype for three enhancers (Kcnb2, Gulp1, and Col25a1). C eQTL logarithm of odds (LOD) scores for SNPs within the boundaries of the three enhancers across the eight DO genotypes. Stars indicate genotypes harboring an alternative allele within an enhancer of Kcnb2, Gulp1, or Col25a1. The star subscript associates LOD scores in panel C with chromatin accessibility and gene expression in panel B. D Distal Enhlink analysis unveils multiple enhancers from the region 500 kb downstream of the Drd1 promoter and linked to the top 10 marker genes of Drd1 neurons (yellow arrows). These genes are also linked to an intronic region of Isl1 (blue arrows), a key gene regulating Drd1/Drd2 genetic programs.
