Bioinformatics. 2024 Jun 3;40(6):btae367.

doi: 10.1093/bioinformatics/btae367.

Supervised learning of enhancer-promoter specificity based on genome-wide perturbation studies highlights areas for improvement in learning

Dylan Barth¹,², Richard Van¹,², Jonathan Cardwell³, Mira V Han¹,²

Affiliations

¹ School of Life Sciences, University of Nevada, Las Vegas, NV 89154, United States.
² Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, NV 89154, United States.
³ Department of Medicine, University of Colorado School of Medicine, Denver, CO 80045, United States.

PMID: 38870532
PMCID: PMC11211214
DOI: 10.1093/bioinformatics/btae367

Dylan Barth et al. Bioinformatics. 2024.

Abstract

Motivation: Understanding the rules that govern enhancer-driven transcription remains a central unsolved problem in genomics. With multiple massively parallel enhancer perturbation assays now published, there are enough data to learn to predict enhancer-promoter (EP) relationships in a data-driven manner.

Results: We applied machine learning to one of the largest enhancer perturbation studies, integrated with transcription factor (TF) and histone modification ChIP-seq. The results uncovered a discrepancy between the prediction of genome-wide data and the prediction of data from targeted experiments. Relative strength of contact was important for prediction, confirming the basic principle of EP regulation. Novel features, such as the density of enhancers and promoters in the genomic region, were found to be important, highlighting our lack of understanding of how other elements in the region contribute to the regulation. Several TF peaks were identified that improved the prediction by identifying the negatives and reducing false positives. In summary, integrating genomic assays with enhancer perturbation studies increased the accuracy of the model and provided novel insights into enhancer-driven transcription.

Availability and implementation: The trained models, data, and the source code are available at http://doi.org/10.5281/zenodo.11290386 and https://github.com/HanLabUNLV/sleps.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Comparison of model performance and data distribution. Precision–recall curves of various models on data from (a) Gasperini2019 outer test folds, (b) Gasperini2019 chromosomes 5, 10, 15, 20, (c) Fulco2019 and (d) Schraivogel2020. The translucent lines show the performance of each model trained on 4 folds. The solid line shows the performance combined across the folds. In all cases except Schraivogel2020, the XGB full model trained on the integrated genomic data outperforms the XGB simple model without TF features, the ABC model and the distance-based model. (e) Comparison of Hi-C contact strength and significance. EP pairs are divided into strong contact (≥0.005) and weak contact (<0.005), and into positive EPs (significant down-regulation) and negative EPs (not significant). (f–h) EPs are binned by the total count of elements, both enhancers and TSSs, in the neighborhood. Legend: weak neg, negative with weak contact; strong pos, positive with strong contact. Note that targeted experiments have a higher proportion of both strong neg and strong pos, indicating stronger contact compared to the genome-wide data.
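Precision–recall curves such as those in panels (a) to (d) can be computed from any set of EP pairs that carry a binary label and a model score. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `precision_recall_points` and the toy labels and scores are illustrative assumptions.

```python
from typing import List, Tuple


def precision_recall_points(
    labels: List[int], scores: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) at each distinct score threshold.

    labels: 1 for a positive EP pair, 0 for a negative one (assumed encoding).
    scores: model score per pair; higher means "more likely positive".
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive EP pair")

    # Rank pairs from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])

    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all pairs tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, tp / (tp + fp), tp / total_pos))
    return points


# Toy data, invented for illustration only.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.3, 0.1]
for threshold, precision, recall in precision_recall_points(toy_labels, toy_scores):
    print(f"score >= {threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Because positives are rare in genome-wide perturbation data, precision–recall is generally more informative than ROC for this kind of comparison.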

**Figure 2.**
Comparison of model performance for genes with at least one positive enhancer. Precision–recall curves of various models on data filtered for genes with at least one positive enhancer from (a) Gasperini2019 chromosomes 5, 10, 15, 20, (b) Fulco2019 and (c) Schraivogel2020. In all cases, the XGB full model trained on the integrated genomic data does not outperform the XGB simple model without TF features or the ABC predictions.

**Figure 3.**
Feature importance. (a) Contact features are defined for the focal contact relative to the neighboring contacts. Neighbors are either all the contacts originating from the enhancer, or all the contacts reaching the TSS. ChIP-seq features are defined based on peaks at the enhancer (_e) or peaks at the TSS (_TSS). (b) Top 32 most important features ranked by their mean absolute SHAP values. Samples appear as points along the horizontal axis and colors correspond to the value of that feature. The position on the x-axis (the SHAP value) indicates whether the feature pushes the prediction toward a positive or a negative outcome. Note that high values of Hi-C contact predict a positive outcome (right of the horizontal axis), while low values of distance predict a positive outcome (right of the horizontal axis). (c) Hi-C contact shows a positive trend with SHAP values, and a negative correlation with distance. (d) If the focal contact is the strongest among all contacts originating from the enhancer (i.e. if the difference to the max is zero), then the EP is predicted to be positive.
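The relative contact features described in panels (a) and (d) can be illustrated with a small sketch. It assumes each EP pair is a tuple of enhancer ID, TSS ID and Hi-C contact strength, and it defines "difference to the max" as the neighborhood maximum minus the focal contact, so the value is zero when the focal contact is the strongest. The feature names `diff_to_max_from_enhancer` and `diff_to_max_to_tss`, the sign convention and the toy values are assumptions, not taken from the paper's code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

EPPair = Tuple[str, str, float]  # (enhancer_id, tss_id, hic_contact)


def relative_contact_features(pairs: List[EPPair]) -> List[Dict[str, float]]:
    """Compare each focal contact with the strongest neighboring contact.

    Neighbors are all contacts originating from the same enhancer, or all
    contacts reaching the same TSS. Contacts are assumed to be non-negative.
    """
    max_from_enhancer: Dict[str, float] = defaultdict(float)
    max_to_tss: Dict[str, float] = defaultdict(float)

    # First pass: strongest contact per enhancer and per TSS.
    for enhancer, tss, contact in pairs:
        max_from_enhancer[enhancer] = max(max_from_enhancer[enhancer], contact)
        max_to_tss[tss] = max(max_to_tss[tss], contact)

    # Second pass: distance of the focal contact from each maximum.
    features = []
    for enhancer, tss, contact in pairs:
        features.append(
            {
                "hic_contact": contact,
                "diff_to_max_from_enhancer": max_from_enhancer[enhancer] - contact,
                "diff_to_max_to_tss": max_to_tss[tss] - contact,
            }
        )
    return features


# Toy EP pairs, invented for illustration only.
toy_pairs = [
    ("e1", "geneA", 0.008),
    ("e1", "geneB", 0.002),
    ("e2", "geneA", 0.004),
]
for pair, feats in zip(toy_pairs, relative_contact_features(toy_pairs)):
    print(pair, feats)
```

In this toy example, the pair e1 to geneA is the strongest contact both from its enhancer and to its TSS, so both differences are zero, which is the case panel (d) associates with a positive prediction.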

**Figure 4.**
Clusters of correlated TF presence. Clusters (k = 12) of TF co-binding were identified using non-negative matrix factorization (NMF) applied to the TF presence matrix at the (a) TSSs and (b) enhancers separately. The TF membership (H matrix) for the top three clusters important for prediction is shown next to their SHAP values.
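To make the NMF step concrete, the sketch below factorizes a small binary presence matrix V (elements by TFs) into W (elements by k) and H (k by TFs), so that each row of H gives the TF membership of one cluster. It uses the generic Lee and Seung multiplicative updates for the squared error, written in plain Python; it is not the authors' implementation, and the function names, the toy matrix and the choice k = 2 are assumptions made for illustration (the caption reports k = 12).

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(
    v: Matrix, k: int, n_iter: int = 200, seed: int = 0, eps: float = 1e-9
) -> Tuple[Matrix, Matrix]:
    """Factorize non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


# Toy presence matrix, invented for illustration: 6 elements x 4 TFs,
# where TFs 0 and 1 co-bind at the first three elements and TFs 2 and 3
# co-bind at the last three.
toy_v = [[1.0, 1.0, 0.0, 0.0]] * 3 + [[0.0, 0.0, 1.0, 1.0]] * 3
toy_w, toy_h = nmf(toy_v, k=2)
print("reconstruction error:", frobenius_error(toy_v, toy_w, toy_h))
for cluster_index, membership in enumerate(toy_h):
    print("cluster", cluster_index, [round(x, 3) for x in membership])
```

For real data a tested library implementation would be the more practical choice; the hand-written updates here only expose what the W and H matrices represent.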

**Figure 5.**
EP detection and regulatory element density in the region. (a) Significant EPs called by Gasperini2019 show stronger Hi-C contact in high-density regions. (b) Mean and median Hi-C contact are lower in high-density regions. (c, d) Absolute counts of False Positives (FP), False Negatives (FN) and True Positives (TP) for ABC and XGB. (e) The proportion of False Negatives increases while the proportion of False Positives decreases with density in ABC models. (f) The proportions of False Positives and False Negatives both increase with density in XGB models. (g, h) Hi-C contact strength for True Negatives (TN), FP, FN, and TP in (g) ABC and (h) XGB models.

**Figure 6.**
Indirect contacts and regulatory element density. (a, b) For EPs with direct contacts, both *enhancer.count.near.TSS* and *TSS.count.near.enhancer* show a negative trend with SHAP values, indicating they are less likely to be predicted positive when there are many other enhancers or TSSs nearby. (c, d) For EPs with indirect contacts, there is a subset of EPs that are more likely to be predicted positive when there are many other enhancers or TSSs nearby. (e, f) *remaining.TSS.contact.from.enhancer* and *remaining.enhancers.contact.to.TSS* summarize alternative contacts in the neighborhood, and are positively correlated with the density of enhancers and TSSs in the genomic region.

**Figure 7.**
Features that differ between positive EPs with strong contact and those with weak contact. Strong pos, positive EP with strong contact (Hi-C ≥ 0.002); weak pos, positive EP with weak contact (Hi-C < 0.002). (a) Hi-C contact. (b) Normalized H3K27ac at the enhancer. (c) TSS count near the enhancer. (d) Remaining TSS contact from the enhancer. (e) Enhancer count near the TSS. (f) Remaining enhancer contact to the TSS.

Grants and funding

1750532/National Science Foundation