BMC Bioinformatics. 2016 Jan 11;17 Suppl 1(Suppl 1):4.

doi: 10.1186/s12859-015-0846-z.

Predicting transcription factor site occupancy using DNA sequence intrinsic and cell-type specific chromatin features

Sunil Kumar^{1,2}, Philipp Bucher^{3,4}

Affiliations

¹ Swiss Institute for Experimental Cancer Research (ISREC), School of Life Sciences, EPFL, Station 15, Lausanne, CH-1015, Switzerland. kumar.sunil@epfl.ch.
² Swiss Institute of Bioinformatics (SIB), EPFL, Station 15, Lausanne, CH-1015, Switzerland. kumar.sunil@epfl.ch.
³ Swiss Institute for Experimental Cancer Research (ISREC), School of Life Sciences, EPFL, Station 15, Lausanne, CH-1015, Switzerland. philipp.bucher@epfl.ch.
⁴ Swiss Institute of Bioinformatics (SIB), EPFL, Station 15, Lausanne, CH-1015, Switzerland. philipp.bucher@epfl.ch.

PMID: 26818008
PMCID: PMC4895346
DOI: 10.1186/s12859-015-0846-z

Sunil Kumar et al. BMC Bioinformatics. 2016.

Abstract

Background: Understanding the mechanisms by which transcription factors (TFs) are recruited to their physiological target sites is crucial for understanding gene regulation. DNA sequence-intrinsic features such as predicted binding affinity are often poor predictors of in vivo site occupancy and, in any case, cannot explain cell-type specific binding events. Recent reports show that chromatin accessibility, nucleosome occupancy and specific histone post-translational modifications greatly influence TF site occupancy in vivo. In this work, we use machine-learning methods to build predictive models and assess the relative importance of different sequence-intrinsic and chromatin features in the TF-to-target-site recruitment process.

Methods: Our study primarily relies on recent data published by the ENCODE consortium. Five dissimilar TFs assayed in multiple cell types were selected as examples: CTCF, JunD, REST, GABP and USF2. We used two types of candidate target sites: (a) predicted sites obtained by scanning the whole genome with a position weight matrix (PWM), and (b) cell-type specific peak lists provided by ENCODE. Quantitative in vivo occupancy levels in different cell types were based on ChIP-seq data for the corresponding TFs. In parallel, we computed a number of associated sequence-intrinsic and experimental features (histone modification, DNase I hypersensitivity, etc.) for each site. Machine-learning algorithms were then used in a binary classification and regression framework to predict site occupancy and binding strength, for the purpose of assessing the relative importance of different contextual features.

Results: We observed striking differences in the feature importance rankings between the five factors tested. PWM scores were among the most important features only for CTCF and REST but were of little value for JunD and USF2. Chromatin accessibility and active histone marks are potent predictors for all factors except REST. Structural DNA parameters, as well as repressive and gene-body-associated histone marks, are generally of little or no predictive value.

Conclusions: We define a general and extensible computational framework for analyzing the importance of various DNA-intrinsic and chromatin-associated features in determining cell-type specific TF binding to target sites. The application of our methodology to ENCODE data has led to new insights into transcription regulatory processes and may serve as an example for future studies encompassing even larger datasets.

Figures

**Fig. 1**
Overall workflow. Overall approach for prediction of TF site occupancy

**Fig. 2**
Classification results in genome-wide predicted sites and ENCODE peak lists. The performance in classifying strong versus weak binding sites is reported as area under the ROC curve. (a) Performance of individual features or feature classes on CTCF sites (predicted sites and ENCODE peak lists) in the K562 cell line. (b) Feature importance assessed by recursive feature elimination (RFE-SVM)
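To make the area-under-ROC measure in this caption concrete, here is a minimal sketch that computes a single-feature AUC from its rank interpretation (the probability that a randomly chosen strong site scores higher than a randomly chosen weak site). It covers only the per-feature evaluation idea, not the RFE-SVM procedure, and the feature names and values are invented toy data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. labels are 1 (strong site) or 0 (weak site).
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = strong binding site, 0 = weak binding site.
labels = [1, 1, 0, 0]
features = {
    "pwm_score": [0.9, 0.8, 0.2, 0.1],       # separates the classes perfectly
    "dnase_signal": [0.9, 0.2, 0.8, 0.1],    # partially informative
    "structural_param": [0.5, 0.5, 0.5, 0.5] # uninformative (all ties)
}

# Rank features by how well each one alone separates strong from weak sites.
ranking = sorted(
    ((name, roc_auc(labels, values)) for name, values in features.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

An AUC of 0.5 corresponds to a feature with no discriminative power, and 1.0 to perfect separation of strong and weak sites.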

**Fig. 3**
Regression results with cross-cell-line prediction. The bar plots reflect the predictive ability (Pearson’s correlation coefficient) of regression models trained on one cell line and tested on the same (cross-validation) or another cell line, using four different feature sets on ENCODE peak lists. “Seq” consisted of sequence and annotation features; “Chromatin” included DGF, various histone marks, PolII and co-factors (Rad21 for CTCF and FOSL1 for JunD); “All” included both; “All + Str” additionally included structural features. Models for CTCF (a, c) and JunD (b, d) were alternatively trained on data from K562 (a, b) or H1hESC (c, d) cells
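The evaluation scheme in this caption (train on one cell line, test on another, score by Pearson's correlation) can be sketched as follows. A one-feature least-squares line is used here purely as a stand-in for the regression models in the study, and all values and variable names are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return [slope * a + intercept for a in x]


# Hypothetical toy data: one chromatin feature and a ChIP-seq signal per site.
cell_a_feature = [1.0, 2.0, 3.0, 4.0]
cell_a_signal = [2.0, 4.0, 6.0, 8.0]
cell_b_feature = [1.0, 2.0, 3.0]
cell_b_signal = [1.0, 2.0, 3.0]

# Train on cell line A, then evaluate on the held-out cell line B.
model = fit_line(cell_a_feature, cell_a_signal)
cross_cell_r = pearson(predict(model, cell_b_feature), cell_b_signal)
```

Because Pearson's coefficient is insensitive to scale and offset, a model can score well across cell lines even when the absolute signal levels differ, as in the toy data above.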

**Fig. 4**
Feature importance in regression. The size of the colored areas reflects the relative importance of different features in regression models built for CTCF and JunD in different data sets. “p” and “e” on the x-axis denote predicted sites and ENCODE peaks, respectively

**Fig. 5**
Overlap between predicted sites and ENCODE peaks. The Venn diagrams show the overlap between predicted TFBS lists and ENCODE peak lists for two cell types (K562 and H1hESC) for (a) CTCF and (b) JunD

**Fig. 6**
Regression of multiple factors in different cell lines. Each scatter plot compares the prediction accuracy of regression models (SVM) trained with two different feature sets. (a) TFBS vs all features, (b) histone marks vs all features, (c) TFBS vs histone marks and (d) TFBS vs DGF feature. “Histones” includes the following seven marks: H3K4me2, H3K4me3, H4K20me1, H3K9ac, H3K27ac, H3K27me3, H3K36me3

Cited by

SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model.
Zhang Y, Chu X, Jiang Y, Wu H, Quan L. Genes (Basel). 2022 Mar 23;13(4):568. doi: 10.3390/genes13040568. PMID: 35456374. Free PMC article.
An interpretable bimodal neural network characterizes the sequence and preexisting chromatin predictors of induced transcription factor binding.
Srivastava D, Aydin B, Mazzoni EO, Mahony S. Genome Biol. 2021 Jan 7;22(1):20. doi: 10.1186/s13059-020-02218-6. PMID: 33413545. Free PMC article.
Cross-Cell-Type Prediction of TF-Binding Site by Integrating Convolutional Neural Network and Adversarial Network.
Lan G, Zhou J, Xu R, Lu Q, Wang H. Int J Mol Sci. 2019 Jul 12;20(14):3425. doi: 10.3390/ijms20143425. PMID: 31336830. Free PMC article.
MTTFsite: cross-cell type TF binding site prediction by using multi-task learning.
Zhou J, Lu Q, Gui L, Xu R, Long Y, Wang H. Bioinformatics. 2019 Dec 15;35(24):5067-5077. doi: 10.1093/bioinformatics/btz451. PMID: 31161194. Free PMC article.
Assessing the model transferability for prediction of transcription factor binding sites based on chromatin accessibility.
Liu S, Zibetti C, Wan J, Wang G, Blackshaw S, Qian J. BMC Bioinformatics. 2017 Jul 27;18(1):355. doi: 10.1186/s12859-017-1769-7. PMID: 28750606. Free PMC article.
