BMC Bioinformatics. 2016 Jan 11;17 Suppl 1(Suppl 1):4.
doi: 10.1186/s12859-015-0846-z.

Predicting transcription factor site occupancy using DNA sequence intrinsic and cell-type specific chromatin features

Sunil Kumar et al. BMC Bioinformatics.

Abstract

Background: Understanding the mechanisms by which transcription factors (TFs) are recruited to their physiological target sites is crucial for understanding gene regulation. DNA sequence-intrinsic features such as predicted binding affinity are often poor predictors of in vivo site occupancy and, in any case, cannot explain cell-type specific binding events. Recent reports show that chromatin accessibility, nucleosome occupancy and specific histone post-translational modifications greatly influence TF site occupancy in vivo. In this work, we use machine-learning methods to build predictive models and assess the relative importance of different sequence-intrinsic and chromatin features in the TF-to-target-site recruitment process.

Methods: Our study primarily relies on recent data published by the ENCODE consortium. Five dissimilar TFs assayed in multiple cell types were selected as examples: CTCF, JunD, REST, GABP and USF2. We used two types of candidate target sites: (a) predicted sites obtained by scanning the whole genome with a position weight matrix (PWM), and (b) cell-type specific peak lists provided by ENCODE. Quantitative in vivo occupancy levels in different cell types were based on ChIP-seq data for the corresponding TFs. In parallel, we computed a number of associated sequence-intrinsic and experimental features (histone modifications, DNase I hypersensitivity, etc.) for each site. Machine-learning algorithms were then used in a binary classification and regression framework to predict site occupancy and binding strength, for the purpose of assessing the relative importance of different contextual features.
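The first candidate-site type, predicted sites from a position weight matrix scan, can be illustrated with a minimal sketch. It is not the authors' pipeline: the count matrix, the pseudocount, the uniform background and the score threshold below are all hypothetical values chosen for illustration, and a real scan would run over whole chromosomes with a matrix for the TF of interest.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# The consensus is AAGC; these counts are illustrative, not from the paper.
COUNTS = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert base counts to log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(sequence, pwm, threshold):
    """Return (start, strand, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(pwm[i][base] for i, base in enumerate(site))
            if score >= threshold:
                hits.append((start, strand, score))
    return hits


pwm = log_odds_pwm(COUNTS)
hits = scan("TTAAGCTT", pwm, threshold=5.0)
for start, strand, score in hits:
    print(start, strand, round(score, 2))
```

Each hit would then become one candidate site, to which ChIP-seq occupancy and the sequence-intrinsic and chromatin features are attached.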

Results: We observed striking differences in the feature importance rankings between the five factors tested. PWM scores were amongst the most important features only for CTCF and REST but of little value for JunD and USF2. Chromatin accessibility and active histone marks are potent predictors for all factors except REST. Structural DNA parameters, as well as repressive and gene-body-associated histone marks, are generally of little or no predictive value.

Conclusions: We define a general and extensible computational framework for analyzing the importance of various DNA-intrinsic and chromatin-associated features in determining cell-type specific TF binding to target sites. The application of our methodology to ENCODE data has led to new insights into transcription regulatory processes and may serve as an example for future studies encompassing even larger datasets.

Figures

Fig. 1
Overall workflow. Overall approach for prediction of TF site occupancy
Fig. 2
Classification results in genome-wide predicted sites and ENCODE peak lists. The performance in classifying strong versus weak binding sites is reported as area under the ROC curve. (a) Performance of individual features or feature classes on CTCF sites (predicted sites and ENCODE peak lists) in the K562 cell line. (b) Feature importance assessed by recursive feature elimination (RFE-SVM)
Fig. 3
Regression results with cross-cell-line prediction. The bar plots reflect the predictive ability (Pearson’s correlation coefficient) of regression models trained on one cell line and tested on the same (cross-validation) or another cell line, using four different feature sets on ENCODE peak lists. “Seq” consisted of sequence and annotation features; “Chromatin” features included DGF, various histone marks, PolII and co-factors (Rad21 for CTCF and FOSL1 for JunD); “All” included both of them; “All + Str” included structural features in addition to the other features. Models for CTCF (a, c) and JunD (b, d) were alternatively trained on data from K562 (a, b) or H1hESC (c, d) cells
Fig. 4
Feature importance in regression. The sizes of the colored areas reflect the relative importance of different features in regression models built for CTCF and JunD in different data sets. “p” and “e” on the x-axis denote predicted sites and ENCODE peaks, respectively
Fig. 5
Overlap between predicted sites and ENCODE peaks. The Venn diagrams show the overlap between predicted TFBS lists and ENCODE peak lists for two cell types (K562 and H1hESC) for (a) CTCF and (b) JunD
Fig. 6
Regression of multiple factors in different cell lines. Each scatter plot compares the prediction accuracy of regression models (SVM) trained with two different feature sets. (a) TFBS vs all features, (b) histone marks vs all features, (c) TFBS vs histone marks and (d) TFBS vs DGF feature. “Histones” includes the following seven marks: H3K4me2, H3K4me3, H4K20me1, H3K9ac, H3K27ac, H3K27me3, H3K36me3
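
The figure captions report two evaluation measures: area under the ROC curve for the strong-versus-weak classification task, and Pearson's correlation coefficient for the regression task. The sketch below shows how both can be computed from scratch, assuming binary labels coded as 1 (strong) and 0 (weak). The function names and the toy labels and scores are hypothetical and are not taken from the paper.

```python
import math


def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney formulation.

    Equals the probability that a randomly chosen positive site receives
    a higher score than a randomly chosen negative site (ties count 0.5).
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: 1 = strong site, 0 = weak site, with hypothetical classifier scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(toy_labels, toy_scores)

# Toy regression check: predicted versus observed occupancy values.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(auc, r)
```

For cross-cell-line evaluation as in Fig. 3, the predicted values passed to `pearson` would come from a model trained on one cell line and applied to sites from another.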
