Nucleic Acids Res. 2017 Jan 9;45(1):54-66.
doi: 10.1093/nar/gkw1061. Epub 2016 Nov 29.

Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction

Florian Schmidt et al. Nucleic Acids Res.

Abstract

The binding and contribution of transcription factors (TFs) to cell-specific gene expression is often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. Thus, it is important to develop computational methods for accurate TF binding prediction in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, to predict TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, histone marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength. Using machine learning, we find that low-affinity binding sites improve our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to a good prediction performance. In our application, gene-based scores computed by TEPIC with one open-chromatin assay nearly reach the quality of several TF ChIP-seq data sets. Finally, these scores correctly predict known transcriptional regulators, as illustrated by the application to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively.

Figures

Figure 1.
The general workflow of TEPIC is as follows: data from an open-chromatin or histone modification ChIP-seq experiment need to be preprocessed to generate a genome segmentation, either by peak or footprint calling. Using the segmentation, TEPIC applies TRAP in all regions of interest and computes TF gene scores using exponential decay to reweigh TF binding predictions in open-chromatin regions based on their distance to a gene's TSS. In addition, the magnitude of the open-chromatin signal is considered to reweigh TF scores in the segmented regions.
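The distance-based reweighing described in this caption can be illustrated with a minimal sketch. It is not the TEPIC implementation: the `Region` class, the `gene_score` function, the window being centred on the TSS, and the decay constant of 5000 bp are all assumptions made for illustration, and the per-region TF affinities (which TEPIC obtains from TRAP) are taken as precomputed inputs.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical container for one open-chromatin region (peak or footprint)."""
    center: int      # genomic midpoint of the region, in bp
    affinity: float  # precomputed TF affinity for the region (e.g. from TRAP)
    signal: float    # average open-chromatin signal within the region


def gene_score(regions, tss, window=50_000, decay=5_000.0, scale_by_signal=False):
    """Sum TF affinities around a TSS, down-weighted exponentially by distance.

    Assumptions (not taken from the text above): the window is centred on the
    TSS, and `decay` is a free parameter in bp controlling how fast the weight
    drops off.
    """
    half_window = window / 2
    score = 0.0
    for region in regions:
        distance = abs(region.center - tss)
        if distance > half_window:
            continue  # region lies outside the window around the TSS
        weight = math.exp(-distance / decay)
        if scale_by_signal:
            # Optionally reweigh by the magnitude of the open-chromatin signal.
            weight *= region.signal
        score += region.affinity * weight
    return score


# Toy example: three regions near a gene whose TSS is at position 1000.
regions = [
    Region(center=1_000, affinity=2.0, signal=3.0),
    Region(center=6_000, affinity=1.0, signal=0.5),
    Region(center=100_000, affinity=5.0, signal=1.0),  # outside the window
]
unscaled = gene_score(regions, tss=1_000)
scaled = gene_score(regions, tss=1_000, scale_by_signal=True)
```

With `scale_by_signal=False` a region contributes only through its affinity and its distance to the TSS; with `scale_by_signal=True` the same contribution is additionally multiplied by the region's signal, mirroring the scaled ("-S") setups mentioned in the Figure 2 caption.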
Figure 2.
(A) Mean test correlation achieved in gene expression learning is shown for all tested setups and for all samples. The 50 kb-S setup outperformed all other setups in all samples. We observe that scaling by the average peak intensity seems to work especially well for DNaseI-seq data but less well for NOMe-seq data: the increase of the mean test correlation between 3 kb and 3 kb-S, as well as between 50 kb and 50 kb-S, is higher for the DNaseI-seq samples (GM12878, H1-hESC, HepG2, K562, LiHe1, LiHe2 and LiHe3) than for the NOMe-seq samples (others). (B) The learning performance for all setups with a varying number of considered peaks is shown. This analysis is based on HepG2 data only. An interesting observation is that the curves for the 50 kb approaches saturate at around 400 000 peaks, while the curves for the 3 kb approaches steadily increase until all peaks are included in the model.
Figure 3.
Gene expression learning results in GM12878, H1-hESC, HepG2, and K562 cells are shown for four different annotation setups using either the positions of H3K4me3 or H3K27ac peaks as input for TEPIC. Scores based on H3K4me3 work better than those based on H3K27ac across all samples.
Figure 4.
(A) The scatter plot shows the mean test correlation achieved in gene expression learning using TF affinity scores with TRAP and a hit-based peak annotation computed with Fimo. Clearly, the hit-based scores are outperformed by the TF affinities. (B) The scatter plot shows the mean test correlation achieved in gene expression learning using TEPIC applied on peaks and TF scores computed with Fimo-Prior. In general, TEPIC scores show better performance in the expression prediction than those computed with Fimo-Prior, although both methods perform similarly for several samples. Note that the scaled annotation versions of TEPIC are used in the comparison against Fimo-Prior.
Figure 5.
The scatter plot shows the mean test correlation achieved in gene expression learning using TF affinities computed within JAMM DNaseI-seq peaks and TF affinities computed within a 24 bp window centred at footprints called using HINTBC. On HepG2 and K562, the peak-based approach outperforms the TF-footprints, whereas in GM12878 footprints lead to a better model performance. On average, H1-hESC samples show a slightly better performance using peaks.
Figure 6.
Barplots showing the performance of gene expression learning for HepG2, K562, GM12878 and H1-hESC using several different computational TF scores as well as TF-ChIP-seq data. Although the ChIP-seq data outperformed all computational TF binding prediction methods, TEPIC scores achieved good results compared to all other computationally derived scores. In this figure, the best performing variants of the individual methods are represented.
Figure 7.
Principal component analysis of normalized model coefficients for all samples considered in this study. There is a clear separation of primary human hepatocytes, cell lines and T-cells.
Figure 8.
(A) Venn diagram visualizing the overlap between the liver hepatocyte replicates using the 50 kb-S annotation. In total, 65 factors are shared between the replicates, and only 3, 17 and 19 are selected uniquely. (B) Heatmap listing the top 10 positive and top 10 negative selected features, which are among the 65 shared features in the 50 kb-S setup. TFs labeled with a * could not be validated by literature to be related to hepatocytes.
Figure 9.
(A) Heatmap showing the overlap between the T-cell replicates. There are 53 (39%) factors shared between all T-cell samples. (B) The top 10 positive and top 10 negative features among the 53 shared ones are listed here. TFs labeled with a * could not be validated by literature to be related to regulation in T-cells. For the others, we were able to find literature that relates those factors to T-cells (see Supplementary Table S4).
