Nucleic Acids Res. 2017 Jan 9;45(1):54-66.

doi: 10.1093/nar/gkw1061. Epub 2016 Nov 29.

Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction

Florian Schmidt¹ ², Nina Gasparoni³, Gilles Gasparoni³, Kathrin Gianmoena⁴, Cristina Cadenas⁴, Julia K Polansky⁵, Peter Ebert² ⁶, Karl Nordström³, Matthias Barann⁷, Anupam Sinha⁷, Sebastian Fröhler⁸, Jieyi Xiong⁸, Azim Dehghani Amirabad¹ ² ⁶, Fatemeh Behjati Ardakani¹ ², Barbara Hutter⁹, Gideon Zipprich¹⁰, Bärbel Felder¹⁰, Jürgen Eils¹⁰, Benedikt Brors⁹, Wei Chen⁸, Jan G Hengstler⁴, Alf Hamann⁶, Thomas Lengauer², Philip Rosenstiel⁷, Jörn Walter³, Marcel H Schulz¹¹ ²

Affiliations

¹ Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany.
² Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany.
³ Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany.
⁴ Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany.
⁵ Experimental Rheumatology, German Rheumatism Research Centre, Berlin, 10117, Germany.
⁶ International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany.
⁷ Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany.
⁸ Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany.
⁹ Applied Bioinformatics, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany.
¹⁰ Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany.
¹¹ Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany mschulz@mmci.uni-saarland.de.

PMID: 27899623
PMCID: PMC5224477
DOI: 10.1093/nar/gkw1061

Florian Schmidt et al. Nucleic Acids Res. 2017.

Abstract

The binding and contribution of transcription factors (TFs) to cell-specific gene expression are often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. Thus, it is important to develop computational methods for accurate TF binding prediction in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, to predict TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, histone marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses open-chromatin/HM signal intensity as a quantitative measure of TF binding strength. Using machine learning, we find that low-affinity binding sites improve our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to good prediction performance. In our application, gene-based scores computed by TEPIC with one open-chromatin assay nearly reach the quality of several TF ChIP-seq data sets. Finally, these scores correctly predict known transcriptional regulators, as illustrated by the application to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively.

Figures

**Figure 1.**
The general workflow of *TEPIC* is as follows: Data from an open-chromatin or histone modification ChIP-seq experiment need to be preprocessed to generate a genome segmentation, either by peak or footprint calling. Using the segmentation, *TEPIC* applies TRAP in all regions of interest and computes TF gene scores, using exponential decay to reweigh TF binding predictions in open-chromatin regions based on their distance to a gene's TSS. In addition, the magnitude of the open-chromatin signal is considered to reweigh TF scores in the segmented regions.
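The distance-based reweighting described in this caption can be illustrated with a minimal sketch. It assumes that per-region TF affinities have already been computed (for example with TRAP) and that a gene score is the sum of region affinities weighted by an exponential decay in the distance to the TSS, optionally multiplied by the region's open-chromatin signal. The `Region` structure, the `gene_score` function, the interpretation of the window as a total width centred at the TSS, and the decay constant of 5000 bp are illustrative assumptions, not the TEPIC implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """One open-chromatin region (hypothetical structure for this sketch)."""
    center: int           # genomic coordinate of the region midpoint
    affinity: float       # precomputed TF affinity in the region, e.g. from TRAP
    signal: float = 1.0   # average open-chromatin signal within the region


def gene_score(regions, tss, window=50_000, decay=5_000.0, scale_by_signal=False):
    """Aggregate region-level TF affinities into one TF score for a gene.

    Regions farther than window / 2 from the TSS are ignored. The window is
    assumed to be centred at the TSS and the decay constant is an assumed value.
    """
    half_window = window / 2
    score = 0.0
    for region in regions:
        distance = abs(region.center - tss)
        if distance > half_window:
            continue  # region lies outside the window around the TSS
        weight = math.exp(-distance / decay)  # exponential decay with distance
        if scale_by_signal:
            weight *= region.signal  # additionally reweigh by signal magnitude
        score += region.affinity * weight
    return score


# Toy input: three regions near a gene whose TSS is at position 1000.
regions = [
    Region(center=1_000, affinity=2.0, signal=3.0),
    Region(center=6_000, affinity=1.0, signal=0.5),
    Region(center=100_000, affinity=5.0),  # outside the 50 kb window
]
unscaled = gene_score(regions, tss=1_000)
scaled = gene_score(regions, tss=1_000, scale_by_signal=True)
```

Setting `window=3_000` or `window=50_000` with `scale_by_signal` off or on would loosely correspond to the 3 kb, 3 kb-S, 50 kb and 50 kb-S setups named in the later captions, under the same assumptions.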

**Figure 2.**
(A) Mean test correlation achieved in gene expression learning is shown for all tested setups and for all samples. The 50 kb-S setup outperformed all other setups in all samples. We observe that scaling by the average peak intensity seems to work especially well for DNaseI-seq data, but not as well for NOMe-seq data: the increase in mean test correlation between 3 kb and 3 kb-S, as well as between 50 kb and 50 kb-S, is higher for the DNaseI-seq samples (GM12878, H1-hESC, HepG2, K562, LiHe1, LiHe2 and LiHe3) than for the NOMe-seq samples (others). (B) The learning performance for all setups with a varying number of considered peaks is shown. This analysis is based on HepG2 data only. An interesting observation is that the curves for the 50 kb approaches saturate at around 400 000 peaks, while the curves for the 3 kb approaches increase steadily until all peaks are included in the model.

**Figure 3.**
Gene expression learning results in GM12878, H1-hESC, HepG2, and K562 cells are shown for four different annotation setups using either the positions of H3K4me3 or H3K27ac peaks as input for TEPIC. Scores based on H3K4me3 work better than those based on H3K27ac across all samples.

**Figure 4.**
(A) The scatter plot shows the mean test correlation achieved in gene expression learning using TF affinity scores computed with TRAP and a hit-based peak annotation computed with Fimo. Clearly, the hit-based scores are outperformed by the TF affinities. (B) The scatter plot shows the mean test correlation achieved in gene expression learning using *TEPIC* applied to peaks and TF scores computed with *Fimo-Prior*. In general, *TEPIC* scores show better performance in the expression prediction than those computed with *Fimo-Prior*, although both methods perform similarly for several samples. Note that the scaled annotation versions of *TEPIC* are used in the comparison against *Fimo-Prior*.

**Figure 5.**
The scatter plot shows the mean test correlation achieved in gene expression learning using TF affinities computed within JAMM DNaseI-seq peaks and TF affinities computed within a 24 bp window centred at footprints called using HINTBC. On HepG2 and K562, the peak-based approach outperforms the TF-footprints, whereas in GM12878 footprints lead to a better model performance. On average, H1-hESC samples show a slightly better performance using peaks.

**Figure 6.**
Barplots showing the performance of gene expression learning for HepG2, K562, GM12878 and H1-hESC using several different computational TF scores as well as TF-ChIP-seq data. Although the ChIP-seq data outperformed all computational TF binding prediction methods, *TEPIC* scores achieved good results compared to all other computationally derived scores. In this figure, the best performing variants of the individual methods are represented.

**Figure 7.**
Principal component analysis of normalized model coefficients for all samples considered in this study. There is a clear separation of primary human hepatocytes, cell lines and T-cells.

**Figure 8.**
(A) Venn diagram visualizing the overlap between the liver hepatocyte replicates using the 50 kb-S annotation. In total, 65 factors are shared between the replicates, and only 3, 17 and 19 are selected uniquely. (B) Heatmap listing the top 10 positive and top 10 negative selected features, which are among the 65 shared features in the 50 kb-S setup. TFs labeled with a * could not be validated by literature to be related to hepatocytes.

**Figure 9.**
(A) Heatmap showing the overlap between the T-cell replicates. There are 53 (39%) factors shared between all T-cell samples. (B) The top 10 positive and top 10 negative features among the 53 shared ones are listed here. TFs labeled with a * could not be validated by literature to be related to regulation in T-cells. For the others, we were able to find literature that relates those factors to T-cells (see Supplementary Table S4).
