BMC Bioinformatics. 2015 Nov 9;16:375.

doi: 10.1186/s12859-015-0797-4.

Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data

Ralf Eggeling¹,², Teemu Roos³, Petri Myllymäki⁴, Ivo Grosse⁵,⁶

Affiliations

¹ Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany. eggeling@cs.helsinki.fi.
² Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland. eggeling@cs.helsinki.fi.
³ Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland. teemu.roos@cs.helsinki.fi.
⁴ Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland. petri.myllymaki@cs.helsinki.fi.
⁵ Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany. grosse@informatik.uni-halle.de.
⁶ German Center for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany. grosse@informatik.uni-halle.de.

PMID: 26552868
PMCID: PMC4640111
DOI: 10.1186/s12859-015-0797-4

Ralf Eggeling et al. BMC Bioinformatics. 2015.

Abstract

Background: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, was the standard model for this task for more than three decades, but its simple assumptions are increasingly being called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting. Although model classes that dynamically adapt the model complexity to the data have been developed, effective model selection is to date possible only for fully observable data and not, for example, within de novo motif discovery.

Results: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice.

Conclusions: The traditional PWM model indeed appears to be insufficient for inferring realistic sequence motifs, as it is on average outperformed by more complex models that take intra-motif dependencies into account. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we recommend that any modern motif discovery algorithm attempt to take intra-motif dependencies into account.
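The contrast between the PWM independence assumption and a model with first-order intra-motif dependencies can be illustrated with a small sketch. It is not the authors' algorithm: it uses fully observed toy binding sites, plain pseudocount smoothing, and a full first-order inhomogeneous Markov model rather than a parsimonious one. The function names, the toy sequences, and the pseudocount of 1 are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_pwm(sites, pseudocount=1.0):
    """Estimate one independent nucleotide distribution per motif position."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        pwm.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return pwm


def fit_markov1(sites, pseudocount=1.0):
    """Estimate position-specific distributions conditioned on the previous nucleotide."""
    width = len(sites[0])
    model = []
    for pos in range(width):
        # Position 0 has no predecessor, so it uses a single empty context.
        contexts = [""] if pos == 0 else list(ALPHABET)
        counts = {c: Counter() for c in contexts}
        for site in sites:
            ctx = "" if pos == 0 else site[pos - 1]
            counts[ctx][site[pos]] += 1
        dist = {}
        for c in contexts:
            total = sum(counts[c].values()) + pseudocount * len(ALPHABET)
            dist[c] = {a: (counts[c][a] + pseudocount) / total for a in ALPHABET}
        model.append(dist)
    return model


def log_lik_pwm(pwm, seq):
    """Log-likelihood of seq under the position-independent model."""
    return sum(math.log(pwm[i][x]) for i, x in enumerate(seq))


def log_lik_markov1(model, seq):
    """Log-likelihood of seq under the first-order inhomogeneous Markov model."""
    total = 0.0
    for i, x in enumerate(seq):
        ctx = "" if i == 0 else seq[i - 1]
        total += math.log(model[i][ctx][x])
    return total


# Hypothetical toy sites: positions 2 and 3 always co-occur as "AC" or "GT".
sites = ["TACG"] * 5 + ["TGTG"] * 5
pwm = fit_pwm(sites)
markov1 = fit_markov1(sites)

observed, unobserved = "TACG", "TATG"  # "AT" never occurs in the toy data
pwm_gap = log_lik_pwm(pwm, observed) - log_lik_pwm(pwm, unobserved)
markov_gap = log_lik_markov1(markov1, observed) - log_lik_markov1(markov1, unobserved)
print(f"PWM log-likelihood gap:         {pwm_gap:.3f}")
print(f"First-order log-likelihood gap: {markov_gap:.3f}")
```

In this toy setting the PWM cannot separate the observed site from the unobserved combination, because both share the same per-position frequencies, while the first-order model scores the observed site higher. Real motif discovery additionally treats binding site positions as latent, which this sketch does not attempt.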

Figures

**Fig. 1**
Inhomogeneous parsimonious Markov model of order two for a motif of width 15. The nucleotide distribution of each position in the sequence may depend on the dinucleotide at the two previous positions. Parsimonious context trees (PCTs) are used here to reduce the parameter space by merging context sequences into sets of sequences, interpolating between the traditional Markov model (maximal PCT) and the PWM model (minimal PCT). Exemplary PCTs, which cover both special cases and one intermediate case, are shown for position 5, position 11, and position 15. The nodes in these PCTs are colored according to the conditioning random variables they correspond to.
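How merging contexts shrinks the parameter space can be made concrete with a simplified sketch. It represents a PCT only by its leaves, each a pair of nucleotide sets, and checks that the leaves partition all 16 dinucleotide contexts; it does not verify the full tree-structure constraints of a PCT. The intermediate tree below is a hypothetical example, not one of the trees shown in the figure.

```python
from itertools import product

ALPHABET = "ACGT"

# Each leaf is a tuple of nucleotide sets, one per context position
# (nearest predecessor first). A leaf covers every context in the
# Cartesian product of its sets.
maximal_pct = [(frozenset(a), frozenset(b)) for a in ALPHABET for b in ALPHABET]
minimal_pct = [(frozenset(ALPHABET), frozenset(ALPHABET))]
intermediate_pct = [  # hypothetical example, chosen for illustration only
    (frozenset("A"), frozenset(ALPHABET)),
    (frozenset("CG"), frozenset("AC")),
    (frozenset("CG"), frozenset("GT")),
    (frozenset("T"), frozenset(ALPHABET)),
]


def contexts_of(leaf):
    """Expand a leaf into the set of concrete context strings it covers."""
    return {"".join(c) for c in product(*(sorted(s) for s in leaf))}


def is_valid_partition(leaves, order=2):
    """Check that the leaves cover every context of the given order exactly once."""
    all_contexts = {"".join(c) for c in product(ALPHABET, repeat=order)}
    seen = set()
    for leaf in leaves:
        ctxs = contexts_of(leaf)
        if seen & ctxs:
            return False  # two leaves overlap
        seen |= ctxs
    return seen == all_contexts


def free_parameters(leaves):
    """Each leaf holds one distribution over four nucleotides: three free parameters."""
    return len(leaves) * (len(ALPHABET) - 1)


for name, pct in [("maximal", maximal_pct),
                  ("intermediate", intermediate_pct),
                  ("minimal", minimal_pct)]:
    print(name, len(pct), "leaves,", free_parameters(pct), "free parameters")
```

For a single motif position of order two, the maximal tree carries 16 leaves and 48 free parameters, the minimal tree one leaf and 3 free parameters (the PWM case), and any intermediate tree falls in between.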

**Fig. 2**
Sequence logos of data sets without a meaningful motif. In some cases, we find repetitive structures that can hardly be considered transcription factor binding sites.

**Fig. 3**
Intra-motif dependencies and multiple motif occurrences. The two sequence logos on the left show the motif inferred by a PMM1 for the CTCF and CHD2 data sets. After applying a mixture model of two PWM components to the underlying predicted binding sites, we obtain a clustering that can be represented by two sequence logos. For CTCF, we observe that both sequence logos are similar and resemble the original prediction, and the differences between the two logos are just an alternative representation of the dependencies at the 3’ end of the motif. For CHD2, we observe that the two sequence logos are fundamentally different at all positions. Hence, the corresponding binding sites appear to be bound by two different proteins and just co-occur within the same ChIP-seq data set.

**Fig. 4**
Aggregated results of fragment-based classification. The left figure shows the AUC in percent for different models, averaged (i) over all ten cross-validation iterations for each data set and (ii) over all data sets and subgroups thereof. The right figure shows the relative improvement of PMMs of different order over the PWM model according to Ψ_d as defined in Eq. 1, which is also averaged (i) over all cross-validation iterations for each data set and (ii) over all data sets and subgroups thereof.

**Fig. 5**
Data set specific improvements. We show Ψ_d for PMMs of different order for all data sets that contain at least one motif, each averaged over the ten cross-validation iterations. For the vast majority of data sets, we find that taking into account intra-motif dependencies via PMMs improves motif discovery substantially.

**Fig. 6**
Sequence logos and position-specific dependency refinements of several transcription factors. We visualize dependencies of order 1–4 for YY1, NANOG, REST, and USF2 by plotting the traditional sequence logo for each TF and show a position-specific refinement by showing the PCT at one position together with the conditional sequence logos of each leaf in the PCT. The width of the conditional sequence logo is scaled according to the number of sequences in the data that match the particular context, with broad nucleotide stacks representing frequent and narrow nucleotide stacks representing infrequent contexts

Cited by

A map of direct TF-DNA interactions in the human genome.
Gheorghe M, Sandve GK, Khan A, Chèneby J, Ballester B, Mathelier A. Nucleic Acids Res. 2019 Feb 28;47(4):e21. doi: 10.1093/nar/gky1210. PMID: 30517703.

Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network.
Zhang Q, Shen Z, Huang DS. Sci Rep. 2019 Jun 11;9(1):8484. doi: 10.1038/s41598-019-44966-x. PMID: 31186519.

The orientation of transcription factor binding site motifs in gene promoter regions: does it matter?
Lis M, Walther D. BMC Genomics. 2016 Mar 3;17:185. doi: 10.1186/s12864-016-2549-x. PMID: 26939991.

CircularLogo: A lightweight web application to visualize intra-motif dependencies.
Ye Z, Ma T, Kalmbach MT, Dasari S, Kocher JA, Wang L. BMC Bioinformatics. 2017 May 22;18(1):269. doi: 10.1186/s12859-017-1680-2. PMID: 28532394.

InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites.
Eggeling R, Grosse I, Grau J. Bioinformatics. 2017 Feb 15;33(4):580-582. doi: 10.1093/bioinformatics/btw689. PMID: 28035026.
