BMC Bioinformatics. 2015 Nov 9;16:375.
doi: 10.1186/s12859-015-0797-4.

Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data

Ralf Eggeling et al. BMC Bioinformatics.

Abstract

Background: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, was the standard model for this task for more than three decades, but its simple assumptions are increasingly being called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically carries the danger of overfitting. Model classes that dynamically adapt the model complexity to the data have been developed, but effective model selection is to date only possible for fully observable data and not, e.g., within de novo motif discovery.
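The independence assumption of the PWM model can be made concrete with a minimal sketch. The function names, the pseudocount of 1, and the toy binding sites below are illustrative assumptions and are not taken from the paper; the point is only that the log-likelihood of a site decomposes into a sum of per-position terms.

```python
import math

ALPHABET = "ACGT"

def estimate_pwm(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites of equal length."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        # Start every count at the pseudocount so no probability is zero.
        counts = {nt: pseudocount for nt in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({nt: counts[nt] / total for nt in ALPHABET})
    return pwm

def log_likelihood(pwm, seq):
    """Under independence, log P(seq) is a sum of per-position log-probabilities."""
    return sum(math.log(column[nt]) for column, nt in zip(pwm, seq))

# Toy aligned binding sites (hypothetical data, for illustration only).
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = estimate_pwm(sites)
```

Because each position contributes its own term, such a model cannot express that the preferred nucleotide at one position changes with the nucleotides at neighboring positions, which is the limitation the more complex models address.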

Results: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides; second-order models appear to be the significantly better choice.

Conclusions: The traditional PWM model indeed appears to be insufficient for inferring realistic sequence motifs, as it is on average outperformed by more complex models that take intra-motif dependencies into account. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we recommend that any modern motif discovery algorithm attempt to take intra-motif dependencies into account.

Figures

Fig. 1
Inhomogeneous parsimonious Markov model of order two for a motif of width 15. The nucleotide distribution at each position in the sequence may depend on the dinucleotide at the two previous positions. Parsimonious context trees (PCTs) are used here to reduce the parameter space by merging context sequences into sets of sequences, interpolating between the traditional Markov model (maximal PCT) and the PWM model (minimal PCT). Exemplary PCTs, which cover both special cases and one intermediate case, are shown for position 5, position 11, and position 15. The nodes in these PCTs are colored according to the conditioning random variables they correspond to.
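The effect of merging contexts can be sketched as follows. This is a simplified illustration, not the authors' implementation: a real PCT is a tree whose nodes carry sets of nucleotides, whereas the sketch represents only the resulting partition of the 16 dinucleotide contexts as a flat list of sets. All names, the pseudocount, and the toy sites are assumptions made for the example.

```python
from itertools import product

ALPHABET = "ACGT"
# All 16 dinucleotide contexts of an order-two model.
ALL_CONTEXTS = frozenset("".join(p) for p in product(ALPHABET, repeat=2))

def estimate_position(sites, pos, partition, pseudocount=1.0):
    """Estimate one conditional distribution per context set at position pos (pos >= 2)."""
    table = []
    for context_set in partition:
        counts = {nt: pseudocount for nt in ALPHABET}
        for site in sites:
            # The context is the dinucleotide at the two previous positions.
            if site[pos - 2:pos] in context_set:
                counts[site[pos]] += 1
        total = sum(counts.values())
        table.append((context_set, {nt: c / total for nt, c in counts.items()}))
    return table

def conditional_prob(table, context, nt):
    """Look up P(nt | context) via the context set that contains the context."""
    for context_set, dist in table:
        if context in context_set:
            return dist[nt]
    raise KeyError(context)

def free_parameters(partition):
    """Each context set needs len(ALPHABET) - 1 free probabilities."""
    return len(partition) * (len(ALPHABET) - 1)

# Minimal partition: all contexts merged, so the position behaves like a PWM column.
minimal = [ALL_CONTEXTS]
# Maximal partition: every context separate, i.e. a full second-order Markov model.
maximal = [frozenset([c]) for c in sorted(ALL_CONTEXTS)]
# Intermediate partition (hypothetical): split only on whether the directly
# preceding nucleotide is a purine (A/G) or a pyrimidine (C/T).
intermediate = [
    frozenset(c for c in ALL_CONTEXTS if c[1] in "AG"),
    frozenset(c for c in ALL_CONTEXTS if c[1] in "CT"),
]

# Toy sites (hypothetical data): G follows a purine, T follows a pyrimidine.
sites = ["AAGT", "CAGT", "ACTA", "GCTA"]
pwm_like = estimate_position(sites, 2, minimal)
merged = estimate_position(sites, 2, intermediate)
```

In this sketch the three partitions need 3, 6, and 48 free parameters per position respectively, which shows how merging contexts trades model expressiveness against the number of parameters to be estimated.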
Fig. 2
Sequence logos of data sets without a meaningful motif. In some cases, we find repetitive structures such as these, which can hardly be considered transcription factor binding sites.
Fig. 3
Intra-motif dependencies and multiple motif occurrences. The two sequence logos on the left show the motif inferred by a PMM1 for the CTCF and CHD2 data sets. After applying a mixture model of two PWM components to the underlying predicted binding sites, we obtain a clustering that can be represented by two sequence logos. For CTCF, both sequence logos are similar and resemble the original prediction, and the differences between the two logos are just an alternative representation of the dependencies at the 3’ end of the motif. For CHD2, the two sequence logos are fundamentally different at all positions. Hence, the corresponding binding sites appear to be bound by two different proteins and merely co-occur within the same ChIP-seq data set.
Fig. 4
Aggregated results of fragment-based classification. The left figure shows the AUC in percent for different models, averaged (i) over all ten cross-validation iterations for each data set and (ii) over all data sets and subgroups thereof. The right figure shows the relative improvement of PMMs of different order over the PWM model according to Ψ_d as defined in Eq. 1, which is also averaged (i) over all cross-validation iterations for each data set and (ii) over all data sets and subgroups thereof.
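As a rough illustration of the evaluation measure, the AUC of a classifier can be computed as the probability that a randomly chosen positive fragment scores higher than a randomly chosen negative one, and then averaged over cross-validation iterations. The sketch below uses this rank-based definition with hypothetical scores; it is not the authors' evaluation code, and it does not attempt to compute Ψ_d, since Eq. 1 is not reproduced here.

```python
def auc(pos_scores, neg_scores):
    """Fraction of positive/negative pairs ranked correctly; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Hypothetical classifier scores for two cross-validation iterations.
folds = [
    ([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]),
    ([0.7, 0.6], [0.6, 0.1]),
]
fold_aucs = [auc(pos, neg) for pos, neg in folds]
# Average over iterations and report in percent, as in the figure.
mean_auc_percent = 100.0 * mean(fold_aucs)
```

The pairwise loop is quadratic in the number of fragments, which is fine for a small example; for large data sets a sort-based computation would be the more practical choice.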
Fig. 5
Data-set-specific improvements. We show Ψ_d for PMMs of different order for all data sets that contain at least one motif, each averaged over the ten cross-validation iterations. For the vast majority of data sets, taking intra-motif dependencies into account via PMMs improves motif discovery substantially.
Fig. 6
Sequence logos and position-specific dependency refinements of several transcription factors. We visualize dependencies of order 1–4 for YY1, NANOG, REST, and USF2 by plotting the traditional sequence logo for each TF together with a position-specific refinement: the PCT at one position and the conditional sequence logo of each leaf in that PCT. The width of each conditional sequence logo is scaled according to the number of sequences in the data that match the particular context, with broad nucleotide stacks representing frequent contexts and narrow nucleotide stacks representing infrequent contexts.
