BMC Genomics. 2010 Jan 14:11:30.

doi: 10.1186/1471-2164-11-30.

Variable structure motifs for transcription factor binding sites

John E Reid¹, Kenneth J Evans, Nigel Dyer, Lorenz Wernisch, Sascha Ott

Affiliation

¹ MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Cambridge, CB2 0SR, UK. john.reid@mrc-bsu.cam.ac.uk

PMID: 20074339
PMCID: PMC2824720
DOI: 10.1186/1471-2164-11-30

Abstract

Background: Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets.

Results: We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition, our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance.

Conclusions: We have presented a new gapped PWM model for variable length DNA binding sites that is neither too restrictive nor over-parameterised. Our comparison with existing tools shows that, on average, it does not have better predictive accuracy than existing methods. However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.

Figures

**Figure 1**
**Example gapped PWM logo**. An example to demonstrate the gapped PWM model and logo format: A gapped PWM, A, and 2 standard PWMs, B and C, are shown. All three define distributions over 5-mers: note that the last base of C is non-specific and not represented in the logo as it has no information content. The gapped PWM, A, can be viewed as a 70/30 mixture of B and C. That is, 70% of its binding sites look like sites from B and 30% look like sites from C. Put another way: 70% of its sites have a T/C inserted in the centre. The probability of the optional base being inserted in any given binding site is represented in 2 ways: firstly as a percentage written directly onto the logo; secondly, the base is also faded to represent how often it is present.
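The mixture view in this caption can be made concrete with a minimal Python sketch. It assumes a gapped PWM with a single optional column and a uniform background for the position freed when the optional base is absent. The helper names (`column`, `pwm_prob`, `gapped_pwm_prob`) and all probabilities are illustrative assumptions, not values or code from the paper.

```python
from itertools import product

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}


def column(favoured, p=0.7):
    """Hypothetical helper: put probability p on one base, share the rest equally."""
    rest = (1.0 - p) / 3.0
    return {b: (p if b == favoured else rest) for b in BASES}


def pwm_prob(pwm, word):
    """Probability of a word under a standard PWM (independent columns)."""
    prob = 1.0
    for col, base in zip(pwm, word):
        prob *= col[base]
    return prob


def gapped_pwm_prob(columns, gap_index, p_present, word):
    """Probability of a word under a gapped PWM with one optional column.

    With probability p_present the optional column is emitted; otherwise it is
    skipped and the freed final position is filled from a uniform background.
    """
    with_base = columns
    without_base = columns[:gap_index] + columns[gap_index + 1:] + [UNIFORM]
    return (p_present * pwm_prob(with_base, word)
            + (1.0 - p_present) * pwm_prob(without_base, word))


# Illustrative values only: four fixed columns plus an optional T/C in the centre.
optional = {"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5}
columns = [column("A"), column("G"), optional, column("C"), column("T")]

total = sum(gapped_pwm_prob(columns, 2, 0.7, "".join(w))
            for w in product(BASES, repeat=5))
print(round(total, 6))  # the probabilities over all 5-mers sum to 1
```

Setting `p_present` to 1.0 recovers the standard PWM with the base always present, which is one way to check that the gapped model generalises the ungapped one.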

**Figure 2**
**Search method overview**. The input sequences are converted into a suffix tree, which is used to efficiently enumerate over-represented words. These words are tested as possible seeds for an HMM. For each seed we consider a number of different placements of the gap character. High-scoring seeds are used to initialise HMMs, which are trained using the Baum-Welch algorithm. Each trained HMM defines a gapped PWM and these are scored and ranked. The best gapped PWMs are reported as the output of the method.
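The seeding step can be sketched roughly in Python. This is not the authors' implementation: a plain `Counter` over fixed-length words stands in for the suffix tree, and the choice to place the gap character only at interior positions is an assumption. The function names and the `-` gap character are hypothetical.

```python
from collections import Counter


def count_words(sequences, k):
    """Count every k-mer in the input (a simple stand-in for suffix-tree enumeration)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def gap_placements(seed, gap_char="-"):
    """Return one candidate per interior position where a gap could be inserted."""
    return [seed[:i] + gap_char + seed[i:] for i in range(1, len(seed))]


counts = count_words(["ACGACG", "TTACGT"], 3)
seed, _ = counts.most_common(1)[0]   # most frequent word used as a seed
print(seed, gap_placements(seed))
```

In a fuller pipeline each candidate would initialise an HMM for Baum-Welch training; that part is omitted here.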

**Figure 3**
**HMM state transitions**. An example of a typical HMM state transition diagram. This HMM jointly models background sequence and binding sites from a gapped PWM of length 7. State 0 is the background state. The two arms leading out from state 0 generate binding sites on the positive and negative strands. States 1 and 14 are the first states for binding sites generated in the positive and negative directions, respectively. Similarly, states 5 and 12 represent the optional base for binding sites generated in the positive and negative directions, respectively. When training the HMM, various parameters are tied so that they are always equal. For example, the transition parameter from state 0 to state 1 is tied to the parameter for the equivalent transition to state 14. This ensures binding sites are equally likely on both strands of DNA. Similarly, emission parameters are tied so that binding sites on the negative strand have a distribution that is the reverse complement of the distribution of the binding sites on the positive strand.
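The emission tying described in this caption amounts to deriving the negative-strand columns from the positive-strand ones. A minimal Python sketch follows, assuming a PWM stored as a list of base-to-probability dictionaries. The representation and the function name are assumptions for illustration, and the sketch does not model the HMM itself.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement_pwm(pwm):
    """Reverse the column order and swap each base for its complement."""
    return [{COMPLEMENT[base]: p for base, p in col.items()}
            for col in reversed(pwm)]


# Illustrative two-column motif on the positive strand.
positive = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
]
negative = reverse_complement_pwm(positive)
print(negative[0]["G"], negative[1]["T"])
```

Because the negative-strand columns are computed from the positive-strand ones rather than stored separately, the two sets cannot drift apart during training, which is the effect that tying achieves.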

**Figure 4**
**Information content of gapped motifs**. Examples showing how position dependencies induced by gap characters can affect the information content of motifs. Compare the gapped PWM C with the standard PWM D. Here the introduction of a gap has decreased the information content as the distribution over 7-mers is more vague. In contrast, PWM B has a higher information content than PWM D. Whether the gap is present or not, the bases around it remain Ts. Hence PWM B has a much sharper distribution over 7-mers. Note the difference in information content between gapped PWMs A and B. The reason is that A is very close to its own reverse complement whereas B is not. Hence A has a sharper distribution than B. All the information contents were calculated relative to a uniform 0-order Markov model.
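For an ungapped PWM, information content relative to a uniform 0-order background is the sum over columns of the relative entropy in bits. The Python sketch below covers only that standard case. It does not reproduce the paper's calculation for gapped PWMs, which works with the full distribution over 7-mers, and the function names are hypothetical.

```python
import math


def column_information(col, background=0.25):
    """Relative entropy (bits) of one column against a uniform background."""
    return sum(p * math.log2(p / background) for p in col.values() if p > 0)


def pwm_information(pwm):
    """Information content of a standard PWM: columns are independent, so sum them."""
    return sum(column_information(col) for col in pwm)


fixed = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}         # fully conserved base
vague = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}     # non-specific base
print(pwm_information([fixed, vague]))
```

A fully conserved column contributes 2 bits and a uniform column contributes none, which is why a non-specific position does not appear in a logo.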

**Figure 5**
**Analysis of TRANSFAC binding sites**. Realignment of sequences used within TRANSFAC to define PWMs which incorporate optional gaps. **Left:** An alignment, a standard PWM and a gapped PWM for the monomer transcription factor MEF-2. Additional gaps improve the alignment right across the motif, especially the well conserved TA motif that is not apparent in the ungapped alignment. **Right:** An alignment, a standard PWM and a gapped PWM for the homodimer transcription factor POU. Additional gaps improve the alignment of the conserved ATA and TTA motifs. The realignments show a significant proportion of sites both with and without gaps. The upper logos show the original TRANSFAC motifs and their information content in bits (see Methods for the details of the calculation). The lower logos show the motifs after the addition of gaps, indicated by the percentage of sequences where a nucleotide is present.

**Figure 6**
**p53 results**. **Top**: ROC curves for cross-validation on p53 data set using random genomic sequences as counter-examples. **Bottom**: A known TRANSFAC motif for p53 and the motifs our gapped method and MEME found. Using our method, 3% of the sites discovered had an optional spacer between the 2 half-sites. This is a close fit to Wei et al.'s analysis. They found 236 sites without a spacer and 27 that had a 1 base pair spacer.

**Figure 7**
**Stat5 results**. A known TRANSFAC motif (M00459) for Stat5 and the motifs our method found in the Stat5a and Stat5b data sets.

**Figure 8**
**NRSF results**. ROC curves for cross-validation on NRSF data set using random promoter sequences as counter-examples.

**Figure 9**
**Sp1 results**. **Top left**: ROC curves for cross-validation on Sp1 data set using shuffled versions of the held-out test sequences as counter-examples. **Top right**: ROC curves for the motifs found on the small data set when applied to a large Sp1 binding data set from TRANSFAC. The AUC statistics are given in the legend. **Bottom**: A known TRANSFAC motif for Sp1 (the reverse complement of M00196) and the motifs found by the methods we tested. In our model, 32% of the binding sites will have a T inserted after the fifth base. Note that modelling this optional base allows our method to avoid some ambiguity which is present in the Cs preceding the central G in the TRANSFAC motif.

**Figure 10**
**Overall results**. **Top left**: ROC curves for cross-validation across all the data sets using shuffled versions of the held-out test sequences as counter-examples. **Top right**: ROC curves for cross-validation across all the data sets using randomly selected genomic sequences as counter-examples. **Middle**: ROC curves for cross-validation across all the data sets using randomly selected promoters as counter-examples. **Bottom**: AUC and AUC50 statistics for the methods.
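As a reminder of what the AUC statistic measures, here is a minimal Python sketch. It uses the rank-based identity that AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen counter-example, with ties counted as one half. The scores are made up, the function name is hypothetical, and AUC50 is not covered.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative scores: binding sequences versus shuffled counter-examples.
binding = [0.9, 0.4]
shuffled = [0.5, 0.1]
print(auc(binding, shuffled))
```

The pairwise loop is quadratic in the number of sequences, so a sort-based computation would be preferable for large data sets.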

**Figure 11**
**C/EBP binding sites**. Binding sites for C/EBP.

Grants and funding

MC_U105260799/MRC_/Medical Research Council/United Kingdom