PLoS One. 2010 Mar 22;5(3):e9722.

doi: 10.1371/journal.pone.0009722.

Dinucleotide weight matrices for predicting transcription factor binding sites: generalizing the position weight matrix

Rahul Siddharthan¹

PMID: 20339533
PMCID: PMC2842295
DOI: 10.1371/journal.pone.0009722

Rahul Siddharthan. PLoS One. 2010.

Affiliation

¹ The Institute of Mathematical Sciences, Chennai, Tamil Nadu, India. rsidd@imsc.res.in

Abstract

Background: Identifying transcription factor binding sites (TFBS) in silico is key to understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as "position weight matrices" (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif, and predictions based on PWMs are usually not very specific to known functional sites. The analysis here of binding sites in yeast suggests that correlation of dinucleotides is not limited to near neighbours but can extend over considerable gaps.
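The independence assumption mentioned above can be made concrete with a minimal sketch of PWM log-odds scoring. This is an illustration only: the function names, the pseudocount of 1, the uniform background of 0.25, and the toy sites are all assumptions, not taken from the paper.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Column-wise base probabilities from equal-length aligned sites.

    The pseudocount is an assumed smoothing choice, not the paper's.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds(pwm, seq, background=0.25):
    """Sum of per-position log-odds.

    Each position contributes its own term, so the score treats
    positions as independent of one another.
    """
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))

# Toy aligned sites (hypothetical, not real binding data).
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
pwm = build_pwm(sites)
print(log_odds(pwm, "TGACTC"), log_odds(pwm, "AAAAAA"))
```

Because the score is a plain sum over columns, no term can reward or penalise a particular combination of bases at two different positions; that is the limitation the dinucleotide model is meant to address.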

Methodology/principal findings: I describe a straightforward generalization of the PWM model that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region and does not attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a "dinucleotide weight matrix" (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in the precision of predicting known targets. In most cases, significant further improvement arises from extending the commonly defined "core motifs" by about 10 bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the "signature" in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region.
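As a rough illustration of the tabulation step only, the sketch below counts dinucleotides at every pair of positions in a set of aligned sites. The abstract does not give the scoring formula, so none is attempted here; the function names and toy sites are hypothetical.

```python
from collections import Counter
from itertools import combinations

def dinucleotide_counts(sites):
    """Count base pairs at every position pair (i, j), i < j, of aligned sites."""
    width = len(sites[0])
    counts = {pair: Counter() for pair in combinations(range(width), 2)}
    for site in sites:
        for i, j in counts:
            counts[(i, j)][site[i] + site[j]] += 1
    return counts

def first_base_marginal(counts, i, j):
    """Sum the (i, j) table over the second base to recover counts at position i."""
    marginal = Counter()
    for dinucleotide, n in counts[(i, j)].items():
        marginal[dinucleotide[0]] += n
    return marginal

# Toy aligned sites (hypothetical, not real binding data).
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
counts = dinucleotide_counts(sites)

# Every table involving position 3 implies the same single-base counts there,
# which is one way to see that the entries are not independent of each other.
print(first_base_marginal(counts, 3, 4))
print(first_base_marginal(counts, 3, 5))
```

A motif of width w yields w(w-1)/2 such tables, and tables that share a position are constrained to agree on its marginal. Simply multiplying all entries together as if they were independent probabilities would therefore overcount the evidence.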

Conclusion/significance: While computationally more demanding and slower than PWM-based approaches, this dinucleotide method is straightforward, both conceptually and in implementation, and can serve as a basis for future improvements.

Conflict of interest statement

Competing Interests: The author has declared that no competing interests exist.

Figures

**Figure 1. The distribution of gaps in correlated dinucleotide pairs in yeast TFs, as described in the text.**
The graph on top shows the full distribution, and the graph below shows only those pairs that are sufficiently abundant (either the predicted or actual number being at least 30% of the total). The green “normalised” bars include a correction for there being fewer possible pairs with larger “gaps”. With this correction, the graphs are more uniform.
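One plausible form of such a correction is sketched below; the paper's exact normalisation is not given here. It assumes that the "gap" is the number of positions lying between the two members of a pair, so that adjacent positions have gap 0, and it divides each raw count by the number of position pairs that could have that gap. All names are illustrative.

```python
def possible_pairs(width, gap):
    """Number of position pairs in a motif of this width separated by `gap`.

    Assumes gap = j - i - 1 for positions i < j (adjacent positions: gap 0).
    """
    return max(0, width - gap - 1)

def normalise_gap_counts(observed, widths):
    """Divide raw counts per gap by the number of pairs available at that gap.

    observed: {gap: number of correlated pairs seen with that gap}
    widths:   motif widths of the factors that were pooled
    """
    normalised = {}
    for gap, count in observed.items():
        available = sum(possible_pairs(w, gap) for w in widths)
        if available:
            normalised[gap] = count / available
    return normalised

# Toy numbers: raw counts fall with gap, but so does the number of possible pairs.
print(normalise_gap_counts({0: 10, 4: 2}, widths=[6]))
```

In this toy case the raw counts differ fivefold, yet the normalised values are equal, which is the kind of flattening the caption describes.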

**Figure 2. The relative performance of PWMs and DWMs in predicting binding targets in yeast.**
The figure shows Pearson correlation coefficients of binding site predictions with ChIP binding *p*-values reported by Harbison *et al.*, using the “raw” position weight matrices from MacIsaac *et al.*, dinucleotide weight matrices with the same “width” as the “raw” matrices, and dinucleotide weight matrices with a 10 bp “flanking sequence” on either side of the input matrices. Details are in Materials and Methods.

**Figure 3. The precision, as a function of sensitivity, of PWMs and DWMs in predicting targets from MacIsaac *et al.***
The precision is the fraction of predictions above a certain log-odds cutoff that correspond to documented target genes. The sensitivity is the fraction of known targets that are predicted above that cutoff. These are for the same benchmark data as in Figure 2.
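The two definitions in this caption can be written out as a short sketch. The gene names, scores, and function name are hypothetical, and the use of a strict "greater than" comparison at the cutoff is an assumption.

```python
def precision_sensitivity(scores, known_targets, cutoff):
    """Precision and sensitivity of target predictions at a log-odds cutoff.

    scores:        {gene: best log-odds score for that gene}
    known_targets: set of documented target genes
    """
    predicted = {gene for gene, score in scores.items() if score > cutoff}
    hits = predicted & known_targets
    precision = len(hits) / len(predicted) if predicted else 0.0
    sensitivity = len(hits) / len(known_targets) if known_targets else 0.0
    return precision, sensitivity

# Toy data (hypothetical): g5 is a known target that was never scored.
scores = {"g1": 9.0, "g2": 7.5, "g3": 4.0, "g4": 1.0}
known = {"g1", "g3", "g5"}
print(precision_sensitivity(scores, known, cutoff=5.0))
```

Sweeping the cutoff from high to low traces out a precision-versus-sensitivity curve of the kind plotted in the figure.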

**Figure 4. The performance of different methods on individual site predictions in yeast.**
For the same benchmark as in Figure 2, this plot shows the fraction of site predictions that agree with annotated sites in SCPD, as a function of the total number of SCPD sites predicted.

**Figure 5. The precision of site predictions in fruitfly.**
For predictions in synthetic sequence embedding binding site footprints from the REDfly database as well as “fake” sites that are samples of PWMs corresponding to the same factors, this plot shows the precision in predicting REDfly sites, that is, the fraction of predictions that overlap with REDfly footprints, as a function of sensitivity, that is, the fraction of real (REDfly) sites that are predicted. Details of the construction of the synthetic sequence are in Materials and Methods.

**Figure 6. The *discriminative* precision of predictions in fruitfly.**
For the same predictions as in Figure 5, this plot shows the “discriminative precision” for REDfly sites, that is, the difference between the fraction of predictions that overlap with REDfly footprints and the fraction of predictions that overlap with “fake” sites, as a function of sensitivity.
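A minimal sketch of this quantity follows, assuming that sites and predictions are half-open `(start, end)` intervals on a single sequence and that any shared base counts as an overlap. Both are assumptions for illustration, and the coordinates are made up.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def discriminative_precision(predictions, real_sites, fake_sites):
    """Fraction of predictions hitting real sites minus fraction hitting fake sites."""
    if not predictions:
        return 0.0
    real_hits = sum(any(overlaps(p, r) for r in real_sites) for p in predictions)
    fake_hits = sum(any(overlaps(p, f) for f in fake_sites) for p in predictions)
    return (real_hits - fake_hits) / len(predictions)

# Toy coordinates (hypothetical): two predictions hit real sites, one hits a fake.
predictions = [(10, 20), (50, 60), (100, 110), (200, 210)]
real_sites = [(15, 25), (55, 70)]
fake_sites = [(105, 120)]
print(discriminative_precision(predictions, real_sites, fake_sites))
```

A method that cannot tell real footprints from PWM-sampled decoys would score near zero on this measure even if its ordinary precision looked respectable.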
