Bioinformatics. 2015 Nov 1;31(21):3445-50.
doi: 10.1093/bioinformatics/btv391. Epub 2015 Jun 30.

A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites


Jichen Yang et al. Bioinformatics.

Abstract

Motivation: The position-weight matrix (PWM) is a useful representation of a transcription factor binding site (TFBS) sequence pattern because the PWM can be estimated from a small number of representative TFBS sequences. However, because the PWM probability model assumes independence between individual nucleotide positions, the PWMs for some TFs poorly discriminate binding sites from non-binding-sites that have similar sequence content. Since the local three-dimensional DNA structure ('shape') is a determinant of TF binding specificity and since DNA shape has a significant sequence-dependence, we combined DNA shape-derived features into a TF-generalized regulatory score and tested whether the score could improve PWM-based discrimination of TFBS from non-binding-sites.
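The independence assumption described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_pwm` and `score`, the pseudocount, the uniform background frequency, and the toy sequences are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Estimate a log-odds PWM from equal-length binding-site sequences.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Count each base at this position, starting from the pseudocount.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency against the background.
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds; each position contributes independently."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, seq))

# Toy binding sites in which positions 1 and 2 always co-vary (AA or CC).
toy_sites = ["AAG", "CCG", "AAG", "CCG"]
toy_pwm = build_pwm(toy_sites)
```

In this toy set the dinucleotide `AC` never occurs at the first two positions, yet `score(toy_pwm, "ACG")` equals `score(toy_pwm, "AAG")` because the PWM sums column scores and cannot represent the dependency between positions. This is the kind of limitation that motivates adding a second, shape-based score.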

Results: We compared a traditional PWM model to a model that combines the PWM with a DNA shape feature-based regulatory potential score, for accuracy in detecting binding sites for 75 vertebrate transcription factors. The PWM+shape model was more accurate than the PWM-only model for 45% of the TFs tested, with no significant loss of accuracy for the remaining TFs.

Availability and implementation: The shape-based model is available as an open-source R package that is archived on the GitHub software repository at https://github.com/ramseylab/regshape/.

Contact: stephen.ramsey@oregonstate.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
Outline of DNA shape-based features, TF-general binding site classifier performance, and structure of PWM+shape model. (A) The five DNA shape parameter-based features. (B) Performance comparison for three classifiers for discriminating TFBS sequences (in general, not specific to a particular TF) from non-binding-site, noncoding sequences (see Section 2.3). SVM, support vector machine (Vapnik et al., 1996); AUC, area under the sensitivity versus false-positive error rate curve (i.e. ‘ROC curve’); ADA, additive logistic regression (Friedman et al., 2000). The AUC of an unbiased random classifier would be 0.5. (C) Data integration strategy
Fig. 2.
Procedure for constructing sets of positive and negative cases for performance evaluation in Experiment 1 (see Section 3.1). Here, cases correspond to PWM-length oligonucleotide sequences sampled from representative binding sites and from non-binding-site, noncoding sequence (see Section 2.1)
Fig. 3.
Combined PWM+shape model improves PPV for discriminating TFBS from non-binding-site sequences, over the PWM-only model, for 53 out of 73 TFs. Bars, standard error (SE, N = 20). Asterisk denotes rejection of null hypothesis of equal means, with α = 0.05 (Welch’s t-test)
Fig. 4.
(A) Diagram of the in silico promoter model for TFBS recognition. (B) Combined PWM+shape model improves average PPV for detection of TFBS within 1 kb in silico promoters, versus a PWM-only model, for 34 out of the 75 TFs that were tested. Bars, SE (90 ≤ N ≤ 900 for each). Asterisk denotes rejection of null hypothesis of equal means, with α = 0.05 (Welch’s t-test). (C) The difference in error rates between the PWM+shape model and the PWM-only model shows that filtering by DNA shape decreases type I error more than it increases type II error, for sequence-based recognition of TFBS. (Inset: scatter plot of the same data shown in the bar plot)
Fig. 5.
The DNA shape score is sensitive to multibase dependencies that are not captured by the PWM. (A) TFBS sequence logo for TF Znf263. (B) PWM submatrix from position 19 to 21 for TF Znf263 (red box in A). (C) PWM scores for Znf263 before and after ‘AAA’ replacement (here, the suffix ‘AAA’ is under-represented among true binding sites). The same PWM model is used for both sets of oligomers. Bar, box and whiskers: median, interquartile range (IQR) and median ± 1.5 IQR. (D) DNA shape score for Znf263 before and after ‘AAA’ replacement. The same model is used for both sets of oligomers
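The captions for Figs 3 and 4 report positive predictive value (PPV) and compare mean PPV between models with Welch's t-test. The two quantities can be sketched in plain Python as follows. This is an illustration only: the helper names `ppv` and `welch_t` and the example numbers are assumptions, not taken from the paper or from its R package.

```python
import math
from statistics import mean, variance

def ppv(true_positives, false_positives):
    """Positive predictive value: the fraction of positive calls that are correct."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("PPV is undefined when no positive calls are made")
    return true_positives / called

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, this does not assume equal variances.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical replicate PPV values for two models (illustrative numbers only).
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Turning the statistic into a p-value for comparison against α = 0.05 requires the t distribution; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs Welch's test directly.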

