Cell Syst. 2019 Jan 23;8(1):27-42.e6.
doi: 10.1016/j.cels.2018.12.001. Epub 2019 Jan 16.

A De Novo Shape Motif Discovery Algorithm Reveals Preferences of Transcription Factors for DNA Shape Beyond Sequence Motifs


Md Abul Hassan Samee et al. Cell Syst. 2019.

Abstract

DNA shape adds specificity to sequence motifs but has not been explored systematically outside this context. We hypothesized that DNA-binding proteins (DBPs) preferentially occupy DNA with specific structures ("shape motifs") regardless of whether or not these correspond to high information content sequence motifs. We present ShapeMF, a Gibbs sampling algorithm that identifies de novo shape motifs. Using binding data from hundreds of in vivo and in vitro experiments, we show that most DBPs have shape motifs and can occupy these in the absence of sequence motifs. This "shape-only binding" is common for many DBPs and in regions co-bound by multiple DBPs. When shape and sequence motifs co-occur, they can be overlapping, flanking, or separated by consistent spacing. Finally, DBPs within the same protein family have different shape motifs, explaining their distinct genome-wide occupancy despite having similar sequence motifs. These results suggest that shape motifs not only complement sequence motifs but also facilitate recognition of DNA beyond conventionally defined sequence motifs.

Keywords: ChIP-Seq; DNA binding protein; DNA shape; Gibbs sampling; HT-SELEX; algorithm; de novo shape motif discovery; sequence motif; shape motif; shape specificity; shape-only binding; shape-specific binding; transcription factor.

Conflict of interest statement

Declarations of Interests

Authors have no financial interest to declare. KSP is a consultant or advisor for Tenaya Therapeutics, uBiome, and Phylagen. BGB is a founder of Tenaya Therapeutics.

Figures

Figure 1.
Overview of ShapeMF. (A) Shape-motif discovery involves comparing DBP-bound regions (positives; solid lines) to non-bound regions (negatives; dashed lines). For each region (different colors) and each shape feature (e.g., MGW), ShapeMF takes the profile of feature values across the region. The Gibbs sampler then identifies a set of short windows from the positive profiles that have similar patterns of the shape feature. In the second step, this initial set of positive windows is refined so that the resulting windows share a shape pattern that has the maximum accuracy to discriminate between positives and negatives (according to F1/3-score). Finally, this pattern is called a shape-motif if its enrichment in a separately held-out set of positive versus negative sequences is significant (Bonferroni-adjusted hypergeometric p-value < 10^-5). The range of feature values at each position of the window defines the shape-motif. (B) A shape-motif occurs in a sequence if it contains a window whose feature values at every position fit within the ranges defined by the shape-motif. On the other hand, a sequence-motif occurs in a sequence if it contains a window that is significantly similar to the multinomial model defined by the sequence-motif. (C) We visualize sequence logos from the sequences underlying the occurrences of a shape-motif and the range of feature values in the 50-bp regions flanking its occurrences upstream and downstream (both shown below the shape-motif). The X-axis shows the positions along the shape-motif and its flanking regions; the Y-axis shows the range of shape-feature values at each position.
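The occurrence rule in panel (B) and the enrichment test in panel (A) can be illustrated with a minimal Python sketch. This is not the ShapeMF implementation: the function names, the representation of a shape-motif as a list of (low, high) ranges, and the toy values are assumptions made for illustration only, and the Bonferroni adjustment is left out.

```python
from math import comb

def motif_occurs(profile, motif):
    """Return True if some window of `profile` fits the motif at every position.

    profile: list of floats, one shape-feature value (e.g., MGW) per position.
    motif:   list of (low, high) ranges, one per motif position (assumed format).
    """
    width = len(motif)
    for start in range(len(profile) - width + 1):
        window = profile[start:start + width]
        if all(low <= value <= high for value, (low, high) in zip(window, motif)):
            return True
    return False

def hypergeom_enrichment_p(pos_hits, n_pos, neg_hits, n_neg):
    """Upper-tail hypergeometric p-value for motif enrichment in positives.

    Treats the n_pos positives as a draw from all n_pos + n_neg sequences and
    asks how likely it is to see at least `pos_hits` motif-containing sequences
    among them by chance. No multiple-testing correction is applied here.
    """
    total = n_pos + n_neg
    hits = pos_hits + neg_hits
    upper = min(hits, n_pos)
    tail = sum(comb(hits, k) * comb(total - hits, n_pos - k)
               for k in range(pos_hits, upper + 1))
    return tail / comb(total, n_pos)

# Toy example with made-up feature values.
toy_motif = [(4.0, 5.0), (3.0, 4.0), (4.5, 5.5)]
bound_profile = [5.8, 4.2, 3.5, 5.0, 6.1]
unbound_profile = [5.8, 4.2, 3.5, 5.9, 6.1]
print(motif_occurs(bound_profile, toy_motif))
print(motif_occurs(unbound_profile, toy_motif))
print(hypergeom_enrichment_p(pos_hits=2, n_pos=2, neg_hits=0, n_neg=2))
```

In this sketch, a candidate pattern would be kept only if the p-value computed on held-out sequences, after correction for the number of patterns tested, falls below the chosen threshold.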
Figure 2.
Most DBPs have shape-motifs. (A) Heatmap of negative log10-transformed, Bonferroni-corrected p-values for each type of shape-motif of each DBP. White cells indicate no significant motif. ‘K’ or ‘G’ after a DBP’s name denotes K562 or GM12878 cells, respectively. (B) Fraction (%) of DBPs with each type of shape-motif. (C) Length distribution of shape-motifs. Orange and green bars in B and C represent data from K562 and GM12878 cell lines, respectively.
Figure 3.
Shape-motifs learned separately in vitro and in vivo are similar. Each panel shows the ShapeMF shape-motifs of one DBP for one shape feature. X-axis shows the positions along the shape-motif; Y-axis shows the range of shape-feature values at each position of the shape-motif. Shape-motifs derived from ChIP-Seq (purple) and HT-SELEX (green) are superimposed. The similarity score in parentheses is the mean intersection over union (IoU) optimized over the possible alignments between motifs from the two sources.
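The similarity score described above can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes each shape-motif is a list of (low, high) ranges, that IoU is computed per position on those ranges, and that only overlapping positions of an alignment contribute to the mean. The `min_overlap` parameter and both function names are hypothetical.

```python
def interval_iou(a, b):
    """Intersection over union of two closed intervals (low, high)."""
    lo_a, hi_a = a
    lo_b, hi_b = b
    inter = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    union = (hi_a - lo_a) + (hi_b - lo_b) - inter
    if union == 0:
        # Both intervals are single points: identical points count as a match.
        return 1.0 if lo_a == lo_b else 0.0
    return inter / union

def best_alignment_iou(motif_a, motif_b, min_overlap=1):
    """Mean per-position IoU, maximized over ungapped alignments.

    `shift` places position j of motif_b at position j + shift of motif_a.
    Only alignments with at least `min_overlap` shared positions are scored.
    """
    best = 0.0
    for shift in range(-(len(motif_b) - min_overlap), len(motif_a) - min_overlap + 1):
        pairs = [(motif_a[i], motif_b[i - shift])
                 for i in range(len(motif_a))
                 if 0 <= i - shift < len(motif_b)]
        if len(pairs) < min_overlap:
            continue
        score = sum(interval_iou(a, b) for a, b in pairs) / len(pairs)
        best = max(best, score)
    return best

# Toy motifs with made-up ranges: the second matches the first at an offset of one.
chip_motif = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]
selex_motif = [(2.0, 3.0), (4.0, 5.0)]
print(best_alignment_iou(chip_motif, selex_motif, min_overlap=2))
```

A small `min_overlap` lets a single well-matched position dominate the score, so in practice one would likely require most of the shorter motif to overlap.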
Figure 4.
Prevalence and genomic properties of shape- and sequence-motifs. (A) Sequence-only, shape-only, and overlapping-sites are located at similar distances relative to ChIP-Seq peak-summits. Each box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data. Orange and green boxes represent data from K562 and GM12878 cell lines, respectively. (B) Heatmap showing the proportion of shape-only, common, and sequence-only peaks. ‘K’ or ‘G’ after a DBP’s name denotes K562 or GM12878 cells, respectively. (C) Histogram showing fraction (%) of DBPs binned according to the fraction (%) of their shape-only peaks genome-wide. Orange and green bars represent data from K562 and GM12878 cell lines, respectively. (D) Fraction (%) of final-round HT-SELEX oligonucleotides and ChIP-Seq peaks that are shape-only, co-occurrence, and sequence-only. ‘S’ or ‘C’ after a DBP’s name denotes HT-SELEX or ChIP-Seq, respectively. (E) Barplots showing the fraction (%) of shape-only peaks and sequence-based (sequence-only plus co-occurrence) peaks in different types of regulatory regions in the K562 cell line. Shape-only peaks are more enriched in putative enhancer regions than are sequence-based peaks; significant enrichments are marked with stars.
Figure 5.
Scenarios of shape- and sequence-motif co-occurrence. Each case is shown with a schematic and an example. A schematic uses a cartoon DNA double helix (from http://veleta.rosety.com) and two boxes representing sequence- and shape-sites. A shape-site can (A) flank a sequence-site from both ends, (B) be flanked by a sequence-site from both ends, (C) flank a sequence-site from one end, or (D) occur with an inter-site gap. For each case, the top panel shows shape-related information: the shape-motif pattern (in inset), the range of shape-feature values in the flanking 50-bp regions (up- and downstream) of occurrences of the shape-motif, and the sequence-logo created from sequences underlying the shape-motif. X-axis shows the positions along the genome (or the shape-motif occurrence); Y-axis shows shape-feature values. The bottom panel shows the known sequence-motif of the corresponding DBP.
Figure 6.
Co-binding DBP pairs often utilize shape-specific binding. For many DBP pairs F1, F2, (A) co-bound peaks lack a sequence-motif of F1 and (B) these often have a shape-motif of F1. (C) DBPs often show preferential utilization of shape-sites in the context of certain other DBPs (pink bars), and these shape-sites show specific spacing biases with the shape- and/or sequence-sites of the co-binding partner (blue bars). (D-K) Shape-motifs suggest models for genomic occupancy of dimer complexes where sequence-motifs are inadequate. (D-F) The shape-motifs for TBX5 and NKX2–5; note that the ProT-motif for NKX2–5 has the consensus sequence of TBX5 in its underlying sequences (also see Figure S9). (G) TBX5 and NKX2–5 shape-sites in a 22-bp DNA sequence (from mouse Nppa promoter) where the two DBPs are known to bind. Crystal structure of the ternary complex comprising TBX5, NKX2–5, and DNA is from our previous study (Luna-Zurita, Stirnimann et al. 2016). (H-K) The shape-motifs enriched under MYC-peaks vs. MYC-unbound MAX-peaks; note that the HelT-motif has the E-box sequence in its underlying sequences, but the underlying sequences overall result in a motif with low specificity.
Figure 7.
bZIP family DBPs utilize distinct shape-motifs and in different combinations. Shape-motifs for ten bZIP proteins; H, M, P, and R denote motifs for HelT, MGW, ProT, and Roll, respectively. X-axis shows the positions along the shape-motif; Y-axis shows the range of shape-feature values at each position of the shape-motif. A feature is omitted for a DBP if the DBP does not have a shape-motif for that feature. DBPs have shape-motifs for different combinations of features. When two DBPs have a shape-motif for the same feature, the motifs may be broadly similar (e.g., FOSL1 and CEBPB HelT) or quite distinct (e.g., FOS1 versus NRF1 ProT).

