Mol Cell. 2025 Aug 7;85(15):2900-2918.e16.
doi: 10.1016/j.molcel.2025.07.005.

Chromatin-dependent motif syntax defines differentiation trajectories

Sevi Durdu et al. Mol Cell. 2025.

Abstract

Transcription factors (TFs) recognizing DNA motifs within regulatory regions drive cell identity. Despite recent advances, their specificity remains incompletely understood. Here, we address this by contrasting two TFs, Neurogenin-2 (NGN2) and MyoD1, which recognize ubiquitous E-box motifs yet drive distinct cell fates toward neurons and muscles, respectively. Upon induction in mouse embryonic stem cells, we monitor binding across differentiation, employing an interpretable machine learning approach that integrates preexisting DNA accessibility. This reveals a chromatin-dependent motif syntax, delineating both common and factor-specific binding, validated by cellular and in vitro assays. Shared binding sites reside in open chromatin, locally influenced by nucleosomes. In contrast, factor-specific binding in closed chromatin involves NGN2 and MyoD1 acting as pioneer factors, influenced by motif variant frequencies, motif spacing, and interaction partners, which together account for subsequent lineage divergence. Transferring our methodology to other models demonstrates how a combination of opportunistic binding and context-specific chromatin opening underpins TF specificity, driving differentiation trajectories.

Keywords: E-box; cell differentiation; chromatin accessibility; gene regulation; machine learning; motif syntax; motif variants; pioneer factors; predictive models; transcription factor specificity.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
NGN2 and MyoD1 induce neurogenesis and myogenesis, respectively, when expressed in mESCs despite recognizing similar short DNA sequence motifs (A) Immunofluorescence images of mESCs prior to and 2 days post-induction of NGN2 and MyoD1, resulting in distinct cell types. Nuclear marker Hoechst in magenta, neuronal marker TubIII, and myocyte marker α-actinin in green. Scale bar, 20 μm. The logos represent the E-box motif, de novo identified by HOMER, considering the 500 most enriched ChIP-seq peaks (Figures S2A and S2B). (B) Differentiation trajectories illustrated by principal-component analysis (PCA) of the transcriptomes for mESCs (blue), NGN2 (green), MyoD1 (purple), and noTF (gray) control induction. (C) Genome browser tracks of the Dll1 gene locus with E-box motifs (black bars) illustrating shared and factor-specific binding by NGN2 and MyoD1 at 6 h.
Figure 2
DNA accessibility and cognate-motif occurrences are major factors underlying initial binding patterns (A) Density plot showing preexisting DNase-seq (DHS 0 h) and 6 h post-induction NGN2 ChIP-seq enrichments in all 1.3 million NGN2-motif-containing genomic regions (500 bp with minimum one non-overlapping cognate-motif, described in Figure S2E). Percentage of total sites in each quarter indicates that the majority of cognate motifs reside in closed chromatin and will not be bound by NGN2 upon induction, yet the heterochromatic but bound regions make up one third of all NGN2 binding. (B) Example cognate-motif sites for open-bound, closed-bound, open-unbound, and closed-unbound categories are visualized as genomic tracks of 0 h DNase-seq, NGN2 motifs, 6 h NGN2 ChIP-seq, and control GFP ChIP-seq. (C and D) Percentage of bound cognate-motif sites (enrichment > 1.5) grouped in bins according to their accessibility prior to induction, showing higher likelihood of binding (>50%) in open regions and low likelihood (<1%) in closed regions, indicating the impact of DNA accessibility. (E and F) Frequency of multiple motifs (gray: one, light-blue: two, and dark-blue: three or more cognate motifs) in consensus ChIP-seq peaks and rest of the cognate-motif sites in the genome, grouped in bins according to their accessibility prior to induction (e.g., accessibility < 6), showing closed-bound sites contain more multi-motifs. (G and H) Fitted line plots of average ChIP-seq enrichment versus DHS 0 h at single cognate-motif sites (residing in consensus ChIP-seq peaks and their matched backgrounds), grouped by the frequent central nucleotides of their E-box cognate motifs. The separation in the curves of motif variants at a given accessibility illustrates the difference in the likelihood to be bound. 
(I) Heatmap of combined NGN2 and MyoD1 peaks, split in three groups: strongly bound by both factors (N and M), exclusively bound by one factor (N or M), and preferentially bound by one of the two (N > M or M > N), visualizing DHS 0 h (red), NGN2 ChIP-seq (green), MyoD1 ChIP-seq (purple), and cognate-motif variants with TA, GA, GC, and GG central nucleotides (black). (J) Genomic regions with exclusive or overlapping NGN2 and MyoD1 binding, exemplifying the effect of chromatin accessibility and occurrence of cognate-motif variants (green: NGN2-preferred, purple: MyoD1-preferred) present at the peak centers.
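The binning described for panels (C) and (D) can be illustrated with a minimal sketch. It assumes each cognate-motif site is given as a pair of pre-induction accessibility and ChIP-seq enrichment, and it reuses the enrichment > 1.5 cutoff stated in the caption. The function name, the bin edges, and the toy values are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def bound_fraction_by_accessibility(sites, bin_edges, threshold=1.5):
    """Return {bin index: fraction of sites with enrichment > threshold}.

    sites     : iterable of (accessibility, chip_enrichment) pairs (toy input format)
    bin_edges : ascending edges; bin i holds sites with exactly i edges <= accessibility
    threshold : enrichment cutoff for calling a site bound (1.5, as in the caption)
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [bound, total]
    for accessibility, enrichment in sites:
        # Count how many edges the accessibility value has reached.
        idx = sum(accessibility >= edge for edge in bin_edges)
        counts[idx][1] += 1
        if enrichment > threshold:
            counts[idx][0] += 1
    return {i: bound / total for i, (bound, total) in sorted(counts.items())}


# Hypothetical sites: two in closed chromatin, four in open chromatin.
toy_sites = [(0.1, 0.5), (0.2, 2.0), (5.0, 3.0), (6.0, 1.0), (7.0, 2.0), (8.0, 1.6)]
fractions = bound_fraction_by_accessibility(toy_sites, bin_edges=[1.0])
```

With real data, the edges would follow whatever accessibility scale the DNase-seq signal is reported in; the single edge used here only separates a "closed" bin from an "open" bin for illustration.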
Figure 3
Identification of a chromatin-dependent motif syntax using a CNN (A) Scheme of the applied CNN for predicting the primary 6 h binding of NGN2 and MyoD1, using DNA sequence and continuous 0 h DHS (DNase I) signal as input. (B) Contrasting frequency-based motif representation (PFM), informing about the frequency of the DNA bases in the enriched motifs versus contribution weight matrices (CWM), identified by the CNN model, and informing about the contribution of the DNA bases to the degree of binding enrichments. DNA sequences are represented with the letters A, C, G, T, and accessibility with the letter D. Differences in PFM and CWM highlight the binding strength contribution of flanking nucleotides and less frequent central nucleotides. The “D” profile indicates high contribution of accessibility directly overlapping with the E-box motif and lower contribution of the surrounding. (n = 26,128 E-box instances for NGN2 and 25,327 for MyoD1). (C) Calculation of NGN2 and MyoD1’s relative affinities to E-box motif variants independent of the genomic sequence context at any given accessibility using the CNN model on synthetic sequences (average gained binding strength upon placing the same motif across various backgrounds with simulated accessibility). The predicted binding strengths across different 0 h accessibility are represented from blue to red. (D) Single-locus display of DNA sequence, 0 h accessibility profile and their deconvolved contribution scores for NGN2 6 h binding prediction, reporting on the features accounting for the binding strength. (E) Predicted NGN2 binding strength contribution of the flanking bases (5′-flanking in red, 3′ in green) for CAGATG motif. (F) Contribution of the flanking bases to the binding prediction of E-box motif variants (central nucleotides in blue, 3′-flanking nucleotides in green). 
(G) Comparison of pre-induction (0 h) MNase and DNase I digestion profiles at NGN2 cognate-motif sites residing in open chromatin, grouped as sites that will be bound or not bound by NGN2 upon 6 h induction. (H) DNA sequence motif (seqlet, identified by TF-MoDISco), containing two E-box motifs, that is informative for NGN2 binding prediction at bound sites residing in closed chromatin. (I) Fraction of NGN2 bound-closed sites with two NGN2 motifs at various base pair distances between motif centers. Inner: modeled NGN2 binding (AlphaFold3) on two 11-bp apart (highest binding potential) NGN2 motifs.
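The motif-spacing measurement in panel (I) can be sketched as follows, assuming the canonical E-box consensus CANNTG and a single-strand scan. Because CANNTG is its own reverse complement, one strand finds every site, but the central dinucleotide read from the other strand is the reverse complement (for example, GA reads as TC). The function names and the example sequence are hypothetical.

```python
import re
from itertools import combinations

# Lookahead so that overlapping E-boxes are also reported.
EBOX = re.compile(r"(?=(CA([ACGT]{2})TG))")


def find_eboxes(seq):
    """Return (start, central dinucleotide) for every CANNTG match in seq."""
    return [(m.start(1), m.group(2)) for m in EBOX.finditer(seq.upper())]


def center_distances(seq):
    """Pairwise distances in bp between E-box centers.

    All E-boxes are 6 bp long, so the distance between centers equals
    the distance between start positions.
    """
    starts = [start for start, _ in find_eboxes(seq)]
    return [b - a for a, b in combinations(starts, 2)]


# Hypothetical sequence with a CAGATG and a CAGCTG motif placed 11 bp apart.
example = "TTCAGATGAAAAACAGCTGTT"
sites = find_eboxes(example)
distances = center_distances(example)
```

Tabulating these distances over all bound-closed regions would give a spacing histogram of the kind shown in the panel; the region selection itself is outside this sketch.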
Figure 4
Motif variant-specific chromatin and transcription responses (A) Uniform manifold approximation and projection (UMAP) of relevant genomic regions (represented as single points, positioned based on similarities in accessibility and TF binding) visualizing initial accessibility, TF ChIP-seq and fold-changes in ATAC-seq, H3K27ac, and RNA Pol II ChIP-seq, colored from blue to red. Bound regions with low initial accessibility or factor-specific binding are outlined with dashed lines. Curved lines highlight NGN2-specific (green), MyoD1-specific (purple), and overlapping (brown) binding. The changes highlight increased activity in closed-bound sites, though with variable degrees (e.g., for NGN2 factor-specific sites, binding strength is graded from left to right and chromatin opening from top to bottom). (B) Example genomic locus, showing regions with similar 0 h accessibility and 6 h NGN2 binding, yet gaining different degrees of activity. (C) Contrasting the binding and activity on the motif variants using CNN model predictions on synthetic sequences. Predicted 6 h binding, change in ATAC-seq and H3K27ac ChIP-seq upon NGN2 induction at NGN2 E-box motifs with TA, GA, and GC central nucleotides across all accessibility-ranges are shown as heatmap from blue to red. The outlined box highlights the differences between the binding strength and its impact on chromatin for CAGATG versus CATATG motifs (at DHS 0.3, representing low initial accessibility with high dynamic range for gain of activity upon binding). (D and E) Contrasting predicted binding and accessibility-change upon NGN2 or MyoD1 induction at CAGATG or CAGCTG motif (DHS 0.3) depending on the flanking bases. (F) A library of barcoded transcriptional reporters, inserted stably in a defined locus, designed to report the effect of NGN2 motif variations on gene expression upon NGN2 induction. GA: single GA motif, GA + 40 + GA: two GA motifs with 40 bp between them. GA + 20 + GA: two GA motifs with 20 bp between them. 
GA + GA + GA: three GA motifs. GA + NN: one GA motif and another motif with the indicated central nucleotide and 3′-flanking nucleotides. Values correspond to differences in gene expression between respective construct relative to a scrambled motif control at the same time point post NGN2 induction (Figure S4H; Table S5). In (C)–(F), central nucleotides in blue, 5′-flanking in red, 3′-flanking in green.
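The idea of scoring motifs on synthetic sequences can be illustrated with a small sketch: a motif is placed into several background sequences, and the model's prediction with and without the motif is compared at a fixed simulated accessibility. The `toy_predict` function below is a stand-in invented for this example; it is not the CNN from the study, and the backgrounds and values are hypothetical.

```python
def insert_motif(background, motif, position=None):
    """Overwrite part of background with motif (centered by default), keeping length."""
    if position is None:
        position = (len(background) - len(motif)) // 2
    return background[:position] + motif + background[position + len(motif):]


def average_gain(predict, backgrounds, motif, accessibility):
    """Mean change in predicted binding when motif is placed into each background.

    predict : callable (sequence, accessibility) -> float; hypothetical model interface
    """
    gains = [
        predict(insert_motif(bg, motif), accessibility) - predict(bg, accessibility)
        for bg in backgrounds
    ]
    return sum(gains) / len(gains)


def toy_predict(sequence, accessibility):
    """Stand-in scorer: 2.0 per CAGATG occurrence plus the accessibility value."""
    return 2.0 * sequence.count("CAGATG") + accessibility


backgrounds = ["AAAAAAAAAA", "TTTTTTTTTT"]
gain = average_gain(toy_predict, backgrounds, "CAGATG", accessibility=0.3)
```

Averaging over many backgrounds is what makes the resulting score independent of any one genomic context; repeating the calculation across a range of accessibility values would give one row of a heatmap like the one in panel (C).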
Figure 5
Homodimer and heterodimer formation with other bHLH proteins increase the regulatory repertoire of NGN2 and MyoD1 (A) Mass spectrometric detection of proteins interacting with NGN2 (enrichment relative to GFP > 5 with −log10(p value) > 5). bHLH TFs are labeled in red, chromatin-associated proteins in blue (Table S8). (B) Scheme of DAP-seq to probe genomic DNA-binding preferences in vitro, for NGN2 and MyoD1 alone and in combination with their putative bHLH partners: TCF3 (present as two main isoforms: E12 and E47), TCF4, and TCF12. (C) Differential E-box k-mer enrichment analysis (using top 5,000 DAP-seq enriched peaks) among NGN2, MyoD1, and TCF homodimers and heterodimers. (D) Differential E-box k-mer enrichment analysis among NGN2-bound regions, grouped based on change in 6 h NGN2 binding upon decreasing TCF levels in comparison with control treatment. (KD: small interfering RNA (siRNA)-knockdown). (E) Images of 48 h NGN2-induced cells for inspection of cellular morphology (cytoplasmic GFP signal, segmented for axons, in green and cell bodies in blue) treated with control (non-targeting siRNA), Tcf3, Tcf4, and Tcf12 siRNA. Scale bar, 50 μm.
Figure 6
Reshaped accessibility landscape diverges NGN2 and MyoD1 genomic binding (A) UMAP of genomic regions (as in Figure 4A) displaying initial accessibility, NGN2 and MyoD1 binding at 6 and 24 h. Dashed lines outline NGN2 and MyoD1-specific regions at 6 h and new regions at 24 h. The decrease in binding signal at the overlapping regions (high 0 h DHS) and gain in new regions (low 0 h DHS) illustrate divergence of the binding patterns of NGN2 and MyoD1 through differentiation. (B) ATAC-seq fold-changes upon 24 h NGN2 and MyoD1 induction, visualized on UMAP, show loss of accessibility at co-bound high DHS-0 h regions and gain in new, low DHS-0 h regions. Outlines indicate 24 h-gained NGN2-binding sites in green, 24 h-gained MyoD1-binding sites in purple, and sites that gain accessibility but not NGN2 binding in blue. (C) Differential motif enrichment analysis (JASPAR TF-catalog) for NGN2 and MyoD1, contrasting their 6 versus 24 h binding. Top-two-enriched unique motifs and top-enriched E-box motifs for each time point are visualized. Combined peaks of 6 and 24 h binding are binned according to changes in their binding enrichment between 6 and 24 h (Figures S6C and S6D). (D) Genome browser tracks of NGN2 ChIP-seq, ATAC-seq, H3K27ac ChIP-seq, RNA Pol II ChIP-seq, and RNA-seq at the Ebf3 locus. (E) NGN2 motif density around EBF motifs that are newly made accessible upon 24 h NGN2 induction, sub-grouped as NGN2 bound and not bound at 24 h (corresponding to the outlined regions in (B), green versus blue). (F) Metaplots showing ATAC-seq signal pre- and post-NGN2 induction at 24 h, grouped as in (E). Regions with an EBF motif (blue) show chromatin opening. Regions with an EBF motif as well as an NGN2 motif, within 100 bp, (green) gain new NGN2 binding and high accessibility. (G) Schematic summary of dynamic NGN2/MyoD1 binding through differentiation.
Figure 7
The models, trained in mESCs, predict TF binding and activity in other cellular contexts (A) Heatmaps of MyoD1 ChIP-seq in mESCs (6 h) and C2C12 cells (24 h) upon induction, together with their pre-induction accessibility in mESCs (DNase-seq 0 h) and in C2C12 cells (ATAC-seq 0 h), visualized on combined MyoD1-binding sites. (B) Enriched TF-MoDISco E-box seqlets in MyoD1 bound sites in mESCs and C2C12 cells. (C) Venn diagram contrasting MyoD1 ChIP-seq peaks in mESCs and C2C12 cells illustrating low overlap among binding sites. (D) Schematic of CNN model transfer: model trained for TF binding in mESCs is applied to predict binding in other cell types (left). Correlations are shown between observed-MyoD1-binding in C2C12 cells and observed-MyoD1-binding in mESCs or predicted-MyoD1-binding in C2C12 cells by the model trained in mESCs (right). (E) Predicted NGN2 binding enrichments in all murine ENCODE cCREs (rows) depending on simulated initial DNA accessibility (columns), sorted based on NGN2 binding potential at low DHS (left). Top 5,000 tissue-specific cCREs with highest NGN2 binding potential at low DHS are contrasted with 2,000 randomly sampled tissue-specific cCREs from the bottom 50% of the left panel (middle). Accessibility (DNase-seq) of the cCREs, in top-enriched ENCODE tissue profiles and profiles closest to background (right). The comparison indicates that cCREs, with high NGN2 binding potential, gain activity in cerebellum and brain. Example cCREs proximal to the genes upregulated in mESCs upon NGN2 induction are labeled. (F) An example GWAS-defined SNP (rs58130172) changing an E-box variant, in a tissue-specific human cCRE, active in brain. Scheme describes the SNP location at the DISP3 gene and the affected sequence (top). Predicted NGN2 binding and activity change at the reference and alternative cCRE sequence depending on initial accessibility (bottom). (G) Measured effect of the rs58130172 SNP on DISP3 gene expression retrieved from eQTL data.
