Genome Res. 2018 Feb;28(2):171-181.
doi: 10.1101/gr.226530.117. Epub 2018 Jan 5.

Local sequence features that influence AP-1 cis-regulatory activity

Hemangi G Chaudhari et al. Genome Res. 2018 Feb.

Abstract

In the genome, most occurrences of transcription factor binding sites (TFBS) have no cis-regulatory activity, which suggests that flanking sequences contain information that distinguishes functional from nonfunctional TFBS. We interrogated the role of flanking sequences near Activator Protein 1 (AP-1) binding sites that reside in DNase I Hypersensitive Sites (DHS) and regions annotated as Enhancers. In these regions, we found that sequence features directly adjacent to the core motif distinguish high from low activity AP-1 sites. Some nearby features are motifs for other TFs that genetically interact with the AP-1 site. Other features are extensions of the AP-1 core motif, which cause the extended sites to match motifs of multiple AP-1 binding proteins. Computational models trained on these data distinguish between sequences with high and low activity AP-1 sites and also predict changes in cis-regulatory activity due to mutations in AP-1 core sites and their flanking sequences. Our results suggest that extended AP-1 binding sites, together with adjacent binding sites for additional TFs, encode part of the information that governs TFBS activity in the genome.

Figures

Figure 1.
Comparison of 41 LOW activity and 40 HIGH activity sequences. (A) Expression distribution for selected AP-1-containing sequences. HIGH activity sequences (orange) drive stronger expression than LOW sequences (blue). (B) Distributions of AP-1 motif scores for HIGH (orange) and LOW (blue) sequences. Scores were derived using FIMO (Grant et al. 2011) from the JUNB position weight matrix (PWM) in the JASPAR database (Bryne et al. 2008) (inset). (C) Dependency of expression on intact AP-1 binding sites. Expression driven by wild-type sequences (x-axis) plotted versus expression driven by sequences with inactivated AP-1 sites (y-axis). Most points are below the diagonal (solid black line), indicating the importance of an intact AP-1 binding site for cis-regulatory activity. (D) Schematic shows five different length variants created for each sequence. (E) HIGH and LOW sequences contain regulatory information near the AP-1 core motif. Box plots for the activities of 40 HIGH (orange) and 41 LOW (blue) sequences are shown for different length variants. (x-axis) Total length of sequence including the AP-1 core; (y-axis) expression measured in MPRA assay. HIGH and LOW sequences are significantly different at 50 bp (Wilcoxon test, P = 1.86 × 10⁻⁹).
Figure 2.
Saturation mutagenesis of AP-1 dependent elements. (A,B) Results from saturation mutagenesis of an AP-1-dependent sequence: (x-axis) nucleotide position in the element; (vertical gray box) position of the AP-1 core motif; (y-axis) expression of point mutants relative to the parental sequence. Point mutations were created in the context of either the wild-type AP-1 site parent (blue) or the AP-1mut site parent (red). Substitutions that were not significantly different from wild-type are faded. The dashed box in A shows a cluster of mutants that affect expression only when the wild-type AP-1 binding site is present, and are thus designated as “interacting” substitutions. The dashed box in B shows a cluster of mutants in the first 10 bp that reduce expression in both the WT and AP-1mut background; thus they contribute to the activity of the sequence “independent” of AP-1. (C) The spatial distributions of independent (gray) and interacting (tan) features in the 20 sequences subjected to saturation mutagenesis. Interacting features tend to occur closer to the AP-1 core motif.
Figure 3.
Identities and locations of potential independent and interacting TFs: (x-axis) position along regulatory sequence; (y-axis) transcription factor identity. Each row represents one TF, and points depict interacting (▴) or independent (•) substitutions under the TF motif at the indicated position on the x-axis. Colors represent the regulatory sequence in which the TF was present. When multiple TFs with similar motifs matched a single substitution pattern, one representative TF was chosen.
Figure 4.
K-mers that distinguish HIGH and LOW DHS are enriched within 10 bp of the AP-1 core motif. (A) Expression distribution for 5000 sequences in DHS regions containing AP-1 sites. (x-axis) Log2(RNA/DNA) counts for barcodes representing particular cis-regulatory sequences (left). The top 1000 sequences are annotated as HIGH (orange), and the bottom 1000 sequences are annotated as LOW (blue). Motifs derived from HIGH and LOW sequences using MEME motif discovery tools (right). (B) gkm-SVM distinguishes between HIGH and LOW sequences within DHS. Precision-recall curve for a 10-mer gkm-SVM model trained on HIGH and LOW sequences (AUC = 0.91). Error bars show standard error from fivefold cross-validation. (C) K-mers that distinguish HIGH and LOW sequences overlap the AP-1 binding site. (x-axis) Position of the center of the k-mer along regulatory element in bp; (gray box) position of the AP-1 core motif; (y-axis) k-mer weight from gkm-SVM in B. Each point is an individual k-mer, and the size of the point denotes the number of sequences containing the k-mer. The color of the point indicates whether the k-mer was found in HIGH (orange) or LOW (blue) sequence. The top 400 k-mers, 200 with positive weights and 200 with negative weights, are shown. (D) In silico deletion experiment also highlights that most informative k-mers overlap with the AP-1 core motif. A 10-bp region of every sequence was masked (horizontal black lines), and a 10-mer gkm-SVM model was refit. The x-axis shows the position of the masked segment along regulatory elements, and the y-axis shows the area under precision-recall curve from the resulting model. The gray box depicts the position of the AP-1 core motif, and the red line connecting the centers of the black bars highlights the trend of AUC values across the sequence. (E) Specification of HIGH and LOW groups lies within the central 12 bp. Sequences were shortened by removing one base from both ends and a 6-mer gkm-SVM model was refit: (x-axis) length of the shortened sequence; (y-axis) area under precision-recall curve. The red line connecting the centers of the points highlights the trend of AUC values across the sequence.
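The sequence manipulations described in panels D and E can be illustrated with a minimal sketch. It only generates the masked and shortened sequence variants; refitting a gkm-SVM on each variant set is not shown. The function names, the use of "N" as the mask character, the step size between masked windows, and the example sequence are all assumptions made for illustration and are not taken from the paper.

```python
def masked_variants(seq, window=10, step=10, mask_char="N"):
    """Return (start, masked_sequence) pairs, each with one window masked.

    The window size follows the 10-bp mask in panel D; the step size and
    the mask character are assumptions.
    """
    variants = []
    for start in range(0, len(seq) - window + 1, step):
        masked = seq[:start] + mask_char * window + seq[start + window:]
        variants.append((start, masked))
    return variants


def trimmed_variants(seq, min_length=6):
    """Return progressively shorter sequences, as in panel E.

    Each step removes one base from both ends. min_length is an assumed
    stopping point (a 6-mer model needs at least 6 bases).
    """
    variants = [seq]
    current = seq
    while len(current) - 2 >= min_length:
        current = current[1:-1]
        variants.append(current)
    return variants


# Hypothetical 30-bp sequence with a TGACTCA core near the center.
example = "ACGTACGTACGTTGACTCAACGTACGTACG"

for start, masked in masked_variants(example):
    print(start, masked)

for shortened in trimmed_variants(example)[:3]:
    print(len(shortened), shortened)
```

In an analysis of this kind, each set of variants would be passed to the same classifier pipeline, and the resulting area under the precision-recall curve would be plotted against mask position or sequence length.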
Figure 5.
gkm-SVM scores quantitatively predict expression of wild-type sequences and effect of mutations. (A) gkm-SVM classifier trained on HIGH and LOW DHS sequences quantitatively predicted expression of sequences from the first library: (x-axis) expression of 81 cis-regulatory sequences; (y-axis) SVM score from the 10-mer gkm-SVM model in Figure 4B. (B) gkm-SVM classifier trained on HIGH and LOW DHS sequences accurately predicted the effect of substitutions in the AP-1 core motif and 2 bp flanking the AP-1 core tested in the saturation mutagenesis library. Data for substitutions in one sequence (Chr 3: 128734834–128734883) are shown: (x-axis) expression of sequences containing one substitution each; (y-axis) SVM score from 10-mer gkm-SVM model in Figure 4B. The model predicted loss in expression from wild-type sequence (black) when the substitutions are made in the AP-1 core (orange), and the effect of substitutions in 2 bp flanking the AP-1 site (red). Most substitutions outside of the core +2 bp flank (blue) have high expression and are not well predicted by the SVM. (C) Predictive power of the gkm-SVM model is inversely proportional to the absolute distance from AP-1 binding site. Substitutions from all 20 sequences in the saturation mutagenesis library were grouped by their distance from the AP-1 binding site: (x-axis) distance of the group of substitutions to the AP-1 core motif center; (y-axis) correlation coefficient between change in expression and change in SVM score compared to wild-type sequence for all substitutions in a group.
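The grouping in panel C can be sketched as follows: substitutions are binned by absolute distance from the AP-1 core center, and within each bin the change in expression is correlated with the change in model score. This is a minimal illustration only. The record layout, the bin width, the minimum group size, and the choice of Pearson correlation are assumptions; the caption does not specify them.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_by_distance(records, bin_width=5, min_group_size=3):
    """Correlate expression change with score change per distance bin.

    records: iterable of (distance_bp, delta_expression, delta_score),
    where both deltas are relative to the wild-type sequence.
    Returns {bin_start_bp: correlation}. bin_width and min_group_size
    are hypothetical parameters.
    """
    groups = defaultdict(list)
    for distance, d_expr, d_score in records:
        groups[abs(distance) // bin_width].append((d_expr, d_score))

    result = {}
    for bin_index, pairs in sorted(groups.items()):
        if len(pairs) < min_group_size:
            continue  # too few substitutions for a meaningful correlation
        xs, ys = zip(*pairs)
        result[bin_index * bin_width] = pearson(xs, ys)
    return result
```

A decreasing trend in the returned values as the bin start increases would correspond to the pattern the caption describes, with the model predicting substitution effects best near the AP-1 site.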
Figure 6.
AP-1 sites that bind multiple AP-1 binding proteins in forward and reverse orientations drive high activity. (A) A logistic regression model with additive terms for motif matches for 28 TFs can distinguish between HIGH and LOW groups of DHS AP-1-containing sequences. Precision-recall curves for models with PWMs for 28 TFs (black, AUC = 0.9), six AP-1 PWMs (JUNB, MAF::NFE2, NFE2l2, FOSL1, MAFK, NFE2; blue, AUC = 0.86), and JUNB (red, AUC = 0.77) are shown. Error bars denote standard error from fivefold cross-validation. Motif matches in both orientations were included. (B) Ignoring orientation information of motifs reduced the predictive power of logistic regression models, especially for a model with JUNB alone. Precision-recall curves for models with PWMs for 28 TFs (black, AUC = 0.89), six AP-1 PWMs (blue, AUC = 0.84), and JUNB (red, AUC = 0.64) are shown again. (C) Expression of DHS sequences containing AP-1 sites in MPRA is correlated with genomic binding of AP-1 TFs to those sequences. (x-axis) Number of peaks observed in total in five ChIP-seq experiments (JUNB, MAFF, MAFK, NFE2 and FOSL1) (The ENCODE Project Consortium 2012); (y-axis) observed log2 (RNA/DNA) counts of cis-regulatory sequences. Expression distributions for sequences with three or more ChIP-seq peaks were not significantly different from each other (Wilcoxon test, Bonferroni-corrected P > 0.05); all other distributions were significantly different from each other.
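To make the idea of orientation-aware motif features concrete, the sketch below scores a sequence against a PWM on the forward strand and on its reverse complement and keeps the two best scores as separate features. It is a simplified log-odds scan, not FIMO and not the authors' pipeline. The toy frequency matrix is built from the consensus TGACTCA for illustration only and is not the JASPAR JUNB matrix; the uniform background, the pseudocount, and all function names are assumptions.

```python
from math import log2

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_frequencies(consensus, match=0.85):
    """Hypothetical frequency matrix: `match` for the consensus base, rest split evenly."""
    other = (1 - match) / 3
    return [{b: (match if b == c else other) for b in "ACGT"} for c in consensus]


def log_odds_matrix(freqs, background=0.25, pseudocount=0.01):
    """Convert per-position base frequencies to log2 odds against a uniform background."""
    matrix = []
    for column in freqs:
        total = sum(column.get(b, 0.0) + pseudocount for b in "ACGT")
        matrix.append({
            b: log2(((column.get(b, 0.0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def best_score(seq, matrix):
    """Highest log-odds score over all windows; -inf if seq is shorter than the motif."""
    width = len(matrix)
    scores = [
        sum(matrix[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


def orientation_features(seq, matrix):
    """Return (best forward score, best reverse-complement score) as two features."""
    return best_score(seq, matrix), best_score(reverse_complement(seq), matrix)


pwm = log_odds_matrix(toy_frequencies("TGACTCA"))
forward, reverse = orientation_features("AAATGACTCAAAA", pwm)
print(forward > reverse)
```

Keeping the two scores as separate inputs to a classifier retains orientation information, whereas taking only their maximum would discard it; panel B reports that discarding orientation lowered predictive power, most clearly for the JUNB-only model.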

References

    1. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37: W202–W208.
    2. Biddie SC, John S, Sabo PJ, Thurman RE, Johnson TA, Schiltz RL, Miranda TB, Sung M-H, Trump S, Lightman SL, et al. 2011. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol Cell 43: 145–155.
    3. Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A. 2008. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res 36: D102–D106.
    4. Chen X, Ji Z, Webber A, Sharrocks AD. 2016. Genome-wide binding studies reveal DNA binding specificity mechanisms and functional interplay amongst Forkhead transcription factors. Nucleic Acids Res 44: 1566–1578.
    5. Chinenov Y, Kerppola TK. 2001. Close encounters of many kinds: Fos-Jun interactions that mediate transcription regulatory specificity. Oncogene 20: 2438–2452.
