PLoS One. 2010 Jul 6;5(7):e11425.

doi: 10.1371/journal.pone.0011425.

MER41 repeat sequences contain inducible STAT1 binding sites

Christoph D Schmid¹, Philipp Bucher

Affiliation

¹ Swiss Institute of Bioinformatics, Ecole Polytechnique Fédérale de Lausanne SV ISREC (The Swiss Institute for Experimental Cancer Research) GR-BUCHER, Lausanne, Switzerland. Christoph.Schmid@unibas.ch

PMID: 20625510
PMCID: PMC2897888
DOI: 10.1371/journal.pone.0011425

Christoph D Schmid et al. PLoS One. 2010.

Abstract

Chromatin immunoprecipitation combined with massively parallel sequencing (ChIP-seq) is becoming the standard approach to study interactions of transcription factors (TFs) with genomic sequences. Using public STAT1 ChIP-seq data sets as an example, we present novel approaches for the interpretation of ChIP-seq data. We compare recently developed approaches to determine STAT1 binding sites from ChIP-seq data. Assessing the content of the established consensus sequence for STAT1 binding sites, we find that the use of "negative control" ChIP-seq data fails to provide substantial advantages. We derive a single refined probabilistic model of STAT1 binding sequences from these ChIP-seq data. Contrary to previous claims, we find no evidence that STAT1 binds to multiple distinct motifs upon interferon-gamma stimulation in vivo. While a large majority of genomic sites with high ChIP-seq signal are associated with a nucleotide sequence resembling a STAT1 binding site, only a very small subset of the more than 5 million potential STAT1 binding sites in the human genome is covered by ChIP-seq data. Furthermore, a surprisingly large fraction of the ChIP-seq signal (5%) is absorbed by a small family of repetitive sequences (MER41). The observation that activated STAT1 protein binds to a specific repetitive element bolsters similar reports concerning p53 and other TFs and strengthens the notion that repeats are involved in gene regulation. Incidentally, MER41 repeats are specific to primates; consequently, regulatory mechanisms in the IFN-STAT pathway might differ fundamentally between primates and rodents. From a methodological standpoint, the presence of large numbers of nearly identical binding sites in repetitive sequences may lead to wrong conclusions about the intrinsic binding preferences of TFs, as illustrated by the spacing analysis of STAT1 tandem motifs. Therefore, ChIP-seq data should be analyzed independently within repetitive and non-repetitive sequences.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Content of binding sites to assess STAT1 ChIP-seq sets.**
The frequencies of binding sequences TCCNNNGAA in 4 sets of genomic loci derived from STAT1 ChIP-seq data. ‘ChIP_peak’ contains STAT1 sites as defined in this study; FindPeaks (Robertson et al.), SISSR (Jothi et al.), and MACS (Zhang et al.) are derived from analyses of the identical ChIP-seq data set; and PeakSeq (Rozowsky et al.) uses an independent STAT1 ChIP-seq data set. The sets contain the top 3000 (left panel) or top 30 000 sites (center and right panels) of the corresponding data. The literature set consists of 37 STAT1 sites collected in the ORegAnno database. The occurrence of GAS (gamma-activated sequence) motifs is assessed in 100 bp sequence windows (SSA web server http://www.isrec.isb-sib.ch/ssa/). Plotted points represent the center positions of windows relative to the position of the putative STAT1 site predicted from ChIP-seq data. The right panel displays the frequency of the ISRE PWM (Transfac entry M00258).
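The windowed counting described in this caption can be illustrated with a minimal sketch. It is not the SSA server's implementation: the function names, the 50 bp step, and the input format (equal-length uppercase sequences centered on the predicted site) are assumptions made for illustration. Only the consensus string and the 100 bp window size come from the caption.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def consensus_regex(consensus):
    """Compile a regex that matches the consensus or its reverse complement."""
    revcomp = consensus.translate(COMPLEMENT)[::-1]
    variants = sorted({consensus, revcomp})  # a palindromic consensus gives one variant
    return re.compile("|".join("".join(IUPAC[c] for c in v) for v in variants))

def window_frequencies(sequences, consensus, window=100, step=50):
    """Fraction of sequences with at least one match lying entirely in each window.

    Keys are window-center positions relative to the sequence center
    (hypothetical layout: every sequence is centered on a predicted site).
    """
    pattern = consensus_regex(consensus)
    length = min(len(s) for s in sequences)
    freqs = {}
    for start in range(0, length - window + 1, step):
        # pos/endpos restrict the search to the current window
        hits = sum(1 for s in sequences if pattern.search(s, start, start + window))
        freqs[start + window // 2 - length // 2] = hits / len(sequences)
    return freqs
```

Calling `window_frequencies(seqs, "TCCNNNGAA")` on sequences extracted around predicted sites gives one frequency per window position, which corresponds to one plotted point per window in a figure of this kind.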

**Figure 2. Descriptors of binding preferences of STAT1.**
Sequence logos (http://weblogo.berkeley.edu/) visualize the information content of occurrence frequencies of the 4 nucleotides by variable letter sizes at each position. Sequence logos of MAMOT motifs derived from in vivo ChIP-seq sites (a), repeat-filtered in vivo ChIP-seq sites (b), and in vitro SELEX sites (c). Note that the symmetrical half-sites, with the presumed STAT1-interacting nucleotides, are almost identical in (a) to (c). (d) Alignment of motif M2 (Jothi et al.) with the reverse complement of the consensus sequence of repetitive element MER41B (Repbase). Table (e) specifies the PWM for STAT1 as visualized in panel (a), each number indicating a score for matching nucleotides at corresponding positions.
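As an illustration of how a PWM of the kind listed in table (e) assigns a score to a candidate site, the sketch below sums one per-position score for each nucleotide and scans both strands. The three-column `TOY_PWM` is a made-up placeholder, not the STAT1 matrix from the paper, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def pwm_score(site, pwm):
    """Sum the score of the observed nucleotide at each PWM position."""
    return sum(column[base] for base, column in zip(site, pwm))

def best_hit(seq, pwm):
    """Return (score, strand, offset) of the best-scoring window on either strand.

    For the '-' strand the offset refers to the reverse-complemented sequence.
    """
    width = len(pwm)
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            score = pwm_score(s[i:i + width], pwm)
            if best is None or score > best[0]:
                best = (score, strand, i)
    return best

# Toy matrix (assumption): favors the trinucleotide TTC.
TOY_PWM = [
    {"A": 0, "C": 0, "G": 0, "T": 2},
    {"A": 0, "C": 0, "G": 0, "T": 2},
    {"A": 0, "C": 2, "G": 0, "T": 0},
]
```

Scanning the reverse complement matters for alignments such as the one in panel (d), where the match to the repeat consensus is found on the opposite strand.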

**Figure 3. Estimation of PWM cutoff score from fraction of occupied STAT1 binding sites.**
5 454 192 potential STAT1 binding sites in the human genome are grouped according to their PWM score. In general, the number of occurrences decreases with increasing PWM score, e.g. 664 981 sites for PWM score 20, 213 877 for 30, and 7471 for 42. Missing bars are due to a lack of combinations generating the corresponding PWM score. Occupation is defined as the fraction of sites covered by more than 5 ChIP fragments (within +/−100 bp). Plotted are data derived from ChIP-seq experiments with unstimulated (open boxes) or IFN-γ-stimulated HeLa cells (filled boxes). The dotted line represents the occupation frequency (0.015) at a collection of 4819 random genomic sites in stimulated cells.
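The occupation measure defined in this caption can be sketched as follows, assuming each site has already been reduced to a pair of PWM score and the number of ChIP fragments within +/−100 bp. The input format and function name are assumptions; only the "more than 5 fragments" criterion is taken from the caption.

```python
from collections import defaultdict

def occupation_by_score(sites, min_fragments=5):
    """Fraction of sites per PWM score that are covered by more than
    `min_fragments` ChIP fragments.

    `sites` is an iterable of (pwm_score, fragment_count) pairs
    (hypothetical pre-computed input).
    """
    totals = defaultdict(int)
    occupied = defaultdict(int)
    for score, fragments in sites:
        totals[score] += 1
        if fragments > min_fragments:
            occupied[score] += 1
    return {score: occupied[score] / totals[score] for score in sorted(totals)}
```

The resulting per-score fractions can then be compared against a background level, such as the 0.015 occupation frequency at random genomic sites quoted in the caption.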

**Figure 4. Spacings of GAS tandems differ in repetitive sequences.**
Putatively ‘high-affinity’ tandem GAS (average PWM score >30) are classified according to the spacing between the centers of the two sites (x axis) and the induction ratio (y axis). For each spacing class, histograms representing the frequencies of the corresponding log ratios are displayed in vertical orientation. Red indicates location within repetitive sequence annotations, and blue indicates tandem GAS within non-repetitive sequences. Two histograms at the bottom summarize the data above for sites with induced binding (log ratio >2). Within non-repetitive sequences, spacings of 18–22 bp are moderately enriched in induced GAS tandems. For induced sites within repetitive sequences, a clear predominance of the 21 bp spacing is observed, mostly associated with MER41 repeats.
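A rough sketch of this kind of spacing analysis is given below. The caption does not state the logarithm base, any pseudocount, or the maximum spacing considered, so the base-2 logarithm, the pseudocount of 1, and the 30 bp limit used here are all assumptions, as are the function names.

```python
import math
from collections import Counter

def tandem_spacings(centers, max_spacing=30):
    """Pair adjacent site centers and keep pairs no further apart than max_spacing."""
    centers = sorted(centers)
    return [(a, b, b - a) for a, b in zip(centers, centers[1:]) if b - a <= max_spacing]

def log_ratio(stimulated, unstimulated, pseudocount=1.0):
    """log2 induction ratio of tag counts; the pseudocount avoids division by zero."""
    return math.log2((stimulated + pseudocount) / (unstimulated + pseudocount))

def induced_spacing_histogram(tandems, threshold=2.0):
    """Count spacings among tandems whose log ratio exceeds the threshold.

    `tandems` is an iterable of (spacing, stimulated_count, unstimulated_count).
    """
    return Counter(sp for sp, stim, unstim in tandems if log_ratio(stim, unstim) > threshold)
```

Running `induced_spacing_histogram` separately on tandems inside and outside repeat annotations would give two summary histograms of the kind shown at the bottom of the figure.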

**Figure 5. STAT1 sites unoccupied in HeLa cells nevertheless show increased phylogenetic conservation.**
STAT1 sites within non-repetitive sequences are classified according to their occupation by ChIP-seq tags and to their location, either distant from (>1 kb; solid lines) or close to annotated TSS (dotted lines). For each of the sets (Table 3), the average PhastCons scores are computed at positions relative to the predicted STAT1 sites. In general, STAT1 sites display a narrow increase in the average conservation score (averages of PhastCons scores: genome-wide 0.07; at TSS 0.28). Closely neighboring TSS increase the average conservation, and higher ChIP-seq occupation also tends to coincide with increased conservation at STAT1 sites. On the other hand, TSS-associated STAT1 sites that lack any ChIP-seq tags still display clearly elevated average conservation. This observation may suggest limited predictability of TF binding in a specific cell type, even if information on nucleotide sequences with preferred binding (PWM) and on phylogenetic conservation is combined.
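The averaging of conservation scores at positions relative to predicted sites can be sketched as below. The flat list of per-base scores and the list of site indices are a simplified, hypothetical input; reading real PhastCons tracks is outside the scope of this sketch.

```python
def average_profile(scores, site_centers, flank=50):
    """Average per-base score at each offset from -flank to +flank around sites.

    `scores` is a list of per-base conservation values for one sequence and
    `site_centers` are indices into it (hypothetical input layout).
    Sites too close to either end to fill the whole window are skipped.
    """
    sums = [0.0] * (2 * flank + 1)
    used = 0
    for center in site_centers:
        if center - flank < 0 or center + flank >= len(scores):
            continue
        for offset in range(-flank, flank + 1):
            sums[offset + flank] += scores[center + offset]
        used += 1
    return [total / used for total in sums] if used else []
```

Computing one such profile per site class (for example, occupied versus unoccupied, TSS-proximal versus distal) yields curves that can be compared in the manner described in the caption.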

References

1. Tompa M, Li N, Bailey TL, Church GM, De Moor B, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–144.
2. Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33:4899–4913.
3. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007;4:651–657.
4. Schmid CD, Bucher P. ChIP-Seq Data Reveal Nucleosome Architecture of Human Promoters. Cell. 2007;131:831–832.
5. Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, et al. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008;24:1729–1730.
