Nucleic Acids Res. 2010 Jun;38(11):e126.
doi: 10.1093/nar/gkq217. Epub 2010 Apr 7.

De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis

Valentina Boeva et al. Nucleic Acids Res. 2010 Jun.

Abstract

Dramatic progress in the development of next-generation sequencing technologies has enabled accurate genome-wide characterization of the binding sites of DNA-associated proteins. This technique, known as ChIP-Seq, combines chromatin immunoprecipitation with massively parallel DNA sequencing. Other published tools that predict binding sites from ChIP-Seq data use only the positional information of mapped reads. In contrast, our algorithm MICSA (Motif Identification for ChIP-Seq Analysis) combines this positional information with information on motif occurrences to better predict binding sites of transcription factors (TFs). We demonstrated the greater accuracy of MICSA with respect to several other tools by running them on datasets for the TFs NRSF, GABP, STAT1 and CTCF. We also applied MICSA to a dataset for the oncogenic TF EWS-FLI1. We discovered >2000 binding sites and two functionally different binding motifs. We observed that EWS-FLI1 can activate gene transcription when (i) its binding site is located in close proximity to the gene transcription start site (up to approximately 150 kb), and (ii) the site contains a microsatellite sequence. Furthermore, we observed that sites without microsatellites can also induce regulation of gene expression, positively as often as negatively, and at much larger distances (up to approximately 1 Mb).

Figures

Figure 1.
Main steps of the MICSA pipeline.
Figure 2.
Performance comparison of MICSA with FindPeaks, PeakSeq, QuEST and uSeq. As a positive set of NRSF-binding sites we used (A) the 3000 best matches of the canonical NRSF matrix in the human genome, (B) the 500 best matches of the canonical NRSF matrix in the human genome, (C) 83 q-PCR verified NRSF-binding sites in the human genome. Peaks extracted by each algorithm were ranked according to in-built scores or P-values. For each number of top peaks, the frequency of identified positive sites among them was plotted. ‘ToolName^’ means that the default parameters of the tool were modified to make it report more peaks.
Figure 3.
Binding motifs identified by MICSA in ChIP-Seq data for GABP, STAT1 and CTCF resemble canonical motifs. (A) GABP motif logos [Weblogos (32)], canonical motif from (Genomatix, http://www.genomatix.de), (B) STAT1 motif logos (21), (C) motif logos for CTCF (22).
Figure 4.
Motifs identified by MICSA in EWS-FLI1 ChIP-Seq data resemble but are not identical to the canonical binding motif of FLI1. (A) Consensus motifs identified by MICSA [Weblogos (32)], (B) canonical motif for ETS family of TFs including the TF FLI1 (26), (C) canonical motif for the TF FLI1 (27).
Figure 5.
Histogram of distances between predicted/random peaks and genes up/downregulated by EWS-FLI1. (A) Predicted sites containing (GGAA)n microsatellites; (B) ETS sites (sites without microsatellites). EWS-FLI1 binding to GGAA microsatellites results in significant expression activation of neighboring genes. EWS-FLI1 binding to single ETS sites can produce both negative and positive effects on transcription of neighboring genes. The P-values were evaluated directly by Monte-Carlo simulations of random peaks. Distances from the TSSs of modulated genes to random peaks (iterative trials) and to predicted sites were calculated. Each P-value is the probability of obtaining at least the observed number of distances falling within a given 50-kb window, under the hypothesis that peaks are randomly distributed and their coordinates are independent of the coordinates of TSSs of EWS-FLI1 modulated genes. Bars above the dashed line correspond to a P-value <0.05.
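
The Monte-Carlo test described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' code, and several details are assumptions made only for illustration: coordinates lie on a single linear genome of hypothetical length `genome_length`, each TSS is paired with its nearest peak, random peaks are drawn uniformly, and the P-value uses the common (hits + 1) / (trials + 1) estimator. All function and parameter names are hypothetical.

```python
import random
from bisect import bisect_left

WINDOW = 50_000  # 50-kb distance bins, as in the Figure 5 caption


def nearest_distance(sorted_peaks, pos):
    """Distance from pos to the closest peak in a non-empty sorted list."""
    i = bisect_left(sorted_peaks, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_peaks[max(i - 1, 0): i + 1]
    return min(abs(pos - p) for p in candidates)


def count_in_window(peaks, tss_list, window_index):
    """Number of TSSs whose nearest-peak distance falls in the given 50-kb bin."""
    sorted_peaks = sorted(peaks)
    lo = window_index * WINDOW
    hi = lo + WINDOW
    return sum(lo <= nearest_distance(sorted_peaks, t) < hi for t in tss_list)


def monte_carlo_p_value(peaks, tss_list, genome_length, window_index,
                        trials=1000, seed=0):
    """Estimate P(count >= observed) when peaks are placed uniformly at random.

    Assumption: uniform placement stands in for the null hypothesis that peak
    coordinates are independent of the TSS coordinates.
    """
    rng = random.Random(seed)
    observed = count_in_window(peaks, tss_list, window_index)
    hits = 0
    for _ in range(trials):
        random_peaks = [rng.randrange(genome_length) for _ in peaks]
        if count_in_window(random_peaks, tss_list, window_index) >= observed:
            hits += 1
    # Add-one correction (an assumption) keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

Calling `monte_carlo_p_value(peaks, tss_list, genome_length, window_index=0)` tests the first bin (0 to 50 kb); repeating the call for each `window_index` gives one P-value per histogram bar. A real analysis would handle chromosomes separately and might restrict random peaks to mappable regions.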

References

    1. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837.
    2. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316:1497–1502.
    3. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods. 2007;4:651–657.
    4. Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics. 2004;83:349–360.
    5. Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008;24:1729–1730.

Publication types