PLoS One. 2014 Jan 24;9(1):e86044.
doi: 10.1371/journal.pone.0086044. eCollection 2014.

A new exhaustive method and strategy for finding motifs in ChIP-enriched regions

Caiyan Jia et al. PLoS One. 2014.

Abstract

ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows genome-wide identification of protein-DNA interactions. This technology poses new challenges for developing motif-finding algorithms and methods that determine exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic and exhaustive search algorithms are largely limited to identifying short (l, d) motifs (l ≤ 10, d ≤ 2) in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (l, d) motifs in DNA sequences. In conjunction with this method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications to several ChIP data sets, including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets, have demonstrated that the proposed method finds these motifs with high efficiency and accuracy. The source code for FMotif is available at http://211.71.76.45/FMotif/.
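
The abstract does not define an (l, d) motif. In the planted-motif literature, the usual reading is a pattern of length l that occurs in the input sequences with at most d mismatches. Under that assumption, the sketch below shows the naive exhaustive search that such methods aim to improve on. It is not FMotif's algorithm; the function names, the `min_support` parameter, and the toy sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(pattern, seq, d):
    """True if some window of seq matches pattern with at most d mismatches."""
    l = len(pattern)
    return any(hamming(pattern, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def find_ld_motifs(seqs, l, d, min_support=None):
    """Brute force: test all 4**l patterns against every sequence.

    min_support is an illustrative parameter (not from the paper): the
    minimum number of sequences that must contain the pattern. It
    defaults to all sequences.
    """
    if min_support is None:
        min_support = len(seqs)
    motifs = []
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        support = sum(occurs(pattern, s, d) for s in seqs)
        if support >= min_support:
            motifs.append(pattern)
    return motifs


if __name__ == "__main__":
    toy = ["TTGACGTT", "AAGACCTA", "CGGTCGAC"]  # made-up sequences
    print(find_ld_motifs(toy, l=4, d=1))
```

The pattern loop grows as 4^l, which is one reason naive enumeration is practical only for short motifs.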

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Motifs in 12 mouse ES Cell ChIP-seq data sets.
FMotif was tested using mouse ChIP-seq data sets for 12 DNA-binding TFs (CTCF, cMyc, Esrrb, Klf4, Nanog, nMyc, Oct4, Smad1, Sox2, STAT3, Tcfcp2l1, and Zfx) involved in mouse embryonic stem cell pluripotency and self-renewal. Results from CisFinder and published motifs in the literature are shown for comparison. ‘Nb’ indicates the number of peak-enriched regions predicted by the peak-calling program MACS with an FDR threshold of 0.2 or a [formula image]-value threshold of [formula image]. ‘FMotif’ and ‘CisFinder’ indicate the closest matching motif logos found by these programs (all motif logos were generated using the web-based tool Weblogo [45]). ‘Literature’ indicates the corresponding motif logos published in the literature. ‘[formula image]’ indicates the number of binding sites found by either FMotif or CisFinder, and ‘Rank’ after ‘[formula image]’ is the rank of a reported motif found by either FMotif or CisFinder.
Figure 2. Motifs in 4 human TF ChIP-seq data sets.
FMotif was tested with four widely used human ChIP-seq data sets for four DNA-binding TFs: CTCF (CCCTC-binding factor, named CTCF([formula image])), FoxA1 (hepatocyte nuclear factor 3α [42]), NRSF (neuron-restrictive silencer factor [2]), and STAT1 (signal transducer and activator of transcription protein [1]). Results from CisFinder and published motifs in the literature are shown for comparison. Column definitions are the same as those in Figure 1.
Figure 3. FMotif sensitivity.
FMotif sensitivity was measured using an NRSF-positive TFBS set (NRSF/qPCR) composed of 83 binding sites verified by qPCR, together with ChIP-exo data sets for four yeast DNA-binding TFs (Reb1, Gal4, Phd1, and Rap1) and one human TF (CTCF). Results from CisFinder and published motifs in the literature are shown for comparison. Column definitions are the same as those in Figure 1.
Figure 4. An example of a suffix tree and a tree representation of pattern space.
(a) The suffix tree of the sequence GAGAC. (b) A tree representation of pattern space in the search for an (l, d) motif.
Figure 5. An example of a (4,1) motif search using FMotif.
Figures (a)–(f) illustrate the search process of (4, 1) motifs on the mismatched suffix tree of the sequence GAGAC.
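
Figures 4 and 5 refer to a suffix tree of GAGAC and a (4, 1) motif search over a tree of candidate patterns. The sketch below illustrates that general idea under stated assumptions: it builds an uncompressed suffix trie (not a compressed suffix tree), walks the pattern space depth-first, and prunes any prefix that no longer matches a path in the trie within d mismatches. The captions do not give details of FMotif's "mismatched suffix tree", so this should not be read as the authors' implementation; all function names are hypothetical.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie as nested dicts: one path per suffix."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def search_patterns(trie, l, d):
    """Return every length-l pattern that matches some length-l path
    of the trie with at most d mismatches."""
    found = []

    def extend(prefix, states):
        # states: (trie node, mismatches used so far) for every trie
        # path that still matches the current prefix within d mismatches
        if len(prefix) == l:
            found.append(prefix)
            return
        for ch in "ACGT":
            next_states = []
            for node, used in states:
                for edge, child in node.items():
                    cost = used + (edge != ch)
                    if cost <= d:
                        next_states.append((child, cost))
            if next_states:  # prune branches with no surviving path
                extend(prefix + ch, next_states)

    extend("", [(trie, 0)])
    return found


if __name__ == "__main__":
    trie = build_suffix_trie("GAGAC")
    patterns = search_patterns(trie, l=4, d=1)
    print(len(patterns), patterns)
```

GAGAC contains two 4-mers, GAGA and AGAC. Each has 13 patterns within one mismatch, and the two neighbourhoods do not overlap, so this search reports 26 candidate patterns.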

References

    1. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, et al. (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 4: 651–657.
    2. Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 316: 1497–1502.
    3. Wilbanks EG, Facciotti MT (2010) Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One 5: e11471.
    4. Rhee HS, Pugh BF (2011) Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147: 1408–1419.
    5. Zhao Y, Stormo GD (2011) Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat Biotechnol 29: 480–483.
