PLoS Comput Biol. 2007 Mar 23;3(3):e39.
doi: 10.1371/journal.pcbi.0030039.

Discovering motifs in ranked lists of DNA sequences

Eran Eden et al. PLoS Comput Biol.

Abstract

Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP-chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP-chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP-chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. 
Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP-chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. DRIM Flow Chart
DRIM receives as input a list of DNA sequences and a criterion by which the sequences should be ranked, for example, TF binding signals as measured by ChIP–chip. (i) The sequences are ranked according to the criterion. (ii) A “blind search” is performed over all the motifs that reside in the restricted motif space (in this study the restricted motif space contains ∼100,000 motifs; see Methods, The DRIM software). For each motif an occurrence vector is generated. Each position in the vector is the number of motif occurrences in the corresponding sequence (the figure shows the vector for the motif CACGTGW). (iii) The motif significance is computed using the mHG scheme, and the optimal partition into target and background sets in terms of motif enrichment is identified. The promising motif seeds are passed as input to the heuristic motif search model and the rest are filtered out. (iv,v) The motif seeds are expanded iteratively (the mHG score is recomputed in each iteration) until a locally optimal motif is found. (vi) The exact mHG p-value of the motif is computed. If the p-value is < 10⁻³, the motif is predicted to be a true motif (the choice of this threshold is explained in Results, Proof of principle). The output of the system is the motif representation in IUPAC notation, its PSSM, its mHG p-value, and the optimal set partition cutoff.
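As a rough illustration of step (ii), the following Python sketch builds an occurrence vector for an IUPAC motif over a ranked list of sequences. The function name `occurrence_vector`, the toy sequences, and the choice to count overlapping matches on the given strand only are assumptions made for this example; none of them is taken from the DRIM software.

```python
import re

# Standard IUPAC nucleotide codes, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def occurrence_vector(motif, ranked_sequences):
    """Count motif occurrences in each sequence, in rank order.

    Assumption: only the given strand is scanned and overlapping
    matches are counted; the real tool may count differently.
    """
    # A zero-width lookahead lets the regex report overlapping matches.
    pattern = re.compile("(?=" + "".join(IUPAC[c] for c in motif) + ")")
    return [len(pattern.findall(seq.upper())) for seq in ranked_sequences]

# Hypothetical sequences, already sorted by the ranking criterion.
ranked = ["TTCACGTGAACACGTGT", "GGCACGTGTCC", "ACGTACGT", "CACGTGC"]
vector = occurrence_vector("CACGTGW", ranked)
print(vector)  # [2, 1, 0, 0]
```

The last toy sequence contains CACGTG followed by C, which W (A or T) does not match, so its count is zero.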
Figure 2. Comparison between Predictions of DRIM and Published Predictions of Six Other Methods and Conservation Data as Reported in [25]
Overall, out of 162 unique TFs, DRIM identified significant motifs for 82 TFs with p-value < 10⁻³. Out of the 162 TFs, DRIM and the other applications agree on 96 TFs: 27 TFs for which a similar motif was found and 69 TFs for which no significant motifs were found. There are five TFs for which the motifs predicted by DRIM and other applications differ; 11 for which the other applications identified motifs that DRIM did not; and 50 for which DRIM identified a motif that the other applications did not (for details see Tables S2 and S3). Sequence logos were generated using the RNA Structure Logo software [56].
Figure 3. Examples of TFs for Which DRIM Identifies Novel Motifs
We further investigated these motifs and show evidence of their biological function. YPD, H2O2, and SM denote the ChIP–chip experimental conditions [25] in which the motifs were identified.
Figure 4. The Hypothetical Regulatory Network of Aro80
Copies of the BSAro80 motif (on the sense and antisense strands) are shown as rectangles on the promoter regions. (A) BSAro80 is conserved in four strains of yeast, as shown using the University of California Santa Cruz browser conservation plots. Aro80 regulates the utilization of secondary nitrogen sources such as aromatic amino acids by binding genes that participate in the catabolism of aromatic amino acids. We hypothesize that it also binds to its own promoter region and introduces a positive feedback self loop. (B) Part of the Aro80 promoter sequence is shown with bases of the BSAro80 motif colored in red. Interestingly, there are three GATA binding sites adjacent to the BSAro80 motif (bases colored in green). These sites bind GATA factors that are known to play a role in nitrogen catabolite repression. We hypothesize that they are also involved in the repression of Aro80 expression by physically binding to the region near BSAro80, thus making it inaccessible to Aro80 binding. This in turn breaks the positive feedback loop and represses the expression of Aro80 itself and other Aro80-regulated genes.
Figure 5. Comparison between HG and mHG Enrichment
The mHG and HG methods were applied to ChIP–chip data of six TFs. The sequences were ranked according to the ChIP–chip binding signal, and the enrichment of the correct binding motif was recorded using mHG and HG with fixed target sets containing the top 10, 100, and 1,000 sequences as well as all sequences with ChIP–chip signal < 10⁻³. All scores were corrected for multiple motif testing. The mHG score is also corrected for multiple cutoff testing. The 10⁻³ and mHG cutoffs for each experiment are shown. It can be seen that the two cutoffs are significantly different and that, for all the tested TFs, mHG produces better results than HG in terms of enrichment of the true motif.
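The difference between a fixed-cutoff HG score and the mHG score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DRIM implementation: the labels are binary (motif present or absent), HGT(b;N,B,n) is taken to be the hypergeometric tail probability of seeing at least b motif-containing sequences among the top n, and the function names `hgt` and `mhg_score` are hypothetical. The scores here are raw, with none of the multiple-testing corrections mentioned in the caption.

```python
from math import comb

def hgt(b, N, B, n):
    """Hypergeometric tail P(X >= b): N sequences, B with the motif, top n taken."""
    hits = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1))
    return hits / comb(N, n)

def mhg_score(labels):
    """Minimum HGT over all cutoffs n; returns (score, best cutoff).

    `labels` is a ranked 0/1 vector (1 = sequence contains the motif).
    The score is NOT corrected for having tried many cutoffs.
    """
    N, B = len(labels), sum(labels)
    best, best_n, b = 1.0, 0, 0
    for n, label in enumerate(labels, start=1):
        b += label                      # motif-containing sequences in the top n
        p = hgt(b, N, B, n)
        if p < best:
            best, best_n = p, n
    return best, best_n

# Toy ranked list: all three motif occurrences sit at the very top.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
score, cutoff = mhg_score(labels)       # data-driven cutoff
fixed = hgt(sum(labels[:5]), len(labels), sum(labels), 5)  # arbitrary top-5 cutoff
print(cutoff, score, fixed)
```

In this toy case the data-driven cutoff lands at n = 3 with a score of 1/120, whereas the arbitrary top-5 cutoff gives 1/12. The flexible cutoff has to be paid for with a multiple-cutoff correction, which this sketch omits.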
Figure 6. Comparison of the Target Sets Sizes as Determined by the Fixed versus the mHG Flexible Cutoffs
Each dot represents a ChIP–chip experiment where the x and y coordinates are the number of promoters with p < 10⁻³ (standard cutoff) and the number of promoters as determined by the mHG cutoff, respectively. The dotted line is x = y. TF names are given in Table S4.
Figure 7. Motif Occurrences in the Top 59 (of ∼6,000) Promoters That Were Ranked According to Met32 Binding Signal
A comparison is made between the data-driven mHG cutoff and the arbitrary fixed cutoff. It can be seen that the motifs are significantly more enriched when the list is partitioned using the mHG cutoff.
Figure 8. Two-Dimensional Grid Used for Calculating mHG p-Value
In this example N = 20, B = 10, p = 0.1. The light-shaded area describes all attainable values of n and b. The dark-shaded area describes the subset R: all values of n and b for which HGT(b;N,B,n) ≤ p. Two (0,0) → (N,B) paths are depicted, representing the binary label vectors λ1 = {1,1,1,0,1,0,1,1,1,0,1,0,0,0,0,1,0,1,0,0} and λ2 = {0,0,0,1,0,1,1,1,0,0,0,1,1,0,1,0,0,1,1,1}. The path λ1 traverses R, demonstrating that mHG(λ1) ≤ p. The path λ2 does not traverse R, demonstrating that mHG(λ2) > p.
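The grid picture suggests a simple way to compute an exact p-value: count the (0,0) → (N,B) paths that never enter R and subtract their share from 1. The Python sketch below follows that idea under stated assumptions: HGT is taken to be the hypergeometric tail P(X ≥ b), cutoffs run over n = 1..N−1, every label vector with B ones is equally likely under the null, and the function names are hypothetical. It is an illustration of the counting argument, not the authors' code.

```python
from math import comb

def hgt(b, N, B, n):
    """Hypergeometric tail P(X >= b), X = number of 1-labels among the top n."""
    hits = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1))
    return hits / comb(N, n)

def enters_R(labels, p):
    """True if the path of `labels` visits a grid point (n, b) with HGT <= p."""
    N, B, b = len(labels), sum(labels), 0
    for n, label in enumerate(labels[:-1], start=1):   # cutoffs n = 1..N-1
        b += label
        if hgt(b, N, B, n) <= p:
            return True
    return False

def mhg_pvalue(p, N, B):
    """Fraction of the C(N, B) label vectors whose path enters R."""
    paths = [1] + [0] * B     # paths[b]: R-avoiding paths from (0,0) to (n, b)
    for n in range(1, N):
        new = [0] * (B + 1)
        for b in range(max(0, n - (N - B)), min(n, B) + 1):  # attainable b
            if hgt(b, N, B, n) > p:                          # (n, b) outside R
                new[b] = paths[b] + (paths[b - 1] if b > 0 else 0)
        paths = new
    # Final step into (N, B): from (N-1, B) via a 0 or from (N-1, B-1) via a 1.
    avoiding = paths[B] + (paths[B - 1] if B > 0 else 0)
    return 1 - avoiding / comb(N, B)

# The two label vectors from the caption (N = 20, B = 10, p = 0.1).
lam1 = [1,1,1,0,1,0,1,1,1,0,1,0,0,0,0,1,0,1,0,0]
lam2 = [0,0,0,1,0,1,1,1,0,0,0,1,1,0,1,0,0,1,1,1]
print(enters_R(lam1, 0.1))   # True
print(enters_R(lam2, 0.1))   # False
print(mhg_pvalue(0.1, 20, 10))
```

For λ1 the path reaches (n, b) = (9, 7), where the tail probability is about 0.035, so it enters R; λ2 never has more ones in its top n than expected by chance, so it stays outside. Because the test `HGT <= p` is done in floating point, a threshold p that coincides exactly with an attainable HGT value may need a small tolerance in practice.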
Figure 9. The Distribution of TFBS Occurrence Multiplicities per Intergenic Region in S. cerevisiae Is Shown for Five TFs Whose TFBS Motif Was Experimentally Verified
Note that the y-axis is logarithmic. It can be seen that in most instances the TFBS appears in either zero, one, or two copies per intergenic region.

References

    1. Ren B, Robert F, Wyrick J, Aparicio O, Jennings E, et al. Genome-wide location and function of DNA binding proteins. Science. 2000;290:2306–2309.
    2. Keshet I, Schlesinger Y, Farkash S, Rand E, Hecht M, et al. Evidence for an instructive mechanism of de novo methylation in cancer cells. Nat Genet. 2006;38:149–153.
    3. Bussemaker H, Li H, Siggia E. Regulatory element detection using correlation with expression. Nat Genet. 2001;27:167–171.
    4. Sinha S, Tompa M. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2002;30:5549–5560.
    5. Sinha S, Tompa M. YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003;31:3586–3588.