PLoS Comput Biol. 2007 Mar 23;3(3):e39.

doi: 10.1371/journal.pcbi.0030039.

Discovering motifs in ranked lists of DNA sequences

Eran Eden¹, Doron Lipson, Sivan Yogev, Zohar Yakhini

PMID: 17381235
PMCID: PMC1829477
DOI: 10.1371/journal.pcbi.0030039

Eran Eden et al. PLoS Comput Biol. 2007.

Affiliation

¹ Computer Science Department, Technion, Haifa, Israel. eraneden@cs.technion.ac.il


Abstract

Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP-chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP-chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP-chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. 
Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP-chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. DRIM Flow Chart**
DRIM receives a list of DNA sequences as input and a criterion by which the sequences should be ranked, for example, TF binding signals as measured by ChIP–chip. (i) The sequences are ranked according to the criterion. (ii) A “blind search” is performed over all the motifs that reside in the restricted motif space (in this study the restricted motif space contains ∼100,000 motifs; see Methods, The DRIM software). For each motif an occurrence vector is generated. Each position in the vector is the number of motif occurrences in the corresponding sequence (the figure shows the vector for the motif CACGTGW). (iii) The motif significance is computed using the mHG scheme, and the optimal partition into target and background sets in terms of motif enrichment is identified. The promising motif seeds are passed as input to the heuristic motif search model and the rest are filtered out. (iv,v) The motif seeds are expanded in an iterative manner (the mHG is computed in each iteration) until a local optimum motif is found. (vi) The exact mHG p-value of the motif is computed. If it has a p-value < 10⁻³, then it is predicted as a true motif (the choice of this threshold is explained in Results, Proof of principle). The output of the system is the motif representation over the IUPAC alphabet, its PSSM, mHG p-value, and optimal set partition cutoff.
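As an illustration of step (ii), the occurrence vector for an IUPAC motif such as CACGTGW could be built roughly as follows. This is a minimal sketch and not DRIM's implementation: the function names `motif_to_regex` and `occurrence_vector` are hypothetical, and the choices to scan only the given strand and to count overlapping occurrences are assumptions made for the example.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif into a regex (hypothetical helper).

    The pattern is wrapped in a lookahead so that overlapping
    occurrences are each reported (an assumption of this sketch).
    """
    body = "".join(IUPAC[c] for c in motif.upper())
    return re.compile("(?=" + body + ")")


def occurrence_vector(ranked_sequences, motif):
    """Return one count per sequence, in the order the sequences are ranked."""
    pattern = motif_to_regex(motif)
    return [len(pattern.findall(seq.upper())) for seq in ranked_sequences]


# Toy sequences, assumed to be already sorted by the ranking criterion.
ranked = ["TTCACGTGAGG", "CACGTGTCACGTGA", "AAAA"]
print(occurrence_vector(ranked, "CACGTGW"))  # [1, 2, 0]
```

A real pipeline would also have to decide how to treat the reverse strand and palindromic motifs; the caption does not specify this, so the sketch leaves it out.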

**Figure 2. Comparison between Predictions of DRIM and Published Predictions of Six Other Methods and Conservation Data as Reported in [25]**
Overall, out of 162 unique TFs, DRIM identified significant motifs for 82 TFs with p-value <10⁻³. Out of the 162 TFs, DRIM and the other applications agree on 96 TFs: 27 TFs for which a similar motif was found and 69 TFs for which no significant motifs were found. There are five TFs for which the motifs predicted by DRIM and other applications differ; 11 for which the other applications identified motifs that DRIM did not; and 50 for which DRIM identified a motif that the other applications did not (for details see Tables S2 and S3). Sequence logos were generated using the *RNA Structure Logo* software [56].

**Figure 3. Examples of TFs for Which DRIM Identifies Novel Motifs**
We further investigated these motifs and show evidence of their biological function. YPD, H₂O₂, and SM denote the ChIP–chip experimental conditions [25] in which the motifs were identified.

**Figure 4. The Hypothetical Regulatory Network of Aro80**
Copies of the BS_Aro80 motif (on the sense and antisense strands) are shown as rectangles on the promoter regions. (A) BS_Aro80 is conserved in four strains of yeast as shown using the University of California Santa Cruz browser conservation plots. Aro80 regulates the utilization of secondary nitrogen sources such as aromatic amino acids by binding genes that participate in the catabolism of aromatic amino acids. We hypothesize that it also binds to its own promoter region and introduces a positive feedback self loop. (B) Part of the Aro80 promoter sequence is shown with bases of the BS_Aro80 motif colored in red. Interestingly, there are three GATA binding sites that are adjacent to the BS_Aro80 motif (bases colored in green). These sites bind GATA factors that are known to play a role in nitrogen catabolite repression. We hypothesize that they are also involved in the repression of Aro80 expression by physically binding to the region near BS_Aro80, thus making it inaccessible to Aro80 binding. This in turn breaks the positive feedback loop and represses the expression of Aro80 itself and other Aro80 regulated genes.

**Figure 5. Comparison between HG and mHG Enrichment**
The mHG and HG methods were applied to ChIP–chip data of six TFs. The sequences were ranked according to the ChIP–chip binding signal, and the enrichment of the correct binding motif was recorded using mHG and HG with fixed target sets containing the top 10, 100, and 1,000 sequences as well as all sequences with a ChIP–chip binding p-value < 10⁻³. All scores were corrected for multiple motif testing. The mHG score is also corrected for the multiple cutoff testing. The 10⁻³ and mHG cutoffs for each experiment are shown. It can be seen that the two cutoffs are significantly different and that for all the tested TFs mHG produces better results than HG in terms of enrichment of the true motif.

**Figure 6. Comparison of the Target Sets Sizes as Determined by the Fixed versus the mHG Flexible Cutoffs**
Each dot represents a ChIP–chip experiment where the x and y coordinates are the number of promoters with p < 10⁻³ (standard cutoff) and the number of promoters as determined by the mHG cutoff, respectively. The dotted line is x = y. TF names are given in Table S4.

**Figure 7. Motif Occurrences in the Top 59 (of ∼6,000) Promoters That Were Ranked According to Met32 Binding Signal**
A comparison is made between the data-driven mHG cutoff and the arbitrary fixed cutoff. It can be seen that the motifs are significantly more enriched when the list is partitioned using the mHG cutoff.

**Figure 8. Two-Dimensional Grid Used for Calculating mHG p-Value**
In this example N = 20, B = 10, p = 0.1. Light-shaded area describes all attainable values of n and b. Dark-shaded area describes the subset R: all values of n and b for which HGT(b;N,B,n) ≤ p. Two (0,0) → (N,B) paths are depicted, representing the binary label vectors λ₁ = {1,1,1,0,1,0,1,1,1,0,1,0,0,0,0,1,0,1,0,0} and λ₂ = {0,0,0,1,0,1,1,1,0,0,0,1,1,0,1,0,0,1,1,1}. The path λ₁ traverses R, demonstrating that mHG(λ₁) ≤ p. The path λ₂ does not traverse R, demonstrating that mHG(λ₂) > p.
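The grid in this caption can be reproduced with a short sketch. The code below assumes that HGT(b;N,B,n) is the hypergeometric tail probability of seeing at least b ones among the top n of N labels when B labels are ones in total, and that mHG(λ) is the minimum of that tail over all prefixes of λ. The p-value is then taken as the fraction of (0,0) → (N,B) paths that visit R, counted by dynamic programming over the grid. Function names are hypothetical, and exact fractions are used only to avoid floating-point ties at the boundary of R; this is a sketch under those assumptions, not the DRIM code.

```python
from fractions import Fraction
from math import comb


def hgt(b, N, B, n):
    """Hypergeometric tail P(X >= b) for n draws from N items, B of them ones."""
    tail = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1))
    return Fraction(tail, comb(N, n))


def mhg(labels):
    """Minimum hypergeometric tail over all prefixes of a binary label vector."""
    N, B = len(labels), sum(labels)
    best, b = Fraction(1), 0
    for n, label in enumerate(labels, start=1):
        b += label
        best = min(best, hgt(b, N, B, n))
    return best


def mhg_pvalue(p, N, B):
    """Fraction of (0,0)->(N,B) paths that enter R = {(n,b): HGT(b;N,B,n) <= p}."""
    paths = {0: 1}  # paths[b]: number of R-avoiding paths reaching (n, b)
    for n in range(1, N + 1):
        nxt = {}
        # Attainable b values: at most B ones and at most N - B zeros so far.
        for b in range(max(0, n - (N - B)), min(n, B) + 1):
            if hgt(b, N, B, n) <= p:
                continue  # (n, b) lies in R, so no avoiding path passes here
            count = paths.get(b, 0) + paths.get(b - 1, 0)
            if count:
                nxt[b] = count
        paths = nxt
    return 1 - Fraction(paths.get(B, 0), comb(N, B))


# The two label vectors from the caption (N = 20, B = 10, p = 0.1).
lam1 = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
lam2 = [0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
p = Fraction(1, 10)
print(mhg(lam1) <= p)  # True: the path of lam1 traverses R
print(mhg(lam2) <= p)  # False: the path of lam2 stays outside R
print(float(mhg_pvalue(mhg(lam1), 20, 10)))
```

Counting paths works here because, under the null assumption that all label vectors with B ones are equally likely, each path through the grid corresponds to exactly one label vector.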

**Figure 9. The Distribution of TFBS Occurrence Multiplicities per Intergenic Region in S. cerevisiae Is Shown for Five TFs Whose TFBS Motif Was Experimentally Verified**
Note that the y-axis is logarithmic. It can be seen that in most instances the TFBS appears in either zero, one, or two copies per intergenic region.


Cited by

Transcriptome sequencing supports a conservation of macrophage polarization in fish.
Wentzel AS, Petit J, van Veen WG, Fink IR, Scheer MH, Piazzon MC, Forlenza M, Spaink HP, Wiegertjes GF. Sci Rep. 2020 Aug 10;10(1):13470. doi: 10.1038/s41598-020-70248-y. PMID: 32778701. Free PMC article.
Conservation of Regional Variation in Sex-Specific Sex Chromosome Regulation.
Wright AE, Zimmer F, Harrison PW, Mank JE. Genetics. 2015 Oct;201(2):587-98. doi: 10.1534/genetics.115.179234. Epub 2015 Aug 5. PMID: 26245831. Free PMC article.
pH-Gated Succinate Secretion Regulates Muscle Remodeling in Response to Exercise.
Reddy A, Bozi LHM, Yaghi OK, Mills EL, Xiao H, Nicholson HE, Paschini M, Paulo JA, Garrity R, Laznik-Bogoslavski D, Ferreira JCB, Carl CS, Sjøberg KA, Wojtaszewski JFP, Jeppesen JF, Kiens B, Gygi SP, Richter EA, Mathis D, Chouchani ET. Cell. 2020 Oct 1;183(1):62-75.e17. doi: 10.1016/j.cell.2020.08.039. Epub 2020 Sep 17. PMID: 32946811. Free PMC article.
Blood pressure regulation by CD4⁺ lymphocytes expressing choline acetyltransferase.
Olofsson PS, Steinberg BE, Sobbi R, Cox MA, Ahmed MN, Oswald M, Szekeres F, Hanes WM, Introini A, Liu SF, Holodick NE, Rothstein TL, Lövdahl C, Chavan SS, Yang H, Pavlov VA, Broliden K, Andersson U, Diamond B, Miller EJ, Arner A, Gregersen PK, Backx PH, Mak TW, Tracey KJ. Nat Biotechnol. 2016 Oct;34(10):1066-1071. doi: 10.1038/nbt.3663. Epub 2016 Sep 12. PMID: 27617738. Free PMC article.
A structural-based statistical approach suggests a cooperative activity of PUM1 and miR-410 in human 3'-untranslated regions.
Leibovich L, Mandel-Gutfreund Y, Yakhini Z. Silence. 2010 Sep 22;1(1):17. doi: 10.1186/1758-907X-1-17. PMID: 20860814. Free PMC article.


References

1. Ren B, Robert F, Wyrick J, Aparicio O, Jennings E, et al. Genome-wide location and function of DNA binding proteins. Science. 2000;290:2306–2309.
2. Keshet I, Schlesinger Y, Farkash S, Rand E, Hecht M, et al. Evidence for an instructive mechanism of de novo methylation in cancer cells. Nat Genet. 2006;38:149–153.
3. Bussemaker H, Li H, Siggia E. Regulatory element detection using correlation with expression. Nat Genet. 2001;27:167–71.
4. Sinha S, Tompa M. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2002;30:5549–5560.
5. Sinha S, Tompa M. Ymf: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003;31:3586–3588.
