Genome Res. 2013 Jan;23(1):181-94.

doi: 10.1101/gr.139881.112. Epub 2012 Sep 18.

P-value-based regulatory motif discovery using positional weight matrices

Holger Hartmann¹, Eckhart W Guthöhrlein, Matthias Siebert, Sebastian Luehr, Johannes Söding

PMID: 22990209
PMCID: PMC3530678
DOI: 10.1101/gr.139881.112

Holger Hartmann et al. Genome Res. 2013 Jan.

Affiliation

¹ Gene Center and Department of Biochemistry, Ludwig-Maximilians-Universität München, Feodor-Lynen-Straße 25, 81377 Munich, Germany.

Abstract

Analyzing gene regulatory networks requires knowledge of the sequence-dependent DNA/RNA binding affinities of proteins and noncoding RNAs. Often, these affinities are deduced from sets of sequences enriched in factor binding sites. Two classes of computational approaches exist. The first class describes binding motifs by sequence patterns and searches for the patterns with the highest statistical significance of enrichment. The second class uses the more powerful position weight matrices (PWMs) but, instead of maximizing the statistical significance of enrichment, maximizes a likelihood. Here we present XXmotif (eXhaustive evaluation of matriX motifs), the first PWM-based motif discovery method that can optimize PWMs by directly minimizing their P-values of enrichment. Optimization requires computing millions of enrichment P-values for thousands of PWMs. For a given PWM, the enrichment P-value is calculated efficiently from the match P-values of all possible motif placements in the input sequences using order statistics. The approach can naturally combine P-values for motif enrichment, conservation, and localization. On ChIP-chip/seq, miRNA knock-down, and coexpression data sets from yeast and metazoans, XXmotif outperformed state-of-the-art tools, both in the number of correctly identified motifs and in the quality of PWMs. In segmentation modules of D. melanogaster, we detect the known key regulators and several new motifs. In human core promoters, XXmotif reports most previously described motifs and eight novel motifs sharply peaked around the transcription start site, among them an Initiator motif similar to the fly and yeast versions. XXmotif's sensitivity, reliability, and usability will help to leverage the quickly accumulating wealth of functional genomics data.
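The order-statistics idea in the abstract can be illustrated with a small sketch. It is not XXmotif's actual formula. It assumes that match P-values are independent and uniform under the null hypothesis, so that the probability of seeing the k-th smallest of N P-values at or below a value p equals the binomial tail P(Bin(N, p) ≥ k). The function names `binom_tail` and `order_statistic_enrichment` are hypothetical.

```python
import math


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def order_statistic_enrichment(match_pvalues):
    """Return (k, P) for the order statistic that looks most enriched.

    Assumption: under the null, the match P-values are i.i.d. uniform,
    so the k-th smallest value p_(k) has tail probability
    P(Bin(N, p_(k)) >= k).
    """
    ps = sorted(match_pvalues)
    n = len(ps)
    best_k, best_p = 0, 1.0
    for k, p_k in enumerate(ps, start=1):
        p_enrich = binom_tail(n, k, p_k)
        if p_enrich < best_p:
            best_k, best_p = k, p_enrich
    return best_k, best_p


if __name__ == "__main__":
    # Three unusually good matches among five candidate placements.
    print(order_statistic_enrichment([0.5, 0.001, 0.9, 0.003, 0.002]))
```

Because the minimum is taken over all k, the returned value is optimistic; a real method would still have to correct for that selection step before reporting an enrichment P-value.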

Figures

**Figure 1.**
Overview of XXmotif with its three main stages. After an optional step to mask confounding sequence regions (blue), enrichment P-values of all 5-mers and gapped palindromic and tandemic 6-mer seed patterns are evaluated, and the best seeds are recursively extended by an optional gap and a motif position (red). Patterns are converted to PWMs and fed to the PWM stage (green). Here, similar PWMs are merged and then iteratively refined by optimizing the motif enrichment E-value. Finally, merging and refinement stages are iterated until convergence.
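The pattern-to-PWM step in this caption could look roughly like the sketch below. It is an illustration under stated assumptions, not XXmotif's conversion rule: each IUPAC symbol is spread uniformly over the bases it allows, and a small pseudocount (the hypothetical parameter `pseudocount`) keeps every probability above zero. The result is a per-position probability matrix, from which a log-odds PWM could be derived.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pattern_to_pwm(pattern, pseudocount=0.1):
    """Convert an IUPAC pattern into a list of per-position base probabilities.

    Assumption: allowed bases share the probability mass equally, and
    `pseudocount` is distributed uniformly over A, C, G, T.
    """
    pwm = []
    for symbol in pattern.upper():
        allowed = IUPAC[symbol]
        column = {}
        for base in "ACGT":
            weight = (1.0 if base in allowed else 0.0) / len(allowed)
            column[base] = (weight + pseudocount / 4) / (1 + pseudocount)
        pwm.append(column)
    return pwm


if __name__ == "__main__":
    for column in pattern_to_pwm("TATAWR"):
        print({base: round(prob, 3) for base, prob in column.items()})
```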

**Figure 2.**
Sensitivity of motif discovery tools on yeast ChIP-chip data. Shown is the number of correctly predicted transcription factor binding motifs within the top 1 (A) or top 4 predictions (B). Predictions are based on ChIP-enriched intergenic regions from 352 ChIP-chip experiments (Harbison et al. 2004). Three experimental reference sets are used to judge the correctness of motifs (red, green, blue). The dashed line separates the general-purpose motif discovery tools from ERMIT, which needs ChIP enrichment P-values. In the tool names, M indicates a fifth-order Markov model, C the use of conservation, and D the discriminative prior from the Hartemink lab (Gordân et al. 2010). XXmotif-noref and XXmotif-5-noref omit the PWM refinement and the latter version uses only 5-mer seeds.

**Figure 3.**
Reference-free PWM quality assessment on yeast ChIP-chip data. The curves quantify how well the scores of the reported PWMs can predict the ChIP enrichment of the sequences. Intergenic regions are ranked by their maximum PWM score. For each predicted PWM, a ROC curve with the number of correct predictions over the number of false predictions is computed, and the partial area under the 5% best-ranked false predictions of the ROC curve (pAUC) is calculated. The plots show the cumulative distributions of pAUC values (A,B) for all 247 ChIP-chip data sets that had at least 10 significantly enriched regions (P-value < 0.001). Regions with a ChIP enrichment P-value of <0.001 are defined as correct predictions, all other regions as false predictions. (C,D) Same as A and B but using a subset of 151 high-quality data sets. For “TOP 4,” the best of the top 4 reported motifs is evaluated. The average pAUC scores are listed in the figure legends.
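The pAUC measure described in this caption can be sketched as follows. The sketch assumes that each region already has a maximum PWM score and a boolean label (ChIP-enriched or not), that ties in score are broken arbitrarily, and that the area is normalized to the range 0 to 1; the normalization and the function name `partial_auc` are assumptions, not details taken from the paper.

```python
def partial_auc(scores, labels, fp_fraction=0.05):
    """Partial area under the ROC curve up to `fp_fraction` of the false predictions.

    scores: maximum PWM score per region (higher means better ranked).
    labels: True for regions counted as correct predictions, else False.
    Returns a value in [0, 1]; 1 means all correct regions outrank
    the best-ranked false predictions.
    """
    n_pos = sum(1 for is_pos in labels if is_pos)
    n_neg = len(labels) - n_pos
    max_fp = fp_fraction * n_neg
    if n_pos == 0 or max_fp <= 0:
        raise ValueError("need at least one correct and one false region")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    false_pos = 0
    area = 0.0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
            continue
        # Each false prediction adds a column of height `true_pos`;
        # the last column may be narrower than one unit.
        width = min(1.0, max_fp - false_pos)
        if width <= 0:
            break
        area += true_pos * width
        false_pos += 1
    return area / (n_pos * max_fp)


if __name__ == "__main__":
    scores = [9.0, 8.0] + [float(i) for i in range(20)]
    labels = [True, True] + [False] * 20
    print(partial_auc(scores, labels))
```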

**Figure 4.**
Top 4 benchmark results on 24 target sets for transcription factors from human, mouse, worm, and fly, as well as 10 target sets for microRNAs from human and mouse, all taken from the metazoan target set compendium (Linhart et al. 2008). The plot is adapted from Linhart et al. (2008). The “Source” column indicates the experimental procedure or database from which the target set was derived: gene-expression microarrays (Expr), ChIP-chip (CC), ChIP-DSL (C-DSL), DamID (van Steensel et al. 2001), or the Gene Ontology (GO) database (Ashburner et al. 2000). The black and gray boxes indicate the similarity of the predicted PWM to the reference motif in TRANSFAC or miRBase. Darker shades indicate closer similarity. “Set Size”: number of sequences within the input set.

**Figure 5.**
Motifs discovered in *cis*-regulatory modules for fly segmentation. The table lists all motifs that XXmotif reports up to an E-value of 0.5 on 54 segmentation modules responsible for patterning the anterior-posterior (AP) axis during early embryogenesis. To score conservation, multiple sequence alignments of *D. melanogaster*, 11 other *Drosophila* species, and *Anopheles gambiae* were supplied as input. For 18 of the 28 predicted motifs, similar literature motifs of transcription factors known to be involved in AP axis segmentation were assigned by TOMTOM (Gupta et al. 2007) or by visual inspection. Nine of the predicted motifs may describe transcription factors representing missing nodes in the transcriptional network.

**Figure 6.**
Human core promoter motifs discovered by XXmotif. (A) List of motifs up to an E-value of 0.1 in a set of 1871 human core promoter regions (−300 bp to +100 bp around TSS) from the eukaryotic promoter database (EPD) (Schmid et al. 2006). For 20 of the 39 predicted motifs, similar literature motifs were assigned by TOMTOM (Gupta et al. 2007) or us (last two columns). The motif at position 18, which was originally named Initiator (Xi et al. 2007), is actually the reverse complement of YY1. Eight novel, highly significant motifs, designated XX1 to XX6, XX1(rev), and XX3(rev), show positional distribution peaks near the TSS. XX6 is the canonical Initiator motif similar to elements found in *D. melanogaster* and *S. cerevisiae*. Ten motifs with a broad positional distribution are not shown. The positional distributions of the PWMs were obtained by scanning the PWMs over a larger region (−1000 bp to +500 bp) around the TSS. (B) Top eight motifs obtained with the core promoter sequences of the 65 genes annotated as coding for ribosomal proteins in EPD (Xi et al. 2007; FitzGerald et al. 2006; Parry et al. 2010).
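The positional distributions mentioned in this caption come from scanning PWMs over regions around the TSS. A minimal sketch of such a scan is shown below. It assumes a probability matrix in the form of one `{base: probability}` dictionary per position, a uniform background of 0.25, forward-strand matches only, and a fixed log-odds threshold; all names and parameters are hypothetical and do not describe XXmotif's own scanning procedure.

```python
import math
from collections import Counter


def scan_positions(pwm, sequence, tss_index, threshold, background=0.25):
    """Yield match start positions, relative to the TSS, scoring >= threshold.

    The score is the log2-odds of the window under `pwm` versus a uniform
    background. Windows containing non-ACGT symbols are skipped.
    """
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(
            math.log2(pwm[i][base] / background) for i, base in enumerate(window)
        )
        if score >= threshold:
            yield start - tss_index


def positional_histogram(pwm, sequences, tss_index, threshold, bin_width=10):
    """Count matches per position bin across sequences aligned at the TSS."""
    counts = Counter()
    for seq in sequences:
        for pos in scan_positions(pwm, seq.upper(), tss_index, threshold):
            counts[(pos // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Toy TATA-like matrix: 0.97 for the consensus base, 0.01 otherwise.
    toy_pwm = [
        {b: (0.97 if b == consensus else 0.01) for b in "ACGT"}
        for consensus in "TATA"
    ]
    print(positional_histogram(toy_pwm, ["GGGGTATAGGGG"], tss_index=6, threshold=5.0))
```

A sharply peaked histogram near position 0 would correspond to the kind of TSS-centered motif the caption describes; a flat histogram would correspond to a broad positional distribution.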

Cited by

ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery.
Li Y, Ni P, Zhang S, Li G, Su Z. Bioinformatics. 2019 Nov 1;35(22):4632-4639. doi: 10.1093/bioinformatics/btz290. PMID: 31070745.

Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences.
Siebert M, Söding J. Nucleic Acids Res. 2016 Jul 27;44(13):6055-69. doi: 10.1093/nar/gkw521. Epub 2016 Jun 9. PMID: 27288444.

Structural remodeling of AAA+ ATPase p97 by adaptor protein ASPL facilitates posttranslational methylation by METTL21D.
Petrović S, Roske Y, Rami B, Phan MHQ, Panáková D, Heinemann U. Proc Natl Acad Sci U S A. 2023 Jan 24;120(4):e2208941120. doi: 10.1073/pnas.2208941120. Epub 2023 Jan 19. PMID: 36656859.

Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data.
Zhang S, Ma A, Zhao J, Xu D, Ma Q, Wang Y. Brief Bioinform. 2022 Jan 17;23(1):bbab374. doi: 10.1093/bib/bbab374. PMID: 34607350.

The APT complex is involved in non-coding RNA transcription and is distinct from CPF.
Lidschreiber M, Easter AD, Battaglia S, Rodríguez-Molina JB, Casañal A, Carminati M, Baejen C, Grzechnik P, Maier KC, Cramer P, Passmore LA. Nucleic Acids Res. 2018 Nov 30;46(21):11528-11538. doi: 10.1093/nar/gky845. PMID: 30247719.

References

1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
2. Badis G, Chan ET, van Bakel H, Pena-Castillo L, Tillo D, Tsui K, Carlson CD, Gossett AJ, Hasinoff MJ, Warren CL, et al. 2008. A library of yeast transcription factor motifs reveals a widespread function for Rsc3 in targeting nucleosome exclusion at promoters. Mol Cell 32: 878–887.
3. Bailey TL, Elkan C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28–36.
4. Bailey TL, Bodén M, Whitington T, Machanick P. 2010. The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics 11: 179. doi: 10.1186/1471-2105-11-179.
5. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14: 708–715.
