Genome Res. 2013 Jan;23(1):181-94.
doi: 10.1101/gr.139881.112. Epub 2012 Sep 18.

P-value-based regulatory motif discovery using positional weight matrices

Holger Hartmann et al. Genome Res. 2013 Jan.

Abstract

The sequence-dependent DNA/RNA binding affinities of proteins and noncoding RNAs are crucial for analyzing gene regulatory networks. Often, these affinities are deduced from sets of sequences enriched in factor binding sites. Two classes of computational approaches exist. The first class describes binding motifs by sequence patterns and searches for the patterns with the highest statistical significance of enrichment. The second class uses the more powerful position weight matrices (PWMs) but, instead of maximizing the statistical significance of enrichment, maximizes a likelihood. Here we present XXmotif (eXhaustive evaluation of matriX motifs), the first PWM-based motif discovery method that can optimize PWMs by directly minimizing their P-values of enrichment. Optimization requires computing millions of enrichment P-values for thousands of PWMs. For a given PWM, the enrichment P-value is calculated efficiently from the match P-values of all possible motif placements in the input sequences using order statistics. The approach can naturally combine P-values for motif enrichment, conservation, and localization. On ChIP-chip/seq, miRNA knock-down, and coexpression data sets from yeast and metazoans, XXmotif outperformed state-of-the-art tools, both in the number of correctly identified motifs and in the quality of PWMs. In segmentation modules of D. melanogaster, we detect the known key regulators and several new motifs. In human core promoters, XXmotif reports most previously described motifs and eight novel motifs sharply peaked around the transcription start site, among them an Initiator motif similar to the fly and yeast versions. XXmotif's sensitivity, reliability, and usability will help to leverage the quickly accumulating wealth of functional genomics data.
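The order-statistics idea mentioned in the abstract can be illustrated with a minimal sketch. It rests on a textbook fact: if N match P-values are independent and uniform under the null model, the probability that the k-th smallest is at most p equals the probability that a Binomial(N, p) variable is at least k. The function names below are hypothetical, and the abstract does not give XXmotif's actual formula, so this should be read as an illustration of the principle rather than the published method.

```python
from math import comb


def order_statistic_pvalue(p, k, n):
    """P(k-th smallest of n independent Uniform(0,1) P-values <= p).

    Equals the upper tail P(Binomial(n, p) >= k).
    """
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def best_enrichment(match_pvalues):
    """Return (smallest order-statistic P-value, k at which it occurs).

    Assumption for illustration: one match P-value per input sequence.
    """
    sorted_p = sorted(match_pvalues)
    n = len(sorted_p)
    candidates = [
        (order_statistic_pvalue(p, k, n), k)
        for k, p in enumerate(sorted_p, start=1)
    ]
    return min(candidates)


if __name__ == "__main__":
    value, k = best_enrichment([0.001, 0.002, 0.9, 0.5])
    print(f"most significant at k={k}: {value:.3g}")
```

Because the minimum is taken over all k, the returned value is optimistically biased and would need a correction for that selection before it could be interpreted as an enrichment P-value.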

Figures

Figure 1.
Overview of XXmotif with its three main stages. After an optional step to mask confounding sequence regions (blue), enrichment P-values of all 5-mers and gapped palindromic and tandemic 6-mer seed patterns are evaluated, and the best seeds are recursively extended by an optional gap and a motif position (red). Patterns are converted to PWMs and fed to the PWM stage (green). Here, similar PWMs are merged and then iteratively refined by optimizing the motif enrichment E-value. Finally, merging and refinement stages are iterated until convergence.
Figure 2.
Sensitivity of motif discovery tools on yeast ChIP-chip data. Shown is the number of correctly predicted transcription factor binding motifs within the top 1 (A) or top 4 predictions (B). Predictions are based on ChIP-enriched intergenic regions from 352 ChIP-chip experiments (Harbison et al. 2004). Three experimental reference sets are used to judge the correctness of motifs (red, green, blue). The dashed line separates the general-purpose motif discovery tools from ERMIT, which needs ChIP enrichment P-values. In the tool names, M indicates a fifth-order Markov model, C the use of conservation, and D the discriminative prior from the Hartemink lab (Gordân et al. 2010). XXmotif-noref and XXmotif-5-noref omit the PWM refinement and the latter version uses only 5-mer seeds.
Figure 3.
Reference-free PWM quality assessment on yeast ChIP-chip data. The curves quantify how well the scores of the reported PWMs can predict the ChIP enrichment of the sequences. Intergenic regions are ranked by their maximum PWM score. For each predicted PWM, a ROC curve with the number of correct predictions over the number of false predictions is computed, and the partial area under the 5% best-ranked false predictions of the ROC curve (pAUC) is calculated. The plots show the cumulative distributions of pAUC values (A,B) for all 247 ChIP-chip data sets that had at least 10 significantly enriched regions (P-value < 0.001). Regions with a ChIP enrichment P-value of <0.001 are defined as correct predictions, all other regions as false predictions. (C,D) Same as A and B but using a subset of 151 high-quality data sets. For “TOP 4,” the best of the top 4 reported motifs is evaluated. The average pAUC scores are listed in the figure legends.
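The pAUC evaluation described in this caption can be sketched as follows. Regions are ranked by their maximum PWM score, correct predictions are counted as the ranking is traversed, and the area is accumulated only until 5% of the false predictions have been seen. The function name, the handling of ties (left to the sort order), and the normalization so that a perfect ranking scores 1.0 are assumptions made for illustration; the caption does not specify them.

```python
def partial_auc(scores, is_enriched, max_fp_fraction=0.05):
    """Partial area under the ROC curve up to a fraction of false predictions.

    scores       -- maximum PWM score per region (higher ranks first)
    is_enriched  -- True where the region counts as a correct prediction
    Returns a value in [0, 1]; 1.0 means every enriched region outranks
    the best-ranked false predictions within the budget.
    """
    ranked = sorted(zip(scores, is_enriched), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, enriched in ranked if enriched)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one enriched and one non-enriched region")

    fp_budget = max_fp_fraction * n_neg  # may be fractional
    true_pos = 0
    false_pos = 0
    area = 0.0
    for _, enriched in ranked:
        if enriched:
            true_pos += 1
            continue
        # Each false prediction adds a column of height true_pos;
        # the last column is clipped to the remaining budget.
        width = min(1.0, fp_budget - false_pos)
        if width <= 0:
            break
        area += true_pos * width
        false_pos += 1
    return area / (n_pos * fp_budget)


if __name__ == "__main__":
    print(partial_auc([4, 3, 2, 1], [True, False, True, False], max_fp_fraction=1.0))
```

With `max_fp_fraction=1.0` the function reduces to the ordinary normalized area under the ROC curve, which makes it easy to check against small hand-worked cases.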
Figure 4.
Top 4 benchmark results on 24 target sets for transcription factors from human, mouse, worm, and fly, as well as 10 target sets for microRNAs from human and mouse, all taken from the metazoan target set compendium (Linhart et al. 2008). The plot is adapted from Linhart et al. (2008). The “Source” column indicates the experimental procedure or database from which the target set was derived: gene-expression microarrays (Expr), ChIP-chip (CC), ChIP-DSL (C-DSL), DamID (van Steensel et al. 2001), or the Gene Ontology (GO) database (Ashburner et al. 2000). The black and gray boxes indicate the similarity of the predicted PWM to the reference motif in TRANSFAC or miRBase. Darker shades indicate closer similarity. “Set Size”: number of sequences within the input set.
Figure 5.
Motifs discovered in cis-regulatory modules for fly segmentation. The table lists all motifs that XXmotif reports up to an E-value of 0.5 on 54 segmentation modules responsible for patterning the anterior-posterior (AP) axis during early embryogenesis. To score conservation, multiple sequence alignments of D. melanogaster, 11 other Drosophila species, and Anopheles gambiae were supplied as input. For 18 of the 28 predicted motifs, similar literature motifs of transcription factors known to be involved in AP axis segmentation were assigned by TOMTOM (Gupta et al. 2007) or by visual inspection. Nine of the predicted motifs may describe transcription factors representing missing nodes in the transcriptional network.
Figure 6.
Human core promoter motifs discovered by XXmotif. (A) List of motifs up to an E-value of 0.1 in a set of 1871 human core promoter regions (−300 bp to +100 bp around TSS) from the eukaryotic promoter database (EPD) (Schmid et al. 2006). For 20 of the 39 predicted motifs, similar literature motifs were assigned by TOMTOM (Gupta et al. 2007) or us (last two columns). The motif at position 18, which was originally named Initiator (Xi et al. 2007), is actually the reverse complement of YY1. Eight novel, highly significant motifs, designated XX1 to XX6, XX1(rev), and XX3(rev), show positional distribution peaks near the TSS. XX6 is the canonical Initiator motif similar to elements found in D. melanogaster and S. cerevisiae. Ten motifs with a broad positional distribution are not shown. The positional distributions of the PWMs were obtained by scanning the PWMs over a larger region (−1000 bp to +500 bp) around the TSS. (B) Top eight motifs obtained with the core promoter sequences of the 65 genes annotated as coding for ribosomal proteins in EPD (Xi et al. 2007; FitzGerald et al. 2006; Parry et al. 2010).

References

    1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29
    2. Badis G, Chan ET, van Bakel H, Pena-Castillo L, Tillo D, Tsui K, Carlson CD, Gossett AJ, Hasinoff MJ, Warren CL, et al. 2008. A library of yeast transcription factor motifs reveals a widespread function for Rsc3 in targeting nucleosome exclusion at promoters. Mol Cell 32: 878–887
    3. Bailey TL, Elkan C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28–36
    4. Bailey TL, Bodén M, Whitington T, Machanick P. 2010. The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics 11: 179. doi: 10.1186/1471-2105-11-179
    5. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14: 708–715
