Comparative Study
Nucleic Acids Res. 2005 Mar 10;33(5):1445-53.
doi: 10.1093/nar/gki282. Print 2005.

NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence


Thomas A Down et al. Nucleic Acids Res.

Abstract

NestedMICA is a new, scalable pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, its use of a newly developed inference strategy called Nested Sampling means that NestedMICA can find optimal solutions without a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range of scenarios, on synthetic data and on a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than the existing program could handle. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs that were good predictors for all five abundant classes of annotated binding sites.
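
To make the idea of Nested Sampling more concrete, the following is a minimal sketch of the generic technique on a toy one-dimensional problem. It is not NestedMICA's implementation: the function name `nested_sampling`, the Gaussian toy likelihood, the uniform prior, and the use of plain rejection sampling to replace the worst live point are all assumptions chosen for illustration. The point it shows is that the live set starts from random prior draws, so no hand-picked seed or starting motif is needed.

```python
import math
import random


def nested_sampling(log_likelihood, sample_prior, n_live=100, n_iter=600, rng=None):
    """Estimate the evidence Z = integral of L(theta) over the prior.

    Illustrative only: the worst live point is replaced by rejection
    sampling from the prior, which is practical only for toy problems.
    """
    rng = rng or random.Random(0)
    # The live set is drawn from the prior; there is no seeding step.
    live = [sample_prior(rng) for _ in range(n_live)]
    live_logl = [log_likelihood(theta) for theta in live]

    evidence = 0.0
    x_prev = 1.0  # prior volume still enclosed by the likelihood contour
    for i in range(1, n_iter + 1):
        worst = min(range(n_live), key=live_logl.__getitem__)
        logl_star = live_logl[worst]
        # Expected shrinkage of the enclosed prior volume after i steps.
        x_i = math.exp(-i / n_live)
        evidence += math.exp(logl_star) * (x_prev - x_i)
        x_prev = x_i
        # Replace the worst point with a prior draw of higher likelihood.
        while True:
            candidate = sample_prior(rng)
            candidate_logl = log_likelihood(candidate)
            if candidate_logl > logl_star:
                break
        live[worst] = candidate
        live_logl[worst] = candidate_logl

    # The remaining live points share the remaining prior volume.
    evidence += x_prev * sum(math.exp(logl) for logl in live_logl) / n_live
    return evidence


# Toy problem (assumed): uniform prior on [0, 1], Gaussian-shaped likelihood.
SIGMA = 0.1


def log_likelihood(theta):
    return -((theta - 0.5) ** 2) / (2 * SIGMA ** 2)


def sample_prior(rng):
    return rng.random()


estimate = nested_sampling(log_likelihood, sample_prior, rng=random.Random(0))
exact = SIGMA * math.sqrt(2 * math.pi)  # ignores the negligible tails outside [0, 1]
print(estimate, exact)
```

In a motif-finding setting the parameter would be a whole set of weight matrices rather than a single number, and the replacement step would need a more efficient constrained sampler than rejection; the sketch above only illustrates the shrinking-contour bookkeeping.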


Figures

Figure 1. The zero-or-one occurrences per sequence (ZOOPS) sequence mixture model (SMM), represented as a hidden Markov model (HMM). The states labelled m1–m4 are responsible for modelling the interesting motif, while the other states model the non-interesting remainder of the sequence.
Figure 2. A multiple-uncounted SMM containing two motifs. The black dots are silent states, which are not responsible for modelling any part of the sequence.
Figure 3. Likelihoods of a set of test sequences, given mosaic background models of various orders and class numbers.
Figure 4. (a) The original HLF motif from JASPAR. (b) Results for searching for HLF in a set of 150 base sequences using MEME. (c) MEME with 200 base sequences. (d) NestedMICA with 600 base sequences. (e) NestedMICA with 700 base sequences.
Figure 5. A selection of mammalian JASPAR weight matrices that are used for synthetic data tests.
Figure 6. ROC curves for the best matches to the SRE sites in the NestedMICA and MEME results.
Figure 7. The MEF2 motif derived from curated sites, and the corresponding high-scoring motifs from NestedMICA and MEME.
Figure 8. The SRE motif derived from curated sites, and the corresponding high-scoring motifs from NestedMICA and MEME.
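
The ZOOPS model described in the Figure 1 caption can be sketched as a simple mixture likelihood: a sequence is either all background, or background with exactly one motif occurrence at some position. The code below is an illustrative sketch under stated assumptions, not the paper's HMM: it assumes a zero-order background, a uniform prior over motif start positions, and a single strand. The function name `zoops_log_likelihood` and the example weight matrix are hypothetical.

```python
import math


def zoops_log_likelihood(seq, pwm, motif_prob, background):
    """Log-likelihood of one sequence under a zero-or-one-occurrence mixture.

    pwm        : list of per-column dicts mapping base -> probability
    motif_prob : prior probability that the sequence contains the motif
    background : dict mapping base -> background probability (zero-order)
    """
    width = len(pwm)
    n_starts = len(seq) - width + 1
    log_bg = sum(math.log(background[base]) for base in seq)
    if n_starts <= 0:
        # Sequence too short to hold the motif: background only.
        return log_bg

    # Sum of motif-versus-background likelihood ratios over all start positions.
    ratio_sum = 0.0
    for start in range(n_starts):
        ratio = 1.0
        for offset, column in enumerate(pwm):
            base = seq[start + offset]
            ratio *= column[base] / background[base]
        ratio_sum += ratio

    # P(seq) = P_bg(seq) * [(1 - motif_prob) + motif_prob * mean ratio]
    mixture = (1.0 - motif_prob) + motif_prob * ratio_sum / n_starts
    if mixture <= 0.0:
        return float("-inf")
    return log_bg + math.log(mixture)


# Hypothetical four-column motif, echoing the four motif states m1-m4.
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column(preferred):
    return {base: (0.85 if base == preferred else 0.05) for base in "ACGT"}


example_pwm = [column(base) for base in "TGCA"]

with_motif = zoops_log_likelihood("AATGCAGG", example_pwm, 0.5, uniform_bg)
without_motif = zoops_log_likelihood("AACCCCGG", example_pwm, 0.5, uniform_bg)
print(with_motif, without_motif)
```

Working with likelihood ratios against the background keeps the per-position products small in number, and returning a log-likelihood avoids underflow on long sequences.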


