A generic motif discovery algorithm for sequential data
- PMID: 16257985
- DOI: 10.1093/bioinformatics/bti745
Abstract
Motivation: Motif discovery in sequential data is a problem of great interest with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations, and each is typically applicable only to a certain class of problems.
Results: Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. The algorithm also allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acid sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures.
Availability: Gemoda is freely available at http://web.mit.edu/bamel/gemoda
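For readers unfamiliar with the (l,d)-motif problem mentioned in the Results, the following is a minimal brute-force sketch of the problem statement only: find every string of length l that occurs in each input sequence with at most d mismatches. It is not Gemoda and does not reflect how Gemoda works internally; the function names (`hamming`, `occurs_within`, `ld_motifs`) and the toy sequences are hypothetical and chosen purely for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, sequence, d):
    """True if some window of `sequence` is within d mismatches of `candidate`."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def ld_motifs(sequences, l, d, alphabet="ACGT"):
    """Enumerate every length-l string over `alphabet` (exponential in l)
    and keep those that occur in all sequences with at most d mismatches."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, d) for s in sequences):
            motifs.append(candidate)
    return motifs


# Hypothetical toy input: "ACGT" appears exactly in the first sequence and
# with one mismatch in the other two ("ACCT", "ACGA").
toy_sequences = ["TTACGTAA", "GGACCTCC", "CAACGAGT"]
candidates = ld_motifs(toy_sequences, l=4, d=1)
```

Because the enumeration grows as 4^l for DNA, this sketch is only practical for very small l; it is meant to make the problem definition concrete, not to serve as a usable solver.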