Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins
- PMID: 10584071
- DOI: 10.1002/(sici)1097-0134(19991101)37:2<264::aid-prot11>3.0.co;2-c
Abstract
Using Teiresias, a pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored the GenPept sequence database and built a dictionary of all sequence patterns with two or more instances. The entries of this dictionary, henceforth named seqlets, cover 98.12% of all amino acid positions in the input database and in essence provide a comprehensive finite set of descriptors for protein sequence space. As such, seqlets can be effectively used to describe almost every naturally occurring protein. In fact, seqlets can be thought of as building blocks of protein molecules that are a necessary (but not sufficient) condition for function or family equivalence memberships. Thus, seqlets can either define conserved family signatures or cut across molecular families and reveal previously undetected sequence signals deriving from functional convergence. Moreover, we show that seqlets can also capture structurally conserved motifs. The availability of a dictionary of seqlets that has been derived in such an unsupervised, hierarchical manner is generating new opportunities for addressing problems that range from reliable classification and the correlation of sequence fragments with functional categories to faster and more sensitive engines for homology searches, evolutionary studies, and protein structure prediction.
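The two quantities the abstract reports, a dictionary of patterns with two or more instances and the fraction of amino acid positions those patterns cover, can be illustrated with a toy sketch. This is not the Teiresias algorithm: as a simplifying assumption it considers only exact, fixed-length substrings (k-mers), with no wildcards or variable-length patterns. The function names `build_toy_dictionary` and `coverage`, the parameter `k`, and the default values are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_toy_dictionary(sequences, k=4, min_support=2):
    """Collect every exact k-mer that occurs at least `min_support` times.

    Assumption: exact fixed-length substrings stand in for the
    variable-length, wildcard-containing patterns of the real method.
    """
    occurrences = defaultdict(list)  # k-mer -> [(sequence index, start offset)]
    for seq_idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            occurrences[seq[start:start + k]].append((seq_idx, start))
    # Keep only patterns with enough instances across the whole input set.
    return {kmer: hits for kmer, hits in occurrences.items()
            if len(hits) >= min_support}


def coverage(sequences, dictionary):
    """Fraction of residue positions covered by at least one dictionary entry."""
    covered = [[False] * len(seq) for seq in sequences]
    for kmer, hits in dictionary.items():
        for seq_idx, start in hits:
            for pos in range(start, start + len(kmer)):
                covered[seq_idx][pos] = True
    total = sum(len(seq) for seq in sequences)
    return sum(sum(row) for row in covered) / total if total else 0.0
```

Calling `build_toy_dictionary` on a list of protein strings and passing the result to `coverage` gives a rough analogue of the reported coverage figure. On real databases the exact k-mer restriction would behave very differently from the published method, so the sketch is useful only for understanding what "patterns with two or more instances" and "positions covered" mean.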