Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins

I Rigoutsos¹, A Floratos, C Ouzounis, Y Gao, L Parida

PMID: 10584071
DOI: 10.1002/(sici)1097-0134(19991101)37:2<264::aid-prot11>3.0.co;2-c

Proteins. 1999 Nov 1;37(2):264-77.

Affiliation

¹ Computational Biology Center, Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA. rigoutso@us.ibm.com

Abstract

Using Teiresias, a pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored the GenPept sequence database and built a dictionary of all sequence patterns with two or more instances. The entries of this dictionary, henceforth named seqlets, cover 98.12% of all amino acid positions in the input database and in essence provide a comprehensive finite set of descriptors for protein sequence space. As such, seqlets can be effectively used to describe almost every naturally occurring protein. In fact, seqlets can be thought of as building blocks of protein molecules that are a necessary (but not sufficient) condition for function or family equivalence memberships. Thus, seqlets can either define conserved family signatures or cut across molecular families and reveal previously undetected sequence signals deriving from functional convergence. Moreover, we show that seqlets can also capture structurally conserved motifs. The availability of a dictionary of seqlets that has been derived in such an unsupervised, hierarchical manner is generating new opportunities for addressing problems that range from reliable classification and the correlation of sequence fragments with functional categories to faster and more sensitive engines for homology searches, evolutionary studies, and protein structure prediction.
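The two quantities the abstract relies on, the number of instances of a pattern and the fraction of amino acid positions covered by the dictionary, can be illustrated with a minimal Python sketch. This is not Teiresias and does not discover patterns; it only scores patterns that are already given. The function names, the toy sequences, and the use of regular-expression syntax (`.` for a wildcard position, `[ST]` for a set of residues) are assumptions made for illustration, as is the choice to count every position spanned by a match, wildcards included, as covered.

```python
import re


def count_instances(sequences, pattern):
    """Count all (possibly overlapping) matches of a pattern across the sequences."""
    # A lookahead wraps the pattern so that overlapping matches are all reported.
    rx = re.compile(f"(?=({pattern}))")
    return sum(1 for seq in sequences for _ in rx.finditer(seq))


def coverage(sequences, patterns):
    """Fraction of residue positions spanned by at least one pattern match."""
    covered = [[False] * len(seq) for seq in sequences]
    for pattern in patterns:
        rx = re.compile(f"(?=({pattern}))")
        for i, seq in enumerate(sequences):
            for m in rx.finditer(seq):
                start, end = m.span(1)  # span of the captured match itself
                for j in range(start, end):
                    covered[i][j] = True
    total = sum(len(seq) for seq in sequences)
    hit = sum(sum(row) for row in covered)
    return hit / total if total else 0.0


# Hypothetical toy data, not taken from GenPept.
sequences = ["MAGSSGKTLL", "QQGAAGKSAA", "PPPPPPPP"]
candidates = ["G..GK[ST]", "WWW"]

# Keep only patterns with two or more instances, mirroring the dictionary criterion.
dictionary = [p for p in candidates if count_instances(sequences, p) >= 2]
fraction_covered = coverage(sequences, dictionary)
```

Here `G..GK[ST]` matches once in each of the first two sequences and is kept, while `WWW` has no instances and is dropped; the resulting one-entry dictionary covers 12 of the 28 residue positions.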
