An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences
- PMID: 17237070
- DOI: 10.1093/bioinformatics/btl665
Abstract
Motivation: Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), identify frequent patterns directly from unaligned biological sequences without attempting to align them. Here we propose a new algorithm that is more efficient and offers more functionality than both PRATT2 and TEIRESIAS, and we discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets.
Results: In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns, but it has additional functionality, such as mining type II and type III patterns and finding discriminating patterns between two datasets.
Availability: The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
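As a rough illustration of the pattern growth idea named in the Results, the sketch below grows contiguous patterns one residue at a time and keeps only those that occur in at least a minimum number of sequences. It is not the authors' implementation: the abstract does not define the type I, II and III patterns, so the sketch covers only plain contiguous substrings, and the function names `mine_frequent_substrings` and `discriminating_patterns` and their parameters are hypothetical.

```python
from collections import defaultdict


def mine_frequent_substrings(sequences, min_support):
    """Return {pattern: support} for contiguous patterns that occur in at
    least `min_support` of the unaligned input sequences.

    Hypothetical helper: an assumption for illustration, not the paper's code.
    """
    results = {}

    def grow(pattern, occurrences):
        # occurrences: (sequence index, position just past the pattern's end)
        extensions = defaultdict(list)
        for seq_idx, end in occurrences:
            seq = sequences[seq_idx]
            if end < len(seq):
                extensions[seq[end]].append((seq_idx, end + 1))
        for residue, occs in sorted(extensions.items()):
            # Support counts distinct sequences, not individual occurrences.
            support = len({seq_idx for seq_idx, _ in occs})
            if support >= min_support:
                grown = pattern + residue
                results[grown] = support
                # Only frequent patterns are extended further, which prunes
                # the search: no superstring of an infrequent pattern can be
                # frequent.
                grow(grown, occs)

    # The empty pattern "ends" at every position of every sequence.
    start = [(i, p) for i, seq in enumerate(sequences) for p in range(len(seq))]
    grow("", start)
    return results


def discriminating_patterns(positives, negatives, min_support,
                            max_negative_support=0):
    """Keep patterns frequent in `positives` that appear in at most
    `max_negative_support` sequences of `negatives` (assumed criterion)."""
    kept = {}
    for pattern, support in mine_frequent_substrings(positives, min_support).items():
        negative_support = sum(pattern in seq for seq in negatives)
        if negative_support <= max_negative_support:
            kept[pattern] = (support, negative_support)
    return kept
```

For the toy input `["MKTAYIAK", "GGKTAYQ", "KTAWW"]` with `min_support=3`, the first function returns the six patterns `K`, `KT`, `KTA`, `T`, `TA` and `A`, each with support 3. Adding the single negative sequence `"TAYYY"` leaves only `K`, `KT` and `KTA` as discriminating patterns.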
Similar articles
- Mining sequential patterns for protein fold recognition. J Biomed Inform. 2008 Feb;41(1):165-79. doi: 10.1016/j.jbi.2007.05.004. Epub 2007 May 17. PMID: 17573243
- SIMAP--the similarity matrix of proteins. Bioinformatics. 2005 Sep 1;21 Suppl 2:ii42-6. doi: 10.1093/bioinformatics/bti1107. PMID: 16204123
- Identification of putative domain linkers by a neural network - application to a large sequence database. BMC Bioinformatics. 2006 Jun 27;7:323. doi: 10.1186/1471-2105-7-323. PMID: 16800897 Free PMC article.
- Protein arrays and pattern recognition: new tools to assist in the identification and management of autoimmune disease. Autoimmun Rev. 2006 Apr;5(4):234-41. doi: 10.1016/j.autrev.2005.07.007. Epub 2005 Aug 25. PMID: 16697963 Review.
- Microcomputer-assisted periodic pattern recognition in the primary structure of proteins. Biomed Biochim Acta. 1990;49(8-9):951-62. PMID: 2082933 Review.
Cited by
- Systematic discovery of complex insertions and deletions in human cancers. Nat Med. 2016 Jan;22(1):97-104. doi: 10.1038/nm.4002. Epub 2015 Dec 14. PMID: 26657142 Free PMC article.
- Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors. BMC Bioinformatics. 2015 Sep 29;16:314. doi: 10.1186/s12859-015-0731-9. PMID: 26415951 Free PMC article.
- PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-Seq data. Bioinformatics. 2012 Feb 15;28(4):479-86. doi: 10.1093/bioinformatics/btr712. Epub 2012 Jan 4. PMID: 22219203 Free PMC article.
- Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet. 2010 Oct 15;19(R2):R188-96. doi: 10.1093/hmg/ddq391. Epub 2010 Sep 15. PMID: 20843826 Free PMC article. Review.
- Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009 Nov 1;25(21):2865-71. doi: 10.1093/bioinformatics/btp394. Epub 2009 Jun 26. PMID: 19561018 Free PMC article.