Profile-based string kernels for remote homology detection and motif extraction
- PMID: 16448009
- DOI: 10.1109/csb.2004.1332428
Abstract
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" -- short regions of the original profile that contribute almost all the weight of the SVM classification score -- and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets.
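To make the idea of position-dependent mutation neighborhoods concrete, here is a minimal Python sketch. It rests on assumptions the abstract does not state: a profile is represented as a list of per-position dictionaries mapping residues to probabilities, a k-mer belongs to the neighborhood of a profile window when its summed negative log-probability falls below a threshold `sigma`, and the kernel is the inner product of neighborhood-membership counts. The function names are hypothetical, and the brute-force enumeration of all k-mers stands in for the efficient data structure the abstract mentions, so this is only practical for tiny alphabets and small k.

```python
import math
from collections import Counter
from itertools import product


def mutation_neighborhood(profile_window, alphabet, sigma):
    """Return every k-mer whose negative log-probability under the
    profile window is below sigma (assumed neighborhood definition)."""
    k = len(profile_window)
    neighborhood = []
    for kmer in product(alphabet, repeat=k):
        cost = 0.0
        for column, letter in zip(profile_window, kmer):
            p = column.get(letter, 0.0)
            if p <= 0.0:
                cost = math.inf  # impossible residue at this position
                break
            cost -= math.log(p)
        if cost < sigma:
            neighborhood.append("".join(kmer))
    return neighborhood


def profile_features(profile, alphabet, k, sigma):
    """Count, for each k-mer, how many length-k profile windows
    include it in their mutation neighborhood."""
    counts = Counter()
    for start in range(len(profile) - k + 1):
        window = profile[start:start + k]
        counts.update(mutation_neighborhood(window, alphabet, sigma))
    return counts


def profile_kernel(profile_x, profile_y, alphabet, k, sigma):
    """Inner product of the two k-mer count vectors."""
    fx = profile_features(profile_x, alphabet, k, sigma)
    fy = profile_features(profile_y, alphabet, k, sigma)
    return sum(count * fy[kmer] for kmer, count in fx.items())


if __name__ == "__main__":
    # Toy two-letter alphabet and a three-position profile (illustrative only).
    toy_profile = [
        {"A": 0.9, "B": 0.1},
        {"A": 0.5, "B": 0.5},
        {"A": 0.1, "B": 0.9},
    ]
    print(profile_features(toy_profile, "AB", k=2, sigma=1.0))
    print(profile_kernel(toy_profile, toy_profile, "AB", k=2, sigma=1.0))
```

A kernel matrix built this way over all pairs of sequences could then be passed to any SVM implementation that accepts precomputed kernels. For the real 20-letter amino acid alphabet the enumeration above grows as 20^k, which is why an efficient traversal structure matters in practice.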