High performance set of PseAAC and sequence based descriptors for protein classification
- PMID: 20558184
- DOI: 10.1016/j.jtbi.2010.06.006
Abstract
The study of reliable automatic systems for protein classification is important for several domains, including finding novel drugs and vaccines. The last decade has seen a number of advances in the development of reliable systems for classifying proteins. Of particular interest has been the exploration of new methods for extracting features from a protein that enhance classification for a given problem. Most methods developed to date, however, have been evaluated in only one or two application areas. Methods that generalize well across a number of application areas and datasets have not been explored. The aim of this study is to find a general method, or an ensemble of methods, that works well on different protein classification datasets and problems. Towards this end, we evaluate several feature extraction approaches for representing proteins starting from their amino acid sequence, as well as different feature descriptor combinations, using an ensemble of classifiers (support vector machines). In our experiments, more than ten different protein descriptors are compared using nine different datasets. We develop our system using a blind testing protocol, where the parameters of the system are optimized using one dataset and then validated using the other datasets (and so on for each dataset). Although different stand-alone classifiers work well on some datasets and not on others, we have discovered that fusion among different methods achieves good performance across all the tested datasets, especially when using the weighted sum rule. Our feature descriptor combinations include two new descriptors, one based on wavelets and the other based on amino acid groups. Using our system, both outperform their standard implementations. We also consider as a baseline the simple amino acid composition (AC) and dipeptide composition (2G), since they have been widely used for protein classification. Our proposed method outperforms AC and 2G.
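As an illustration of the baseline descriptors and the fusion rule named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the usual definitions: AC as the relative frequency of each of the 20 standard residues, 2G as the relative frequency of each of the 400 ordered residue pairs, and the weighted sum rule as a weighted sum of per-class scores from several classifiers. The function names, the handling of non-standard residues, and the example weights are all hypothetical; the abstract does not specify them.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """AC descriptor: fraction of the sequence made up by each standard residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """2G descriptor: fraction of adjacent residue pairs equal to each dipeptide.

    Assumption: pairs containing a non-standard residue are counted in the
    denominator but do not contribute to any of the 400 features.
    """
    seq = seq.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]


def weighted_sum_fusion(score_lists, weights):
    """Weighted sum rule: combine per-class scores from several classifiers.

    score_lists[j][k] is the score classifier j gives to class k.
    The weights are hypothetical here; they would have to be tuned.
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    n_classes = len(score_lists[0])
    fused = [0.0] * n_classes
    for scores, w in zip(score_lists, weights):
        if len(scores) != n_classes:
            raise ValueError("all classifiers must score the same classes")
        for k, s in enumerate(scores):
            fused[k] += w * s
    return fused
```

In use, each descriptor vector would be fed to its own classifier (support vector machines in the study), and the resulting per-class scores would be passed to `weighted_sum_fusion`; the predicted class is the one with the highest fused score.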
Copyright 2010 Elsevier Ltd. All rights reserved.