High performance set of PseAAC and sequence based descriptors for protein classification

Loris Nanni¹, Sheryl Brahnam, Alessandra Lumini

PMID: 20558184
DOI: 10.1016/j.jtbi.2010.06.006

Loris Nanni et al. J Theor Biol. 2010 Sep 7;266(1):1-10.

doi: 10.1016/j.jtbi.2010.06.006. Epub 2010 Jun 15.

Affiliation

¹ Department of Electronic, Informatics and Systems (DEIS), Università di Bologna, Via Venezia 52, 47023 Cesena, Italy. loris.nanni@unibo.it

Abstract

The study of reliable automatic systems for protein classification is important for several domains, including the discovery of novel drugs and vaccines. The last decade has seen a number of advances in the development of reliable systems for classifying proteins. Of particular interest has been the exploration of new methods for extracting features from a protein that enhance classification for a given problem. Most methods developed to date, however, have been evaluated in only one or two application areas; methods that generalize well across a number of application areas and datasets have not been explored. The aim of this study is to find a general method, or an ensemble of methods, that works well on different protein classification datasets and problems. Towards this end, we evaluate several feature extraction approaches for representing proteins starting from their amino acid sequence, as well as different feature descriptor combinations, using an ensemble of classifiers (support vector machines). In our experiments, more than ten different protein descriptors are compared using nine different datasets. We develop our system using a blind testing protocol, where the parameters of the system are optimized using one dataset and then validated using the other datasets (and so on for each dataset). Although different stand-alone classifiers work well on some datasets and not on others, we have discovered that fusion among different methods achieves good performance across all the tested datasets, especially when using the weighted sum rule. Our feature descriptor combinations include two new descriptors, one based on wavelets and the other based on amino acid groups. Using our system, both outperform their standard implementations. We also consider as a baseline the simple amino acid composition (AC) and dipeptide composition (2G), since they have been widely used for protein classification. Our proposed method outperforms AC and 2G.
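
The two baseline descriptors named in the abstract, amino acid composition (AC) and dipeptide composition (2G), together with weighted sum rule fusion, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the choice to ignore non-standard residues, and the example weights are assumptions made for illustration; the abstract does not state how scores are normalized or how the weights are chosen.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Map each ordered residue pair (e.g. "AC") to a fixed position 0..399.
DIPEPTIDE_INDEX = {
    a + b: i for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))
}


def amino_acid_composition(seq):
    """AC: 20 relative frequencies, one per standard residue.

    Assumption: non-standard symbols (X, B, Z, ...) are simply ignored.
    """
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def dipeptide_composition(seq):
    """2G: 400 relative frequencies, one per ordered pair of adjacent residues.

    Assumption: pairs containing a non-standard symbol are skipped.
    """
    seq = seq.upper()
    counts = [0] * len(DIPEPTIDE_INDEX)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in DIPEPTIDE_INDEX:
            counts[DIPEPTIDE_INDEX[pair]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def weighted_sum_rule(scores, weights):
    """Fuse classifier outputs by a weighted sum.

    scores[k][j] is the score that classifier k assigns to sample j
    (for example an SVM decision value); weights[k] is its weight.
    Returns one fused score per sample.
    """
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    n_samples = len(scores[0])
    if any(len(s) != n_samples for s in scores):
        raise ValueError("all classifiers must score the same samples")
    return [
        sum(w * s[j] for w, s in zip(weights, scores))
        for j in range(n_samples)
    ]
```

In a pipeline of this kind, each descriptor would typically feed its own classifier, and the per-classifier scores would be brought to a comparable scale before being fused; both steps are assumptions here and are left out of the sketch.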

Cited by

Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation.
Xu R, Zhou J, Wang H, He Y, Wang X, Liu B. BMC Syst Biol. 2015;9 Suppl 1(Suppl 1):S10. doi: 10.1186/1752-0509-9-S1-S10. Epub 2015 Feb 6. PMID: 25708928 Free PMC article.
PCVMZM: Using the Probabilistic Classification Vector Machines Model Combined with a Zernike Moments Descriptor to Predict Protein-Protein Interactions from Protein Sequences.
Wang Y, You Z, Li X, Chen X, Jiang T, Zhang J. Int J Mol Sci. 2017 May 11;18(5):1029. doi: 10.3390/ijms18051029. PMID: 28492483 Free PMC article.
Prediction of viral oncoproteins through the combination of generative adversarial networks and machine learning techniques.
Beltrán JF, Herrera-Belén L, Yáñez AJ, Jimenez L. Sci Rep. 2024 Nov 7;14(1):27108. doi: 10.1038/s41598-024-77028-y. PMID: 39511292 Free PMC article.
Consistency and variation of protein subcellular location annotations.
Xu YY, Zhou H, Murphy RF, Shen HB. Proteins. 2021 Feb;89(2):242-250. doi: 10.1002/prot.26010. Epub 2020 Sep 26. PMID: 32935893 Free PMC article.
Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins.
Chen D, Zhang H, Chen Z, Xie B, Wang Y. Comput Math Methods Med. 2022 Jun 28;2022:5847242. doi: 10.1155/2022/5847242. eCollection 2022. PMID: 35799660 Free PMC article.
