Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning
- PMID: 24266945
- PMCID: PMC3851450
- DOI: 10.1186/1471-2105-14-S14-S7
Abstract
Background: Plastids are essential organelles of plant cells and are also present in algae. They are the site of manufacture and storage of chemical compounds used by the cell, and they are involved in photosynthesis (through the pigments they contain), starch synthesis and storage, cell color, and other functions. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day, and the predicted proteomes of these genomes need to be annotated at a faster pace. One such annotation need is an automated system that can accurately distinguish plastid from non-plastid proteins and further classify plastid types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid proteins and found significant differences, which were used as a basis for developing various feature-based prediction models using similarity search and machine learning.
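As an illustration of the composition comparison described above, the following minimal sketch computes the percentage amino acid composition of a sequence and averages it over a set of sequences. The function names and the choice to ignore non-standard residues are assumptions made for this example; the abstract does not specify how the authors computed or compared compositions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in one sequence.

    Assumption: residues outside the 20 standard letters are ignored.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def mean_composition(sequences):
    """Average the per-sequence compositions over a set of sequences."""
    comps = [amino_acid_composition(seq) for seq in sequences]
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}
```

Calling `mean_composition` once on a plastid set and once on a non-plastid set gives two 20-value profiles that can be compared residue by residue.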
Results: In this study, we developed separate Support Vector Machine (SVM) classifiers that characterize plastid proteins in two steps: first distinguishing plastid from non-plastid proteins, and then classifying the identified plastid proteins into their various types based on function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features were used to develop the SVM models: amino acid composition, dipeptide composition, pseudo amino acid composition, N(terminal)-Center-C(terminal) composition, and protein physicochemical properties. Overall, the dipeptide composition-based module shows the best performance, with an accuracy of 86.80% and a Matthews Correlation Coefficient (MCC) of 0.74 in phase-I, and an accuracy of 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better, with overall accuracies of 76.58% and 74.97% in phase-I and phase-II, respectively. The similarity-based PSI-BLAST module shows very low performance, with about 50% prediction accuracy for distinguishing plastid from non-plastid proteins and only 20% for classifying the various plastid types, indicating the need for and importance of machine learning algorithms.
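To make the best-performing feature and the reported metric concrete, here is a minimal sketch of a 400-dimensional dipeptide composition vector and of the Matthews Correlation Coefficient computed from binary confusion-matrix counts. The function names, the percentage scaling, and the handling of non-standard residues are assumptions for illustration; the abstract does not give the authors' exact encoding, and the SVM training step is not shown.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-value vector of dipeptide percentages.

    Assumption: overlapping adjacent pairs are counted, and pairs that
    contain a non-standard residue are skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A vector produced by `dipeptide_composition` could serve as the input row for any SVM implementation, and `mcc` ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).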
Conclusion: The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred, at http://bioinfo.okstate.edu/PLpred/ for real-time identification and characterization. We believe this tool will be very useful in the functional annotation of various genomes.