BMC Bioinformatics. 2013;14 Suppl 14(Suppl 14):S7.
doi: 10.1186/1471-2105-14-S14-S7. Epub 2013 Oct 9.

Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning


Rakesh Kaundal et al. BMC Bioinformatics. 2013.

Abstract

Background: Plastids are an important component of plant cells. They are the site of manufacture and storage of chemical compounds used by the cell, and they contain pigments such as those used in photosynthesis and cell coloration, as well as supporting starch synthesis and storage. They are essential organelles of the plant cell and are also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day, and the predicted proteomes of these genomes need to be annotated at a faster pace. One such annotation need is an automated system that can accurately distinguish plastid from non-plastid proteins and further classify plastid types by their function. We compared the amino acid compositions of plastid proteins with those of non-plastid proteins and found significant differences, which we used as the basis for developing several feature-based prediction models using similarity search and machine learning.
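As an illustration of the kind of feature compared above, the following is a minimal sketch of amino acid composition: the fraction of each of the 20 standard residues in a sequence. The function name, the residue ordering, and the choice to ignore non-standard characters are assumptions made for this example; the abstract does not describe the authors' implementation.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in alphabetical one-letter order (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return a 20-element vector of residue fractions, in AMINO_ACIDS order.

    Characters outside the 20 standard residues (e.g. X, B, Z) are ignored,
    which is a simplification chosen for this sketch.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Toy sequence, not real data.
    vector = amino_acid_composition("AACD")
    print(dict(zip(AMINO_ACIDS, vector)))
```

Comparing such vectors averaged over two protein sets (for example plastid and non-plastid) gives the per-residue differences that a bar graph of composition would display.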

Results: In this study, we developed separate Support Vector Machine (SVM) classifiers that characterize plastid proteins in two steps: first distinguishing plastid from non-plastid proteins (phase-I), and then classifying the identified plastid proteins into types based on their function (chloroplast, chromoplast, etioplast, and amyloplast; phase-II). Five diverse protein features were used to develop the SVM models: amino acid composition, dipeptide composition, pseudo amino acid composition, N(terminal)-Center-C(terminal) composition, and protein physicochemical properties. Overall, the dipeptide composition-based module shows the best performance, with an accuracy of 86.80% and a Matthews Correlation Coefficient (MCC) of 0.74 in phase-I, and an accuracy of 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better, with overall accuracies of 76.58% and 74.97% in phase-I and phase-II, respectively. The similarity-based PSI-BLAST module performs very poorly, with about 50% prediction accuracy for distinguishing plastid from non-plastid proteins and only 20% for classifying plastid types, indicating the need for and importance of machine learning algorithms.
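To make two of the quantities named above concrete, here is a minimal sketch of a dipeptide composition vector (400 fractions, one per ordered residue pair) and of the Matthews Correlation Coefficient for a binary confusion matrix. It uses the standard textbook definitions; the function names, the residue ordering, and the handling of non-standard residues are assumptions for illustration and are not taken from the paper.

```python
import math
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence: str) -> List[float]:
    """Return a 400-element vector of overlapping dipeptide fractions.

    Pairs containing a non-standard residue are skipped (an assumption
    made for this sketch).
    """
    seq = sequence.upper()
    vector = [0.0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - 1):
        idx = DIPEPTIDE_INDEX.get(seq[i:i + 2])
        if idx is not None:
            vector[idx] += 1.0
            total += 1
    if total:
        vector = [v / total for v in vector]
    return vector


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy inputs, not data from the study.
    vec = dipeptide_composition("ACAC")
    print(vec[DIPEPTIDE_INDEX["AC"]], vec[DIPEPTIDE_INDEX["CA"]])
    print(mcc(tp=5, tn=5, fp=0, fn=0))
```

A vector like this could serve as the input to an SVM classifier. MCC ranges from -1 to 1 and, unlike accuracy, accounts for all four confusion-matrix cells, which is why it is often reported alongside accuracy for unbalanced classes.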

Conclusion: The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred, at http://bioinfo.okstate.edu/PLpred/ for real-time identification and characterization. We believe this tool will be very useful in the functional annotation of various genomes.


Figures

Figure 1
Plastid and its various types with their respective organelle function.
Figure 2
A comparative bar graph of amino acid composition differences between plastid and non-plastid proteins.
Figure 3
ROC curves for all five classifiers (AAC, PseAAC, DIPEP, NCC, PhysicoChem) in phase-I prediction, i.e. identification of plastid vs. non-plastid proteins. AUC = Area Under Curve, AAC = amino acid composition, PseAAC = pseudo amino acid composition, DIPEP = dipeptide composition, NCC = N-terminal-Center-C-terminal composition, PhysicoChem = protein physicochemical properties.
Figure 4
A comparative bar graph of amino acid composition differences among the various plastid types: amyloplast, chromoplast, chloroplast and etioplast proteins.
Figure 5
ROC curves for the best classifier (Dipeptide composition-based) in phase-II prediction, i.e. classification of various plastid types (chloroplast, chromoplast, etioplast, amyloplast). Values in parentheses represent Area Under Curve (AUC).
