BMC Genomics. 2008;9 Suppl 1(Suppl 1):S16.
doi: 10.1186/1471-2164-9-S1-S16.

Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition

Tanwir Habib et al. BMC Genomics. 2008.

Abstract

Background: Knowing where a protein occurs in the cell is an important step in understanding its function. It is highly desirable to predict a protein's subcellular location automatically from its sequence. The most studied approaches to predicting subcellular localization rely on signal peptides, sequence homology, and the correlation between location and total amino acid composition. Taking both amino acid composition and amino acid pair composition into consideration helps improve prediction accuracy.

Results: We constructed a dataset of protein sequences from the SWISS-PROT database and divided them into 12 classes based on their subcellular locations. SVM modules were trained to predict the subcellular location from amino acid composition and amino acid pair composition. Results were calculated using 10-fold cross-validation. The Radial Basis Function (RBF) kernel outperformed the polynomial and linear kernels. Total prediction accuracy reached 71.8% for amino acid composition and 77.0% for amino acid pair composition. To observe the impact of the number of subcellular locations, we constructed two more datasets with nine and five subcellular locations. Total accuracy improved further to 79.9% and 85.66%, respectively.

Conclusions: We present a new SVM-based approach that uses amino acid and amino acid pair composition. The results show that data simulation and taking more protein features into consideration improve the accuracy to a great extent. We also observed that the dataset needs to be constructed with the distribution of data across all classes taken into account.
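
The two feature types named in the abstract can be illustrated with a minimal sketch. It assumes that the "420 features" mentioned in the figure captions are the 20 single amino acid frequencies plus the 400 ordered adjacent-pair frequencies; the abstract does not spell this out, and the function name `composition_features` is hypothetical, not taken from the paper.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, e.g. "AA", "AC", ..., "YY".
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def composition_features(sequence):
    """Return a 420-element list: 20 residue frequencies, then 400 pair frequencies.

    Assumption: pairs are adjacent residues (i, i+1). Non-standard letters
    are dropped before counting, which is a simplification of this sketch.
    """
    seq = [c for c in sequence.upper() if c in AMINO_ACIDS]
    if not seq:
        raise ValueError("sequence contains no standard amino acids")

    single = {a: 0 for a in AMINO_ACIDS}
    for c in seq:
        single[c] += 1

    pair = {p: 0 for p in PAIRS}
    for a, b in zip(seq, seq[1:]):
        pair[a + b] += 1

    n = len(seq)
    n_pairs = max(n - 1, 1)  # avoid division by zero for length-1 sequences
    return [single[a] / n for a in AMINO_ACIDS] + [pair[p] / n_pairs for p in PAIRS]


features = composition_features("MKVLAAGIVGLLLAQ")
print(len(features))  # 420
```

A vector like this could then be passed to any SVM implementation with an RBF kernel and evaluated with 10-fold cross-validation, as the abstract describes; the kernel parameters used by the authors are not given here.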


Figures

Figure 1
Bar chart comparing mean true negative fraction (TNf), mean true positive fraction (TPf) and mean total accuracy (TA) for the 420-feature dataset with SVM using RBF, polynomial and linear kernels.
Figure 2
Bar chart displaying total accuracy for the 420-feature dataset with SVM using the RBF kernel.
Figure 3
Bar chart comparing mean true negative fraction (TNf), mean true positive fraction (TPf) and mean total accuracy (TA) for the 420-feature and 20-feature datasets with SVM using the RBF kernel.
Figure 4
Bar chart comparing TNf, TPf and TA for the 420-feature dataset using 12, 9 and 5 subcellular locations. SVM is used with the polynomial kernel.
Figure 5
Bar chart comparing TNf, TPf and TA for the 420-feature dataset with balanced and unbalanced data. SVM is used with the polynomial kernel.
Figure 6
ROC curve analysis for the RBF, polynomial and linear kernels.
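
The three quantities named in the figure captions (TPf, TNf and TA) can be sketched as follows. The definitions used here are assumptions, since this page does not state them: TPf and TNf are computed per class in a one-vs-rest fashion as TP/(TP+FN) and TN/(TN+FP), and TA is the fraction of all proteins assigned to the correct class. The function names and the location labels in the example are hypothetical.

```python
def per_class_fractions(y_true, y_pred, label):
    """Return (TPf, TNf) for one class, treating all other classes as negative."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == label:
            if p == label:
                tp += 1
            else:
                fn += 1
        else:
            if p == label:
                fp += 1
            else:
                tn += 1
    # Guard against classes with no positives or no negatives.
    tpf = tp / (tp + fn) if (tp + fn) else 0.0
    tnf = tn / (tn + fp) if (tn + fp) else 0.0
    return tpf, tnf


def total_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels for illustration only; these are not data from the paper.
y_true = ["nucleus", "nucleus", "cytoplasm", "mitochondrion"]
y_pred = ["nucleus", "cytoplasm", "cytoplasm", "mitochondrion"]
print(per_class_fractions(y_true, y_pred, "nucleus"))  # (0.5, 1.0)
print(total_accuracy(y_true, y_pred))                  # 0.75
```

Averaging the per-class values over all classes would give the "mean" TPf and TNf that the captions refer to, under the same assumptions.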
