Protein Sci. 2005 Nov;14(11):2804-13.
doi: 10.1110/ps.051597405.

A novel representation of protein sequences for prediction of subcellular location using support vector machines

Setsuro Matsuda et al. Protein Sci. 2005 Nov.

Abstract

As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location.
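The local-composition part of this representation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give region lengths, so `d_n`, `d_c`, and the equal-length split of the N-terminal part are hypothetical, and "twin amino acids" is read here as two identical adjacent residues, which is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def split_sequence(seq, d_n=10, n_regions=4, d_c=20):
    """Split seq into N-terminal regions, a middle part, and a C-terminal part.

    d_n, n_regions, and d_c are hypothetical values chosen for illustration.
    """
    n_len = d_n * n_regions
    if len(seq) < n_len + d_c:
        raise ValueError("sequence too short for the chosen region lengths")
    n_parts = [seq[i * d_n:(i + 1) * d_n] for i in range(n_regions)]
    middle = seq[n_len:len(seq) - d_c]
    c_term = seq[len(seq) - d_c:]
    return n_parts, middle, c_term


def composition(region):
    """Fraction of each amino acid in the region (zeros for an empty region)."""
    counts = Counter(region)
    total = len(region)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def twin_composition(region):
    """Fraction of adjacent pairs that are two identical residues, per amino acid.

    Assumption: a "twin" is a pair such as "AA" or "LL".
    """
    pairs = len(region) - 1
    counts = Counter(a for a, b in zip(region, region[1:]) if a == b)
    return [counts[a] / pairs if pairs > 0 else 0.0 for a in AMINO_ACIDS]


def local_features(seq):
    """Concatenate composition and twin composition over all regions."""
    n_parts, middle, c_term = split_sequence(seq)
    features = []
    for region in [*n_parts, middle, c_term]:
        features.extend(composition(region))
        features.extend(twin_composition(region))
    return features
```

With four N-terminal regions plus the middle and C-terminal parts, this sketch yields 6 × 40 = 240 values per sequence, which could then be passed to any SVM implementation. The distance-frequency features mentioned in the abstract are not included here.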

Figures

Figure 1.
Distance frequencies of three hydrophobic amino acids (L, I, and V) for 75 protein sequences containing the NES (with NES). The dotted line shows the distance frequencies for the 75 sequences with their NES removed (without NES). Each value of the frequency was divided by sequence length and averaged over the 75 sequences.
Figure 2.
Distance frequencies of basic amino acids in the N-terminal part (A), basic amino acids in the middle part (B), hydrophobic amino acids in the middle part (C), and other amino acids in the middle part (D), for the TargetP plant proteins. Each value of the frequency was divided by sequence length and averaged over all sequences belonging to each subcellular location. The X-axis is common to the four panels. cTP, mTP, SP, and “other” indicate proteins destined for chloroplast, mitochondria, secretory pathway, and other locations (nucleus and cytosol), respectively.
Figure 3.
Feature weights of the SVMs specifically for SP (A) and “other” (B) on the TargetP plant data set. Feature number j of the X-axis corresponds to the j-th component of a feature vector. The capital letters represent amino acids and the superscripts indicate a region in a protein sequence. Refer to the definitions of the regions in Figure 4. h1(M) represents the distance frequency of other amino acids in the middle part.
Figure 4.
Definitions of the N-terminal, middle, and C-terminal parts depending on sequence length L. dN represents the length of a region in the N-terminal part (in gray). dC is the length of the C-terminal part (in black).
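
The distance frequencies plotted in Figures 1 and 2 can be sketched along these lines, assuming that "distance" means the difference in position between successive residues of a given class, that each count is divided by sequence length, and that the result is averaged over sequences, as the captions describe. The function names and the `max_distance` cutoff are hypothetical, and only the L, I, V grouping is taken from the Figure 1 caption.

```python
from collections import Counter

HYDROPHOBIC = set("LIV")  # the three residues named in the Figure 1 caption


def distance_frequencies(seq, residues, max_distance=10):
    """Count distances between successive residues of a class, divided by length.

    max_distance is a hypothetical cutoff; larger distances are ignored.
    """
    positions = [i for i, aa in enumerate(seq) if aa in residues]
    counts = Counter(b - a for a, b in zip(positions, positions[1:]))
    length = len(seq)
    return [counts[d] / length if length else 0.0
            for d in range(1, max_distance + 1)]


def mean_distance_frequencies(seqs, residues, max_distance=10):
    """Average the length-normalized distance frequencies over several sequences."""
    profiles = [distance_frequencies(s, residues, max_distance) for s in seqs]
    if not profiles:
        return [0.0] * max_distance
    return [sum(col) / len(profiles) for col in zip(*profiles)]
```

For example, `distance_frequencies("LALAAV", HYDROPHOBIC, max_distance=4)` finds hydrophobic residues at positions 0, 2, and 5, so distances 2 and 3 each occur once in a sequence of length 6.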
