Molecules. 2019 Mar 6;24(5):919.
doi: 10.3390/molecules24050919.

Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features

Bo Li et al. Molecules.

Abstract

The prediction of protein subcellular localization is critical for inferring protein function, gene regulation and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which makes it possible to predict yeast protein subcellular localization computationally. However, widely used protein sequence representations, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), have difficulty capturing adequate information about the interactions between residues and the positional distribution of each residue. There is therefore still a need for novel sequence representations. In this study, we present two novel protein sequence representations: the Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of the residues in the protein primary sequence, and novel statistics and information theory (NSI) features, which reflect local position information in the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, whereas other popular methods such as PseAAC and dipeptide composition use features with hundreds of dimensions. In practice, this feature representation is highly efficient in predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 on the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view feature-based method that combines the two features above with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This model achieves prediction accuracies of 0.927 and 0.871 on the CL317 and ZW225 datasets, respectively, better than other existing methods in jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from published articles in authoritative journals and books.
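To make the baseline ideas in the abstract concrete, the following is a minimal Python sketch of two of the conventional representations it names (amino acid composition and dipeptide composition), a simple fusion of the two views by concatenation, and a jackknife (leave-one-out) accuracy estimate. It is an illustration under stated assumptions, not the authors' method: GCGR and NSI are not defined in the abstract and are not reproduced here, the function names are hypothetical, the toy sequences and labels are invented for the example, and a nearest-centroid rule stands in for the support vector machine so that the sketch needs only the standard library.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """20-dimensional vector of relative residue frequencies."""
    seq = seq.upper()
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for residue in seq:
        if residue in counts:  # skip non-standard symbols such as X
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector of relative frequencies of adjacent residue pairs."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in DIPEPTIDES]


def fuse_views(seq):
    """Assumed fusion scheme: plain concatenation of the two feature views."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def jackknife_accuracy(features, labels):
    """Leave-one-out test with a nearest-centroid rule (a stand-in classifier)."""
    correct = 0
    for i, held_out in enumerate(features):
        sums, sizes = {}, {}
        for j, (x, y) in enumerate(zip(features, labels)):
            if j == i:
                continue  # the held-out sample never contributes to a centroid
            if y not in sums:
                sums[y] = [0.0] * len(x)
                sizes[y] = 0
            sums[y] = [s + v for s, v in zip(sums[y], x)]
            sizes[y] += 1
        centroids = {y: [s / sizes[y] for s in sums[y]] for y in sums}
        predicted = min(centroids, key=lambda y: squared_distance(held_out, centroids[y]))
        correct += predicted == labels[i]
    return correct / len(features)


# Toy data, invented purely for illustration (not from CL317 or ZW225).
toy_sequences = ["AAAAKKKK", "AAAKKKKA", "WWWWYYYY", "WYWYWYWW"]
toy_labels = ["loc1", "loc1", "loc2", "loc2"]
toy_features = [fuse_views(s) for s in toy_sequences]
toy_accuracy = jackknife_accuracy(toy_features, toy_labels)
print(len(toy_features[0]), toy_accuracy)
```

The fused vector here has 420 dimensions, which illustrates the contrast the abstract draws with a 5-dimensional GCGR + NSI vector. In the jackknife loop each sequence is predicted by a model built from all the others, which is the evaluation protocol the abstract refers to.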

Keywords: generalized chaos game representation; protein primary sequence; protein subcellular localization; statistical method; support vector machine; unitary distance.

Conflict of interest statement

The authors confirm that this article content has no conflict of interest.

Figures

Figure 1
The prediction results based on CL317 using the support vector machine algorithm with different combinations of features.
Figure 2
The GCGRs of the primary sequences of proteins from six subcellular locations. GCGR: Generalized Chaos Game Representation.
Figure 3
Six time series that represent the first three GCGRs in Figure 2. Each panel in Figure 2 gives rise to two time series.
Figure 4
Six time series that represent the last three GCGRs in Figure 2. Each panel in Figure 2 gives rise to two time series.
Figure 5
The boxplots for the x̄ and ȳ of all the proteins in dataset CL317, grouped into the six subcellular locations.
