Identification of sub-Golgi protein localization by use of deep representation learning features

Zhibin Lv¹, Pingping Wang², Quan Zou¹,³,⁴, Qinghua Jiang²

Affiliations

¹ Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
² Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China.
³ Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.
⁴ Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China.

PMID: 33367627
PMCID: PMC8023683
DOI: 10.1093/bioinformatics/btaa1074

Zhibin Lv et al. Bioinformatics. 2021 Apr 5;36(24):5600-5609.

Abstract

Motivation: The Golgi apparatus has a key functional role in protein biosynthesis within the eukaryotic cell, and its malfunction results in various neurodegenerative diseases. For a better understanding of the Golgi apparatus, it is essential to identify sub-Golgi protein localization. Although some machine learning methods have been used to identify sub-Golgi localization proteins by sequence representation fusion, more accurate sub-Golgi protein identification remains challenging with existing methodology.

Results: We developed a protein sub-Golgi localization identification protocol using deep representation learning features with 107 dimensions. With this protocol, we demonstrated that, instead of the multi-type protein sequence feature representation fusion used in previous state-of-the-art sub-Golgi protein localization classifiers, it is sufficient to exploit only one type of feature representation for more accurate identification of sub-Golgi proteins. According to independent testing results on benchmark datasets, our protocol performs generally, reliably and robustly for sub-Golgi protein localization prediction.

Availability and implementation: A user-friendly webserver is freely accessible at http://isGP-DRLF.aibiochem.net and the prediction code is accessible at https://github.com/zhibinlv/isGP-DRLF.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Modeling overview. The Golgi protein sequence is first converted into 1900-dimensional (1900 D) features by the deep representation learning model UniRep. The 1900 D features are then either fed directly into ten classifiers, or filtered by LGBM feature selection down to 250-dimensional vectors, which are then fed into the ten classifiers with or without SMOTE. In the next step, the top two classifiers are selected for further optimization with LGBM, ANOVA and MRMD feature selection. Finally, the optimal model (SVM) is used in the isGP-DRLF webserver.

**Fig. 2.**
Boxplots of 10-fold cross-validation accuracy and ROC curves for ten classifiers (LR: Logistic Regression, KNN: K-nearest Neighbors, DT: Decision Tree, NB: Gaussian Naive Bayes, Bagging: Bagging, RF: Random Forest, AB: Ada Boosting, LGBM: Light Gradient Boosting Machine, SVM: Support Vector Machine, LDA: Linear Discriminant Analysis) using different feature processing technologies. A and B used UniRep feature vectors with 1900 dimensions; C and D used SMOTE to balance the UniRep feature vectors with 1900 dimensions; for E and F, building on the previous steps, 250 features were selected using the LGBM feature selection method. Green triangles and orange lines in A, C and E are the average and median accuracy values, respectively, for the 10-fold cross-validation. In each case, the SVM classifier had the highest average accuracy (77.32%, 90.31% and 90.76%, respectively) and the highest average auROC value (0.765, 0.940 and 0.958, respectively).
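The class balancing step named in panels C and D can be illustrated with a minimal sketch of SMOTE (Synthetic Minority Over-sampling Technique), which creates new minority-class points by interpolating between a minority sample and one of its nearest minority-class neighbours. This is not the authors' code: the function name `smote`, its parameters and the toy 2-D data are assumptions made for illustration, standing in for the 1900-dimensional UniRep vectors. In practice a library implementation (for example the one in imbalanced-learn) would normally be used instead.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base sample within the minority class
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D stand-in for the feature vectors (made-up values, not paper data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority_size = 10  # hypothetical size of the larger class

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_new=majority_size - len(minority), k=2)
balanced_minority = minority + new_points
```

Oversampling of this kind is usually applied to the training folds only, so that synthetic points derived from a sample never appear in the fold used to evaluate it.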

**Fig. 3.**
(A) Based on benchmark dataset D3, the average 10-fold cross-validation accuracy varied with the number of features for the LGBM and SVM classifiers under ANOVA, MRMD and LGBM feature selection. The best SVM had an accuracy of 92.16% with 158 features. The best LGBM classifier had an accuracy of 93.08% with 64 features. Both were based on LGBM feature selection. (B) Ten-fold cross-validation and LOO metrics for comparison of the best SVM (based on benchmark datasets D3 and D5) and the best LGBM classifier (based on benchmark dataset D3). (C) Independent test metrics on benchmark testing dataset D4 for the best SVM and LGBM classifiers obtained by LOO using benchmark datasets D3 and D5.
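Panel A sweeps the number of retained features after ranking them. As a hedged illustration of one of the three ranking criteria named there, the sketch below scores each feature with a one-way ANOVA F statistic and keeps the top-ranked columns. It is not the authors' implementation: the helper names `anova_f`, `rank_features` and `top_k` and the toy data are assumptions, and the classifier that would be cross-validated at each feature count is omitted.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Mean square between classes and mean square within classes
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Feature indices sorted from most to least discriminative."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def top_k(X, order, k):
    """Keep only the k best-ranked feature columns of X."""
    return [[row[j] for j in order[:k]] for row in X]


# Toy data (made-up values): feature 1 separates the classes, feature 0 is noise.
X = [
    [1.0, 0.1, 1.0],
    [2.0, 0.2, 1.5],
    [3.0, 0.3, 2.0],
    [1.1, 5.1, 2.0],
    [2.1, 5.2, 2.5],
    [2.9, 5.3, 3.0],
]
y = [0, 0, 0, 1, 1, 1]

order = rank_features(X, y)
X_top2 = top_k(X, order, 2)
```

A sweep like the one in the figure would repeat `top_k` for increasing `k`, cross-validate a classifier on each reduced matrix, and keep the `k` with the highest average accuracy.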

**Fig. 4.**
Human sub-Golgi proteome sequence distribution and the results of isGP-DRLF and suGolgi2 tested on the human sub-Golgi proteome dataset.

Cited by

Wang C, Zou Q. Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE. BMC Biol. 2023 Jan 24;21(1):12. doi: 10.1186/s12915-023-01510-8. PMID: 36694239.
Wang X, Li F, Qiu W, Xu B, Li Y, Lian X, Yu H, Zhang Z, Wang J, Li Z, Xue W, Zhu F. SYNBIP: synthetic binding proteins for research, diagnosis and therapy. Nucleic Acids Res. 2022 Jan 7;50(D1):D560-D570. doi: 10.1093/nar/gkab926. PMID: 34664670.
Zhang S, Zhao Y, Liang Y. AACFlow: an end-to-end model based on attention augmented convolutional neural network and flow-attention mechanism for identification of anticancer peptides. Bioinformatics. 2024 Mar 4;40(3):btae142. doi: 10.1093/bioinformatics/btae142. PMID: 38452348.
Jiao S, Zou Q. Identification of plant vacuole proteins by exploiting deep representation learning features. Comput Struct Biotechnol J. 2022 Jun 8;20:2921-2927. doi: 10.1016/j.csbj.2022.06.002. PMID: 35765653.
Zhao Q, Ma J, Xie F, Wang Y, Zhang Y, Li H, Sun Y, Wang L, Guo M, Han K. Recent Advances in Predicting Protein S-Nitrosylation Sites. Biomed Res Int. 2021 Feb 9;2021:5542224. doi: 10.1155/2021/5542224. PMID: 33628788. Review.

