Bioinformatics. 2021 Apr 5;36(24):5600-5609.
doi: 10.1093/bioinformatics/btaa1074.

Identification of sub-Golgi protein localization by use of deep representation learning features

Zhibin Lv et al. Bioinformatics.

Abstract

Motivation: The Golgi apparatus has a key functional role in protein biosynthesis within the eukaryotic cell, and its malfunction results in various neurodegenerative diseases. For a better understanding of the Golgi apparatus, it is essential to identify sub-Golgi protein localization. Although some machine learning methods have been used to identify sub-Golgi localization proteins by sequence representation fusion, accurate sub-Golgi protein identification remains challenging with existing methodology.

Results: We developed a protein sub-Golgi localization identification protocol using deep representation learning features with 107 dimensions. With this protocol, we demonstrated that, instead of fusing multiple types of protein sequence feature representation as in previous state-of-the-art sub-Golgi protein localization classifiers, it is sufficient to exploit only one type of feature representation to identify sub-Golgi proteins more accurately. Judged by independent testing results on benchmark datasets, our protocol performs generally, reliably and robustly for sub-Golgi protein localization prediction.

Availability and implementation: A user-friendly webserver is freely accessible at http://isGP-DRLF.aibiochem.net and the prediction code is accessible at https://github.com/zhibinlv/isGP-DRLF.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Modeling overview. The Golgi protein sequence is first converted into 1900-dimensional (1900 D) features by the deep representation learning model UniRep. The 1900 D features are then either fed directly into ten classifiers, or filtered by LGBM feature selection down to 250-dimensional vectors, which are then fed into the ten classifiers with or without SMOTE. In the next step, the top two classifiers are selected for further optimization with LGBM, ANOVA and MRMD feature selection. Finally, the optimal model (SVM) is used in the isGP-DRLF webserver.
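The SMOTE step named in this caption balances the classes by synthesizing extra minority-class samples. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation: each synthetic vector is interpolated between a minority sample and one of its nearest minority neighbours. The function name `smote_oversample`, the parameter `k` and the toy 2-D vectors are assumptions made for illustration; the features described in the caption are 1900-dimensional UniRep vectors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_oversample(minority, n_new, k=2, seed=0):
    """Return n_new synthetic samples interpolated between minority neighbours.

    Hypothetical helper illustrating the SMOTE idea; not the paper's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find the k nearest neighbours of the base sample within the minority class.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: squared_distance(base, m))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (assumed data): the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region already occupied by that class. In a cross-validation setting, oversampling is normally applied to the training folds only, so that synthetic points do not leak into the held-out fold.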
Fig. 2.
Boxplots of ten-fold cross-validation accuracy and ROC curves for ten classifiers (LR: Logistic Regression, KNN: K-Nearest Neighbors, DT: Decision Tree, NB: Gaussian Naive Bayes, Bagging: Bagging, RF: Random Forest, AB: Ada Boosting, LGBM: Light Gradient Boosting Machine, SVM: Support Vector Machine, LDA: Linear Discriminant Analysis) using different feature processing techniques. A and B used UniRep feature vectors with 1900 dimensions; C and D used SMOTE to balance the 1900-dimensional UniRep feature vectors; for E and F, building on the previous steps, 250 features were selected with the LGBM feature selection method. Green triangles and orange lines in A, C and E mark the average and median accuracy values, respectively, for the 10-fold cross-validation. In all three cases, the SVM classifier had the highest average accuracy (77.32%, 90.31% and 90.76%, respectively) and the highest average auROC value (0.765, 0.940 and 0.958, respectively).
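The auROC values quoted in this caption summarize how well a classifier's scores rank positive samples above negative ones. As a rough illustration, auROC can be computed as the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as half. The function name `auroc` and the toy labels and scores below are assumptions for illustration and are not taken from the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels; scores: classifier scores,
    where a higher score means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example (assumed data): three of the four positive/negative pairs are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auroc(toy_labels, toy_scores)  # 3 correct pairs out of 4 -> 0.75
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; rank-based formulations give the same value more efficiently on large datasets.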
Fig. 3.
(A) Average 10-fold cross-validation accuracy on benchmark dataset D3 as a function of the number of features for the LGBM and SVM classifiers, using ANOVA, MRMD and LGBM feature selection. The best SVM classifier had an accuracy of 92.16% with 158 features. The best LGBM classifier had an accuracy of 93.08% with 64 features. Both were based on LGBM feature selection. (B) Ten-fold cross-validation and LOO metrics comparing the best SVM classifier (based on benchmark datasets D3 and D5) with the best LGBM classifier (based on benchmark dataset D3). (C) Independent test metrics on benchmark testing dataset D4 for the best SVM and LGBM classifiers obtained by LOO using benchmark datasets D3 and D5.
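ANOVA is one of the three feature selection methods named in this caption. The sketch below shows the generic idea under stated assumptions: compute a one-way ANOVA F-statistic per feature (between-class variance over within-class variance) and keep the highest-scoring features. The function names `anova_f_score` and `rank_features` and the toy data are hypothetical, and the authors' actual selection procedure may differ in detail.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n_total = len(values)
    n_groups = len(groups)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least two classes and more samples than classes")
    overall_mean = sum(values) / n_total
    between = 0.0  # variation of class means around the overall mean
    within = 0.0   # variation of samples around their own class mean
    for members in groups.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - overall_mean) ** 2
        within += sum((v - mean) ** 2 for v in members)
    ms_between = between / (n_groups - 1)
    ms_within = within / (n_total - n_groups)
    if ms_within == 0.0:
        return float("inf") if ms_between > 0.0 else 0.0
    return ms_between / ms_within


def rank_features(X, y):
    """Return (feature indices sorted by descending F-score, list of F-scores)."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data (assumed): feature 0 separates the classes well, feature 1 not at all.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.5],
    [0.8, 4.0, 3.0],
    [3.0, 4.0, 3.5],
    [3.2, 5.0, 3.0],
    [2.8, 3.0, 4.0],
]
y = [0, 0, 0, 1, 1, 1]
order, scores = rank_features(X, y)
top_two = order[:2]  # indices of the two most discriminative features
```

Sweeping the number of retained features and re-running cross-validation at each setting would produce an accuracy-versus-feature-count curve of the kind panel A describes.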
Fig. 4.
Human sub-Golgi proteome sequence distribution and the results of isGP-DRLF and suGolgi2 tested on the human sub-Golgi proteome dataset.
