GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank
- PMID: 29522145
- DOI: 10.1093/bioinformatics/bty130
Abstract
Motivation: Gene Ontology (GO) has been widely used to annotate the functions of proteins and to understand their biological roles. Currently, fewer than 1% of the more than 70 million proteins in UniProtKB have experimental GO annotations, which makes automated function prediction (AFP) of proteins a strong necessity. AFP is a hard multilabel classification problem because a single protein can carry a varying number of GO terms. Most of these proteins have only sequences as input information, which underlines the importance of sequence-based AFP (SAFP, in which sequences are the only input). Furthermore, homology-based SAFP tools are competitive in AFP competitions, but they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins that already have annotations. Thus, the vital and challenging problem now is how to develop a method for SAFP, particularly for difficult proteins.
Methods: The key to this method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs, and to integrate them into a predictor both effectively and efficiently. We propose GOLabeler, which integrates five component classifiers trained on different features (GO term frequency, sequence alignment, amino acid trigrams, domains and motifs, and biophysical properties) in the framework of learning to rank (LTR), a machine learning paradigm that is especially powerful for multilabel classification.
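As a rough illustration of the integration step, the sketch below treats each candidate GO term of a protein as an item described by five component-classifier scores and learns weights so that true terms rank above false ones. It is not GOLabeler's implementation: the simple pairwise logistic ranker, the order of the five scores, the toy values, and all names (`pairwise_train`, `rank_terms`, `term_A`, `term_B`) are assumptions made for illustration only.

```python
import math

# Assumed feature order (hypothetical): one score per component classifier,
# [GO term frequency, sequence alignment, trigram, domains/motifs, biophysical].

def pairwise_train(groups, epochs=200, lr=0.1):
    """Fit linear weights so that, within each protein, annotated GO terms
    score higher than non-annotated ones (pairwise logistic loss).

    groups: list of proteins; each protein is a list of (features, label).
    """
    dim = len(groups[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for terms in groups:
            pos = [f for f, y in terms if y == 1]
            neg = [f for f, y in terms if y == 0]
            for p in pos:
                for n in neg:
                    diff = [a - b for a, b in zip(p, n)]
                    margin = sum(wi * di for wi, di in zip(w, diff))
                    # Derivative of log(1 + exp(-margin)) w.r.t. margin.
                    g = -1.0 / (1.0 + math.exp(margin))
                    w = [wi - lr * g * di for wi, di in zip(w, diff)]
    return w

def rank_terms(w, candidates):
    """Return candidate term names sorted by descending combined score."""
    scored = [(sum(wi * xi for wi, xi in zip(w, feats)), name)
              for name, feats in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Toy training data (made up): two proteins with labeled candidate terms.
train = [
    [([0.9, 0.8, 0.6, 0.7, 0.5], 1),
     ([0.2, 0.1, 0.3, 0.2, 0.4], 0),
     ([0.8, 0.2, 0.2, 0.1, 0.5], 0)],
    [([0.3, 0.9, 0.7, 0.8, 0.6], 1),
     ([0.7, 0.1, 0.2, 0.2, 0.5], 0)],
]
weights = pairwise_train(train)

# A new protein: term_B is frequent overall but weakly supported otherwise.
candidates = {
    "term_A": [0.4, 0.9, 0.8, 0.7, 0.6],
    "term_B": [0.9, 0.1, 0.1, 0.2, 0.5],
}
ranking = rank_terms(weights, candidates)
print(ranking)
```

Ranking within each protein, rather than thresholding each GO term independently, fits the multilabel setting because the number of correct terms differs from protein to protein.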
Results: Extensive experiments on large-scale datasets revealed numerous favorable aspects of GOLabeler, including a significant performance advantage over state-of-the-art AFP methods.
Availability and implementation: http://datamining-iip.fudan.edu.cn/golabeler.
Supplementary information: Supplementary data are available at Bioinformatics online.
Similar articles
- NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 2019 Jul 2;47(W1):W379-W387. doi: 10.1093/nar/gkz388. PMID: 31106361. Free PMC article.
- DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation. Methods. 2018 Aug 1;145:82-90. doi: 10.1016/j.ymeth.2018.05.026. Epub 2018 Jun 6. PMID: 29883746.
- DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018 Feb 15;34(4):660-668. doi: 10.1093/bioinformatics/btx624. PMID: 29028931. Free PMC article.
- Automatic Gene Function Prediction in the 2020's. Genes (Basel). 2020 Oct 27;11(11):1264. doi: 10.3390/genes11111264. PMID: 33120976. Free PMC article. Review.
- Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets. BMC Bioinformatics. 2019 Sep 23;20(1):485. doi: 10.1186/s12859-019-3060-6. PMID: 31547800. Free PMC article. Review.
Cited by
- NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information. Nucleic Acids Res. 2021 Jul 2;49(W1):W469-W475. doi: 10.1093/nar/gkab398. PMID: 34038555. Free PMC article.
- GO2Sum: generating human-readable functional summary of proteins from GO terms. NPJ Syst Biol Appl. 2024 Mar 15;10(1):29. doi: 10.1038/s41540-024-00358-0. PMID: 38491038. Free PMC article.
- PFP-WGAN: Protein function prediction by discovering Gene Ontology term correlations with generative adversarial networks. PLoS One. 2021 Feb 25;16(2):e0244430. doi: 10.1371/journal.pone.0244430. eCollection 2021. PMID: 33630862. Free PMC article.
- GTPLM-GO: Enhancing Protein Function Prediction Through Dual-Branch Graph Transformer and Protein Language Model Fusing Sequence and Local-Global PPI Information. Int J Mol Sci. 2025 Apr 25;26(9):4088. doi: 10.3390/ijms26094088. PMID: 40362328. Free PMC article.
- ProtFun: A Protein Function Prediction Model Using Graph Attention Networks with a Protein Large Language Model. bioRxiv [Preprint]. 2025 May 17:2025.05.13.653854. doi: 10.1101/2025.05.13.653854. PMID: 40463264. Free PMC article. Preprint.