Genome Biol. 2024 Feb 1;25(1):41.

doi: 10.1186/s13059-024-03166-1.

AnnoPRO: a strategy for protein function annotation based on multi-scale protein representation and a hybrid deep learning of dual-path encoding

Lingyan Zheng^#^{1,2}, Shuiyang Shi^#¹, Mingkun Lu^#¹, Pan Fang^{2,3}, Ziqi Pan¹, Hongning Zhang¹, Zhimeng Zhou¹, Hanyu Zhang¹, Minjie Mou¹, Shijie Huang¹, Lin Tao⁴, Weiqi Xia⁵, Honglin Li⁶, Zhenyu Zeng^{2,3}, Shun Zhang^{2,3}, Yuzong Chen⁷, Zhaorong Li^{8,9}, Feng Zhu^{10,11,12}

Affiliations

¹ College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China.
² Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China.
³ Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China.
⁴ Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
⁵ Pharmaceutical Department, Zhejiang Provincial People's Hospital, Hangzhou, 310014, China.
⁶ School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
⁷ State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055, China.
⁸ Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China. zhaorong.lzr@alibaba-inc.com.
⁹ Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China. zhaorong.lzr@alibaba-inc.com.
¹⁰ College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China. zhufeng@zju.edu.cn.
¹¹ Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou, 330110, China. zhufeng@zju.edu.cn.
¹² Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China. zhufeng@zju.edu.cn.

^# Contributed equally.

PMID: 38303023
PMCID: PMC10832132
DOI: 10.1186/s13059-024-03166-1

Lingyan Zheng et al. Genome Biol. 2024.

Abstract

Protein function annotation is a longstanding challenge in the biological sciences, and various computational methods have been developed to address it. However, existing methods suffer from a serious long-tail problem: a large number of GO families contain few annotated proteins. To address this, a strategy named AnnoPRO was constructed that combines sequence-based multi-scale protein representation, dual-path protein encoding using pre-training, and function annotation by long short-term memory-based decoding. Case studies on different benchmarks confirmed the superior performance of AnnoPRO among available methods. Source code and models are freely available at https://github.com/idrblab/AnnoPRO and https://zenodo.org/records/10012272.

Keywords: LSTM; Long-tail problem; Pre-training; Protein function annotation; Protein representation.

Conflict of interest statement

P.F., Z.Y.Z., S.Z. and Z.R.L. are employed by Alibaba. The authors declare no competing interests.

Figures

**Fig. 1**
Average number of proteins (ANP) in the GO families of nine different levels (LEVEL 2 to LEVEL 10, as shown in Additional file 1: Fig. S3). ANP showed a clear descending trend from the top level (LEVEL 2) to the bottom one (LEVEL 10). Since the ANP of a family indicates its representativeness among all families, this figure shows that the representativeness of a family gradually decreases at deeper levels. The nine levels could therefore be classified into two groups based on their ANPs: the “*Head Label Levels*” (ANP of their GO families ≥ 2,000) and the “*Tail Label Levels*” (ANP of their GO families < 2,000). As shown, the total number of GO families in the “*Tail Label Levels*” (5,323) was more than 10 times larger than that in the “*Head Label Levels*” (459), and this data distribution induced a serious ‘*long-tail problem*’, as described in a previous pioneering publication [18].
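
The head/tail grouping described in this caption can be sketched as follows. This is a minimal illustration, assuming that the ANP of a level is the mean number of annotated proteins over its GO families. The function names and the toy counts are hypothetical; only the 2,000 cutoff comes from the caption.

```python
from statistics import mean

ANP_THRESHOLD = 2000  # cutoff stated in the Fig. 1 caption


def average_number_of_proteins(family_sizes):
    """Mean number of annotated proteins over the GO families of one level."""
    return mean(family_sizes)


def group_levels(level_to_family_sizes, threshold=ANP_THRESHOLD):
    """Split levels into head and tail groups by comparing their ANP to the cutoff."""
    groups = {"head": [], "tail": []}
    for level, sizes in sorted(level_to_family_sizes.items()):
        anp = average_number_of_proteins(sizes)
        groups["head" if anp >= threshold else "tail"].append(level)
    return groups


# Toy, made-up family sizes per level (not data from the paper).
toy_levels = {
    2: [9000, 4000, 2500],
    3: [3000, 2200, 1900],
    4: [1500, 300, 40],
    5: [200, 12, 3],
}
groups = group_levels(toy_levels)
```

With these toy counts, levels 2 and 3 fall in the head group and levels 4 and 5 in the tail group, mirroring the split the caption describes.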

**Fig. 2**
The hybrid deep learning framework of three consecutive modules (M1 to M3) adopted in this study. (M1) Sequence-based multi-scale protein representation, which converts all protein sequences to *feature similarity*-based images (*ProMAP*) and *protein similarity*-based vectors (*ProSIM*). (M2) Dual-path protein encoding based on pre-training. Using the *ProMAP* and *ProSIM* generated for all sequences, a dual-path encoding strategy was constructed based on a seven-channel *Convolutional Neural Network* (7C-CNN) and a *Deep Neural Network* of five fully-connected layers (5FC-DNN) to pre-train the features of all CAFA4 proteins by integrating their GO family annotation data. (M3) Functional annotation by LSTM-based decoding. The protein features pre-trained using the dual-path encoding layer in M2 were concatenated and then fed into a *long short-term memory recurrent neural network* (LSTM) to enable multi-label annotation of proteins to 6,109 functional GO families.

**Fig. 3**
A schematic illustration of the procedure used in this study for sequence-based multi-scale protein representation, showing how sequences were converted to a *feature similarity*-based image (*ProMAP*) and a *protein similarity*-based vector (*ProSIM*). (a) Generation of the feature/protein distance matrix and the ‘*template map*’; (b) production of *ProSIM* (based on the PDM) and *ProMAP* (based on the *template map*) for each protein. On the one hand, a method realizing image-like protein representation (*ProMAP*) was constructed to capture the intrinsic correlations among protein features. As illustrated, a *template map* for each protein was *first* constructed by a consecutive process of ‘*protein representation*’ using PROFEAT, ‘*similarity calculation*’ using cosine similarity, ‘*dimensionality reduction*’ using UMAP or PCA, ‘*coordinate allocation*’ using the *Jonker-Volgenant algorithm*, etc. Then, *ProMAP* was produced for each protein by mapping the intensities of all protein features to their corresponding locations in the constructed *template map* (illustrated on the right side of Fig. 3b). On the other hand, an approach considering the global relevance among proteins (*ProSIM*) was proposed to convert an ‘independent’ vector to a ‘globally-relevant’ protein representation. As shown, a *protein distance matrix* (PDM) was first generated by a consecutive process of ‘*protein representation*’ using PROFEAT and ‘*similarity calculation*’ using cosine similarity. Finally, *ProSIM* was generated for each protein by directly retrieving the corresponding row of the newly generated PDM (shown on the left side of Fig. 3b).
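
The two representations in this caption can be illustrated with a small sketch. It assumes that each protein is already described by a numeric feature vector (the caption names PROFEAT for this step) and that the template map is already available as a list of grid coordinates, one per feature. The dimensionality reduction and the Jonker-Volgenant coordinate allocation are not reproduced here; an assignment solver such as SciPy's `linear_sum_assignment` could plausibly fill that role. All function names and toy values are hypothetical, not taken from the AnnoPRO code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prosim(protein_features):
    """Protein-by-protein cosine matrix; row i is the ProSIM-style vector of protein i."""
    return [[cosine_similarity(u, v) for v in protein_features]
            for u in protein_features]


def promap(feature_values, template, height, width):
    """Place each feature's intensity at its (row, col) cell given by the template."""
    image = [[0.0] * width for _ in range(height)]
    for feature_index, (row, col) in enumerate(template):
        image[row][col] = feature_values[feature_index]
    return image


# Toy descriptors: 3 proteins x 4 features (made-up values).
proteins = [
    [1.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 4.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
]
# Hypothetical template: feature i -> grid cell, as if already allocated.
template = [(0, 0), (1, 1), (0, 1), (1, 0)]

prosim_vectors = prosim(proteins)
promap_images = [promap(p, template, 2, 2) for p in proteins]
```

In this toy example the first two proteins are parallel vectors, so their ProSIM-style rows are nearly identical, while the third shares no features with them and scores zero against both.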

**Fig. 4**
A comparison of the performances of *AnnoPRO* and three representative methods. Performance was measured using AUC values in predicting experimentally validated new protein functions that were not included in the CAFA4 data, and the performances of *AnnoPRO*, *DeepGOPlus*, *NetGO3* and *PFmulDL* are highlighted in light red, light green, orange and light blue, respectively. For GO families in the ‘*Head Label Levels*’ (LEVEL 2 and LEVEL 3 provided in Additional file 1: Fig. S3), the performance of *AnnoPRO* was roughly as good as that of the other three methods (1.4 ~ 4.1% improvements in most cases, but a 0.1% decline in one single case). For the GO families in the ‘*Tail Label Levels*’ (LEVEL 4 to LEVEL 10 shown in Additional file 1: Fig. S3), *AnnoPRO* demonstrated consistently superior performance among the four methods (1.7 ~ 28.2% improvements in all cases). In particular, 13 (61.9%) of the 21 improvements were over 5%, and 6 (28.6%) of the 21 were more than 10%. Therefore, *AnnoPRO* was identified as *superior* in significantly improving the annotation performance for families in the ‘*Tail Label Levels*’ without sacrificing that for the ‘*Head Label Levels*’, which is expected to contribute to solving the long-standing ‘*long-tail problem*’ [18] in functional annotation.
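
For readers unfamiliar with the metric, the AUC used in this comparison can be computed from its pairwise definition. The caption does not say how the authors computed it, so the sketch below is only the standard textbook formulation, with a hypothetical function name and made-up labels and scores.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example for one GO family: 1 = protein has the function, 0 = it does not.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Here five of the six positive/negative pairs are ordered correctly, so the toy AUC is 5/6.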

**Fig. 5**
Performance assessment of four methods using two well-known *growth differentiation factors* (GDF8, GDF11). As reported, the interaction between GDF8 and follistatin-288 (FS288) forms a protein complex that binds ‘heparin’, which defines the molecular mechanism underlying GDF8’s key GO family: ‘*heparin binding*’ (GO:0008201) [52]. Unlike GDF8, the varied residues in GDF11 make it unable to interact with FS288, and it therefore lacks the ‘*heparin binding*’ function [53]. (a) Sequence alignment between GDF8 and GDF11, where residues that vary between the two GDFs are marked with light green and blue backgrounds, respectively. Three residue pairs (F315Y, V316M, and L318M, on the binding surface between GDF8 and FS288), which were found to be key residues indicating the GDFs’ ‘*heparin binding*’ function [55], are shown with a pink background. (b) Structure superimposition of GDF8 (light green) and GDF11 (blue) and their interactions with FS288 (gray surface). As highlighted in pink, the three residue pairs (F315Y, V316M, L318M) are located at the binding interface between GDF and FS288. (c) Function annotation results predicted by the methods. If a GO family is successfully predicted by a method, a colored circle indicates that prediction result. In particular, a successful prediction made by *AnnoPRO*, *NetGO3*, *PFmulDL* or *DeepGOPlus* is indicated by a circle of light red, orange, light blue or light green, respectively. As shown, *AnnoPRO* is the only method that successfully predicts all GO families for both GDFs.

**Fig. 6**
A comparison of the performances of *AnnoPRO* and three methods (*DeepGOPlus*, *PFmulDL*, and *BPM*) under six GO categories using the same sub-datasets and partition strategy as a previous publication [32]. *BPM*: the best-performing methods for the ‘ontology-based PFP benchmark’ in that original publication. Performance was assessed based on F_max, and the performances of *AnnoPRO*, *BPM*, *DeepGOPlus*, and *PFmulDL* are highlighted in light red, orange, light green, and light blue, respectively. Each four-pointed star marks the best-performing model under a particular GO category and GO class. (a) *Biological Process*; (b) *Molecular Function*; and (c) *Cellular Component*. As illustrated, *AnnoPRO* demonstrated the best performance in the vast majority (17 out of 18) of the studied combinations of GO category and GO class.
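
F_max, the metric named in this caption, is commonly defined in the CAFA evaluations as the maximum over score thresholds of the harmonic mean of protein-centric precision and recall. The sketch below follows that common definition and is not taken from the authors' evaluation code. The function name, the threshold grid, and the toy data are assumptions, and every protein is assumed to have at least one true GO term.

```python
def f_max(true_terms, predicted_scores, thresholds=None):
    """Protein-centric F_max.

    true_terms: dict protein -> non-empty set of true GO terms.
    predicted_scores: dict protein -> dict GO term -> score in [0, 1].
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []  # only proteins with at least one prediction at t
        recalls = []     # all proteins
        for protein, truth in true_terms.items():
            scores = predicted_scores.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(truth))
        if not precisions:
            continue  # no protein has any prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy, made-up annotations and scores with placeholder term names.
toy_truth = {"P1": {"GO:a", "GO:b"}, "P2": {"GO:c"}}
toy_pred = {
    "P1": {"GO:a": 0.9, "GO:b": 0.4, "GO:c": 0.2},
    "P2": {"GO:c": 0.8, "GO:a": 0.6},
}
toy_fmax = f_max(toy_truth, toy_pred)
```

Averaging precision only over proteins that receive a prediction, while averaging recall over all proteins, is the design choice that makes the metric penalize methods that stay silent on hard proteins.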

References

1. Huang J, Lin Q, Fei H, He Z, Xu H, Li Y, et al. Discovery of deaminase functions by structure-based protein clustering. Cell. 2023;186:3182–3195.
2. Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun. 2021;12:3168.
3. Espinosa-Cantú A, Cruz-Bonilla E, Noda-Garcia L, DeLuna A. Multiple forms of multifunctional proteins in health and disease. Front Cell Dev Biol. 2020;8:451.
4. UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–D531.
5. Colin PY, Kintses B, Gielen F, Miton CM, Fischer G, Mohamed MF, et al. Ultrahigh-throughput discovery of promiscuous enzymes by picodroplet functional metagenomics. Nat Commun. 2015;6:10008.
