Genome Biol. 2024 Feb 1;25(1):41.
doi: 10.1186/s13059-024-03166-1.

AnnoPRO: a strategy for protein function annotation based on multi-scale protein representation and a hybrid deep learning of dual-path encoding

Lingyan Zheng et al. Genome Biol.

Abstract

Protein function annotation is a longstanding problem in the biological sciences, and various computational methods have been developed to address it. However, existing methods suffer from a serious long-tail problem: a large number of GO families contain only a few annotated proteins. To address this, a strategy named AnnoPRO was constructed that combines sequence-based multi-scale protein representation, dual-path protein encoding using pre-training, and function annotation by long short-term memory-based decoding. Case studies on different benchmarks confirmed that AnnoPRO outperformed the available methods. Source code and models are freely available at https://github.com/idrblab/AnnoPRO and https://zenodo.org/records/10012272.

Keywords: LSTM; Long-tail problem; Pre-training; Protein function annotation; Protein representation.

Conflict of interest statement

P.F., Z.Y.Z., S.Z. and Z.R.L. are employed by Alibaba. The authors declare no competing interests.

Figures

Fig. 1
Average number of proteins (ANP) in the GO families of nine different levels (LEVEL 2 to LEVEL 10, as shown in Additional file 1: Fig. S3). The ANP declines clearly from the top level (LEVEL 2) to the bottom level (LEVEL 10). Since the ANP of a family indicates its representativeness among all families, the figure shows that families become gradually less representative at deeper levels. The nine levels can therefore be classified into two groups based on their ANPs: the “Head Label Levels” (ANP of their GO families ≥ 2,000) and the “Tail Label Levels” (ANP of their GO families < 2,000). As shown, the total number of GO families in the “Tail Label Levels” (5,323) is more than 10 times that in the “Head Label Levels” (459), and this data distribution induces the serious ‘long-tail problem’ described in a previous pioneering publication [18].
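The head/tail grouping described in the Fig. 1 caption can be sketched in a few lines. This is only an illustration of the stated rule (ANP ≥ 2,000 versus ANP < 2,000); the per-level ANP values below are made-up placeholders, not the values plotted in the figure, and the function name `split_levels` is hypothetical.

```python
ANP_THRESHOLD = 2000  # cutoff stated in the Fig. 1 caption


def split_levels(anp_by_level, threshold=ANP_THRESHOLD):
    """Split GO levels into head (ANP >= threshold) and tail (ANP < threshold)."""
    head = sorted(lvl for lvl, anp in anp_by_level.items() if anp >= threshold)
    tail = sorted(lvl for lvl, anp in anp_by_level.items() if anp < threshold)
    return head, tail


# Placeholder ANP values for four levels (assumed for illustration only).
example_anp = {2: 9000.0, 3: 2500.0, 4: 1500.0, 5: 800.0}
head_levels, tail_levels = split_levels(example_anp)

# Family counts quoted in the caption: 5,323 tail families vs. 459 head families.
tail_to_head_ratio = 5323 / 459
print(head_levels, tail_levels, round(tail_to_head_ratio, 1))
```

With the counts quoted in the caption, the tail-to-head ratio of GO families is about 11.6, consistent with the "more than 10 times" statement.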
Fig. 2
The hybrid deep learning framework of three consecutive modules (M1 to M3) adopted in this study. (M1) Sequence-based multi-scale protein representation, which converts all protein sequences to feature similarity-based images (ProMAP) and protein similarity-based vectors (ProSIM). (M2) Dual-path protein encoding based on pre-training. Using the ProMAP and ProSIM generated for all sequences, a dual-path encoding strategy was constructed from a seven-channel Convolutional Neural Network (7C-CNN) and a Deep Neural Network of five fully-connected layers (5FC-DNN) to pre-train the features of all CAFA4 proteins by integrating their GO family annotation data. (M3) Functional annotation by LSTM-based decoding. The protein features pre-trained by the dual-path encoding layer in M2 were concatenated and then fed into a long short-term memory recurrent neural network (LSTM) to enable multi-label annotation of proteins to 6,109 functional GO families.
Fig. 3
A schematic illustration of the procedure used in this study for sequence-based multi-scale protein representation, showing how sequences were converted to a feature similarity-based image (ProMAP) and a protein similarity-based vector (ProSIM). (a) Generation of the feature/protein distance matrix and the ‘template map’; (b) production of ProSIM (based on the PDM) and ProMAP (based on the template map) for each protein. On the one hand, a method for image-like protein representation (ProMAP) was constructed to capture the intrinsic correlations among protein features. As illustrated, a template map for each protein was first constructed by a consecutive process of ‘protein representation’ using PROFEAT, ‘similarity calculation’ using cosine similarity, ‘dimensionality reduction’ using UMAP or PCA, ‘coordinate allocation’ using the Jonker-Volgenant algorithm, etc. ProMAP was then produced for each protein by mapping the intensities of all protein features to their corresponding locations in the constructed template map (illustrated on the right side of Fig. 3b). On the other hand, an approach that considers the global relevance among proteins (ProSIM) was proposed to convert an ‘independent’ vector into a ‘globally-relevant’ protein representation. As shown, a protein distance matrix (PDM) was first generated by the consecutive process of ‘protein representation’ using PROFEAT and ‘similarity calculation’ using cosine similarity. Finally, ProSIM was generated for each protein by retrieving the corresponding row of the newly generated PDM (shown on the left side of Fig. 3b).
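The ‘similarity calculation’ step named in the Fig. 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature matrix is a made-up stand-in for PROFEAT descriptors, the function names are hypothetical, and the raw cosine similarity is kept as the matrix entry because the caption does not say whether the PDM stores similarities or a derived distance. The dimensionality reduction and coordinate allocation steps are omitted.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_matrix(vectors):
    """Square matrix of cosine similarities between all pairs of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


def protein_matrix(features):
    """Protein-by-protein matrix; row i plays the role of ProSIM for protein i."""
    return pairwise_matrix(features)


def feature_matrix(features):
    """Feature-by-feature matrix, computed on the columns of the input."""
    columns = [list(col) for col in zip(*features)]
    return pairwise_matrix(columns)


# Made-up descriptors: 3 proteins (rows) x 4 features (columns).
features = [
    [1.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 4.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
]
pdm = protein_matrix(features)
prosim_first_protein = pdm[0]       # one row per protein
fdm = feature_matrix(features)      # input to the omitted template-map steps
```

In this toy input the first two proteins have parallel descriptor vectors, so their similarity is 1 (up to floating-point error), while the third is orthogonal to both and scores 0.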
Fig. 4
A comparison of the performances of AnnoPRO and three representative methods. Performance was measured as the AUC in predicting experimentally validated new protein functions that were not included in the CAFA4 data, and the performances of AnnoPRO, DeepGOPlus, NetGO3 and PFmulDL are highlighted in light red, light green, orange and light blue, respectively. For GO families in the ‘Head Label Levels’ (LEVEL 2 and LEVEL 3 in Additional file 1: Fig. S3), the performance of AnnoPRO was roughly as good as that of the other three methods (improvements of 1.4% to 4.1% in most cases, but a 0.1% decline in one case). For GO families in the ‘Tail Label Levels’ (LEVEL 4 to LEVEL 10 in Additional file 1: Fig. S3), AnnoPRO consistently performed best among the four methods (improvements of 1.7% to 28.2% in all cases). In particular, 13 (61.9%) of the 21 improvements were over 5%, and 6 (28.6%) of the 21 were over 10%. AnnoPRO therefore substantially improved the annotation performance for families in the ‘Tail Label Levels’ without sacrificing that for the ‘Head Label Levels’, which is expected to contribute to solving the long-standing ‘long-tail problem’ [18] in functional annotation.
Fig. 5
Performance assessment of four methods using two well-known growth differentiation factors (GDF8, GDF11). As reported, GDF8 interacts with follistatin-288 (FS288) to form a protein complex that binds ‘heparin’, which defines the molecular mechanism underlying GDF8’s key GO family: ‘heparin binding’ (GO:0008201) [52]. Unlike GDF8, GDF11 has varied residues that make it unable to interact with FS288, and it therefore lacks the ‘heparin binding’ function [53]. (a) Sequence alignment between GDF8 and GDF11, where the residues that vary between the two GDFs are marked with light green and blue backgrounds, respectively. Three residue pairs (F315Y, V316M, and L318M, on the binding surface between GDF8 and FS288), which were found to be key residues indicating the ‘heparin binding’ function of GDFs [55], are shown with a pink background. (b) Structure superimposition of GDF8 (light green) and GDF11 (blue) and their interactions with FS288 (gray surface). As highlighted with a pink background, the three residue pairs (F315Y, V316M, L318M) are located at the binding interface between GDF and FS288. (c) Function annotation results predicted by the methods. A GO family successfully predicted by a method is indicated by a colored circle: light red for AnnoPRO, orange for NetGO3, light blue for PFmulDL, and light green for DeepGOPlus. As shown, AnnoPRO is the only method that successfully predicted all GO families for both GDFs.
Fig. 6
A comparison of the performances of AnnoPRO and three methods (DeepGOPlus, PFmulDL, and BPM) under six GO categories, using the same sub-datasets and partition strategy as a previous publication [32]. BPM: the best-performing methods for the ‘ontology-based PFP benchmark’ in that original publication. Performance was assessed using Fmax, and the performances of AnnoPRO, BPM, DeepGOPlus, and PFmulDL are highlighted in light red, orange, light green, and light blue, respectively. Each four-pointed star marks the best-performing model under a particular GO category and GO class. (a) Biological Process; (b) Molecular Function; and (c) Cellular Component. As illustrated, AnnoPRO performed best in the vast majority (17 out of 18) of the studied GO categories.
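For readers unfamiliar with the Fmax metric named in the Fig. 6 caption, the sketch below follows the protein-centric definition commonly used in CAFA-style evaluations: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all proteins, and Fmax is the best resulting F-measure. Whether the benchmark in [32] applies additional steps (for example, propagating annotations through the GO hierarchy) is not stated here, so this is an assumption-laden illustration. The function name, the threshold grid, and the toy data are all hypothetical.

```python
def fmax(true_terms, scores, thresholds=None):
    """Protein-centric Fmax.

    true_terms: dict protein -> non-empty set of true GO terms
    scores:     dict protein -> dict of GO term -> predicted score in [0, 1]
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]  # assumed grid: 0.01 .. 1.00
    best = 0.0
    for t in thresholds:
        precisions = []   # only proteins with at least one prediction at t
        recall_sum = 0.0  # accumulated over all proteins
        for protein, truth in true_terms.items():
            predicted = {go for go, s in scores.get(protein, {}).items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(truth)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(true_terms)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data with placeholder GO identifiers (not real annotations).
toy_truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
toy_scores = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:C": 0.5},
    "P2": {"GO:C": 0.8, "GO:A": 0.3},
}
toy_fmax = fmax(toy_truth, toy_scores)
```

On this toy input the best threshold lies just above 0.3, where the average precision is 5/6 and the average recall is 1, giving an Fmax of 10/11 (about 0.909).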
