FusionEncoder: identification of intrinsically disordered regions based on multi-feature fusion

Sicen Liu¹, Shutao Chen², Tao Bai²,³, Bin Liu¹,²,⁴

Affiliations

¹ SMBU-MSU-BIT Joint Laboratory on Bioinformatics and Engineering Biology, Shenzhen MSU-BIT University, Shenzhen, Guangdong 518172, China.
² School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.
³ School of Mathematics and Computer Science, Yan'an University, Shaanxi 716000, China.
⁴ Zhongguancun Academy, Beijing 100094, China.

PMID: 40577786
PMCID: PMC12231546
DOI: 10.1093/bioinformatics/btaf362

Sicen Liu et al. Bioinformatics. 2025 Jul 1;41(7):btaf362. doi: 10.1093/bioinformatics/btaf362.

Abstract

Motivation: Intrinsically disordered regions (IDRs) are widely distributed in proteins and play a significant role in diverse biological processes, so accurately predicting them is essential for analyzing protein structure and function. Amino acid feature extraction serves as a foundational step in developing computational predictors. Existing methods typically rely on traditional biological features (e.g. PSSM) or use pre-trained protein language models (PPLMs) to capture sequence semantic information, often resorting to straightforward feature concatenation when both are used. However, these approaches fail to capture the multi-semantic interactions between traditional biological features and PPLMs-based features.

Results: In this study, we propose FusionEncoder, a method for integrating the traditional biological features and PPLMs-based features of a protein. FusionEncoder is a fusion network built on a variant of long short-term memory (LSTM). We treat traditional biological features and PPLMs-based features as two types of semantic input within a "multi-semantic" space. Traditional features are fed into the cell state of the LSTM, while PPLMs-based features are fed into its input. A fusion cell then fuses these two types of features. This strategy leverages the ability of LSTM to encode long sequences, enhancing context-aware semantic learning of amino acid sequences. Finally, a transformer-based encoder layer is employed to predict the IDRs. Evaluation on four independent test datasets indicates that FusionEncoder clearly improves the accuracy of amino acid feature representation and achieves superior performance compared with other existing methods.
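The fusion idea described above can be illustrated with a minimal sketch. The abstract does not give the FusionCell equations, so everything below is an assumption made for illustration: the class name `FusionCellSketch`, the extra gate that scales the traditional-feature term, the dimensions, and the random weights are all hypothetical. The sketch only shows one plausible way to route PPLMs-based features through the LSTM input while injecting traditional features into the cell state. The transformer-based encoder and output layer are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Return weights @ x + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FusionCellSketch:
    """Hypothetical LSTM-style fusion cell; NOT the authors' implementation."""

    def __init__(self, plm_dim, trad_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Standard LSTM gates (input, forget, output, candidate) over [x_plm ; h_prev].
        self.gate_w = {g: init_matrix(hidden_dim, plm_dim + hidden_dim, rng) for g in "ifog"}
        self.gate_b = {g: [0.0] * hidden_dim for g in "ifog"}
        # ASSUMPTION: traditional features enter the cell state through a tanh
        # projection that is scaled by an extra sigmoid "fusion" gate.
        self.trad_w = init_matrix(hidden_dim, trad_dim, rng)
        self.trad_b = [0.0] * hidden_dim
        self.fuse_w = init_matrix(hidden_dim, trad_dim + hidden_dim, rng)
        self.fuse_b = [0.0] * hidden_dim

    def step(self, x_plm, x_trad, h_prev, c_prev):
        z = x_plm + h_prev  # list concatenation: PPLM features feed the input path
        i = [sigmoid(v) for v in linear(self.gate_w["i"], self.gate_b["i"], z)]
        f = [sigmoid(v) for v in linear(self.gate_w["f"], self.gate_b["f"], z)]
        o = [sigmoid(v) for v in linear(self.gate_w["o"], self.gate_b["o"], z)]
        g = [math.tanh(v) for v in linear(self.gate_w["g"], self.gate_b["g"], z)]
        # Traditional features are projected and gated, then added to the cell state.
        t = [math.tanh(v) for v in linear(self.trad_w, self.trad_b, x_trad)]
        k = [sigmoid(v) for v in linear(self.fuse_w, self.fuse_b, x_trad + h_prev)]
        c = [f[j] * c_prev[j] + i[j] * g[j] + k[j] * t[j] for j in range(self.hidden_dim)]
        h = [o[j] * math.tanh(c[j]) for j in range(self.hidden_dim)]
        return h, c

    def run(self, plm_seq, trad_seq):
        """Process one protein residue by residue; return one hidden vector per residue."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x_plm, x_trad in zip(plm_seq, trad_seq):
            h, c = self.step(x_plm, x_trad, h, c)
            outputs.append(h)
        return outputs


# Toy usage with random stand-ins for real per-residue features.
rng = random.Random(1)
seq_len, plm_dim, trad_dim, hidden_dim = 5, 8, 4, 6
plm_feats = [[rng.gauss(0, 1) for _ in range(plm_dim)] for _ in range(seq_len)]
trad_feats = [[rng.gauss(0, 1) for _ in range(trad_dim)] for _ in range(seq_len)]
cell = FusionCellSketch(plm_dim, trad_dim, hidden_dim)
outputs = cell.run(plm_feats, trad_feats)
print(len(outputs), len(outputs[0]))
```

In this sketch each residue receives a hidden vector that depends on both feature types and on the preceding residues. In a full model such vectors would presumably be passed to the transformer-based encoder and a per-residue classifier, neither of which is modeled here.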

Availability and implementation: To facilitate accessibility for experimental researchers, a user-friendly and publicly available webserver for the FusionEncoder predictor has been deployed at http://bliulab.net/FusionEncoder/. FusionEncoder is expected to serve as a valuable tool for the accurate identification of IDRs.

Figures

**Figure 1.**
The differences between existing protein sequence feature extraction methods and the proposed FusionEncoder method. (a) Methods that rely on traditional biological property features. (b) Methods that rely on PPLMs for protein sequence feature extraction. (c) Methods that directly combine traditional biological property features and PPLMs-based protein sequence features. (d) Our proposed FusionEncoder method, which performs multi-semantic interaction between the feature space derived from traditional biological properties and the feature space derived from PPLMs, and then performs protein sequence feature extraction.

**Figure 2.**
The framework of FusionEncoder. (a) The residue feature extraction process. This stage involves traditional biological feature extraction (PSSM, AAindex, and energy-based methods) and PPLMs-based feature extraction (ESM, ProtT5, DR-BERT, and OntoProtein methods). (b) Multi-semantic interaction layer. In this layer, traditional biological features and PPLMs-based features are fused using the FusionCell module, while long short-term memory (LSTM) facilitates information transfer between residues. (c) Encoder layer. A transformer-based encoder module is employed to encode protein sequences. (d) The output layer is utilized to perform predictions for IDRs.

**Figure 3.**
Ablation study of various feature combinations on the validation dataset. Abbreviations are used to keep the figure compact: P, PSSM; A, AAindex; e, energy; E2, ESM2; T5, ProtT5; DR, DR-BERT; OP, OntoProtein; w/o, without.

**Figure 4.**
The t-SNE visualization of nine protein representations from the DISORDER723 dataset.
