Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE
- PMID: 36694239
- PMCID: PMC9875434
- DOI: 10.1186/s12915-023-01510-8
Abstract
Background: Protein solubility is a precondition for efficient heterologous protein expression, which underlies most industrial applications, and for functional interpretation in basic research. However, the recurrent formation of inclusion bodies remains a persistent roadblock in protein science and industry: only about a quarter of proteins can be successfully expressed in soluble form. Although numerous solubility prediction models have been developed, their performance remains unsatisfactory given the rapid growth in available protein sequences. Hence, novel and highly accurate predictors are needed to prioritize highly soluble proteins and reduce the cost of experimental work.
Results: In this study, we developed a novel tool, DeepSoluE, which predicts protein solubility using a long short-term memory (LSTM) network with hybrid features composed of physicochemical patterns and distributed representations of amino acids. Comparisons showed that the proposed model achieved more accurate and more balanced performance than existing tools. Furthermore, we explored the specific features that have a dominant impact on model performance, as well as their interaction effects.
Conclusions: DeepSoluE is suitable for predicting protein solubility in E. coli; it serves as a bioinformatics tool for prescreening potentially soluble targets to reduce the cost of wet-lab experiments. The web server is freely accessible at http://lab.malab.cn/~wangchao/softs/DeepSoluE/ .
Keywords: Feature embedding; Interpretation; Machine learning; Protein solubility.
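The idea summarized in the Results, an LSTM reading per-residue hybrid features, can be illustrated with a minimal pure-Python sketch. This is not the DeepSoluE implementation: the Kyte-Doolittle hydropathy scale stands in for the physicochemical patterns, a random lookup table stands in for the learned distributed representation, and all names, dimensions, and weights (EMBED_DIM, HIDDEN, hybrid_features, predict_solubility) are hypothetical and untrained.

```python
import math
import random

# Kyte-Doolittle hydropathy scale, used here as one example physicochemical index.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

EMBED_DIM = 4  # hypothetical embedding size
HIDDEN = 3     # hypothetical LSTM hidden size

rng = random.Random(0)
# Assumption: random vectors stand in for a learned amino acid embedding.
EMBEDDING = {aa: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for aa in HYDROPATHY}


def hybrid_features(seq):
    """Per-residue vector: scaled hydropathy followed by the embedding."""
    return [[HYDROPATHY[aa] / 4.5] + EMBEDDING[aa] for aa in seq]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim, hidden):
    """Random, untrained weights for the input, forget, cell, and output gates."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(hidden, input_dim + hidden), [0.0] * hidden) for g in "ifgo"}


def lstm_step(params, x, h, c):
    """One standard LSTM cell update on input x with state (h, c)."""
    z = x + h  # concatenate the input with the previous hidden state

    def gate(name, act):
        weights, bias = params[name]
        return [act(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(weights, bias)]

    i = gate("i", sigmoid)
    f = gate("f", sigmoid)
    g = gate("g", math.tanh)
    o = gate("o", sigmoid)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def predict_solubility(seq, params, w_out, b_out=0.0):
    """Run the LSTM over the sequence and map the last hidden state to (0, 1)."""
    h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
    for x in hybrid_features(seq):
        h, c = lstm_step(params, x, h, c)
    return sigmoid(sum(w * v for w, v in zip(w_out, h)) + b_out)


PARAMS = init_lstm(1 + EMBED_DIM, HIDDEN)
W_OUT = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
score = predict_solubility("MKTAYIAKQR", PARAMS, W_OUT)
```

Because the weights are random, the score carries no biological meaning; the sketch only shows how the two feature types can be concatenated per residue and consumed sequentially by a recurrent cell.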
© 2023. The Author(s).
Conflict of interest statement
The authors declare that they have no competing interests.
Similar articles
- DeepAc4C: a convolutional neural network model with hybrid features composed of physicochemical patterns and distributed representation information for identification of N4-acetylcytidine in mRNA. Bioinformatics. 2021 Dec 22;38(1):52-57. doi: 10.1093/bioinformatics/btab611. PMID: 34427581
- Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Brief Bioinform. 2014 Nov;15(6):953-62. doi: 10.1093/bib/bbt057. Epub 2013 Aug 7. PMID: 23926206 Review.
- DSResSol: A Sequence-Based Solubility Predictor Created with Dilated Squeeze Excitation Residual Networks. Int J Mol Sci. 2021 Dec 17;22(24):13555. doi: 10.3390/ijms222413555. PMID: 34948354 Free PMC article.
- Enhancer-FRL: Improved and Robust Identification of Enhancers and Their Activities Using Feature Representation Learning. IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):967-975. doi: 10.1109/TCBB.2022.3204365. Epub 2023 Apr 3. PMID: 36063523
- A comprehensive review of the imbalance classification of protein post-translational modifications. Brief Bioinform. 2021 Sep 2;22(5):bbab089. doi: 10.1093/bib/bbab089. PMID: 33834199 Review.
Cited by
- PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset. Brief Bioinform. 2024 Jul 25;25(5):bbae404. doi: 10.1093/bib/bbae404. PMID: 39179250 Free PMC article.
- Protein engineering in the computational age: An open source framework for exploring mutational landscapes in silico. Eng Biol. 2023 Dec 7;7(1-4):29-38. doi: 10.1049/enb2.12028. eCollection 2023 Dec. PMID: 38094241 Free PMC article.
- GRACE: Generative Redesign in Artificial Computational Enzymology. ACS Synth Biol. 2024 Dec 20;13(12):4154-4164. doi: 10.1021/acssynbio.4c00624. Epub 2024 Nov 8. PMID: 39513550 Free PMC article.
- MV-CVIB: a microbiome-based multi-view convolutional variational information bottleneck for predicting metastatic colorectal cancer. Front Microbiol. 2023 Aug 22;14:1238199. doi: 10.3389/fmicb.2023.1238199. eCollection 2023. PMID: 37675425 Free PMC article.
- The Convergence of Radiology and Genomics: Advancing Breast Cancer Diagnosis with Radiogenomics. Cancers (Basel). 2024 Mar 6;16(5):1076. doi: 10.3390/cancers16051076. PMID: 38473432 Free PMC article. Review.