A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli
- PMID: 24885721
- PMCID: PMC4098780
- DOI: 10.1186/1471-2105-15-134
Abstract
Background: Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low production levels and frequent failures for unclear reasons. The problem is accentuated because no generic solution is available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental conditions, including temperature, expression host, etc., protein solubility is ultimately determined by its sequence. To date, numerous machine learning based methods have been proposed to predict the solubility of a protein from its amino acid sequence alone. Despite 20 years of research on the matter, no comprehensive review of the published methods is available.
Results: This paper presents an extensive review of the existing models for predicting protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared with regard to the datasets used, features, feature selection methods, machine learning techniques and prediction accuracy. A discussion of the models is provided at the end.
Conclusions: This study aims to investigate extensively the machine learning based methods for predicting recombinant protein solubility, so as to offer both a general and a detailed understanding for researchers in the field. Some of the models offer acceptable prediction performance and convenient user interfaces. These models can be considered valuable tools for predicting recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.
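As an illustration of what predicting solubility "from the amino acid sequence alone" can involve, the following minimal Python sketch computes one simple sequence-derived feature vector (amino acid composition) and a plain accuracy score for comparing predictions with observed outcomes. It is not taken from any of the reviewed models; the function names `amino_acid_composition` and `accuracy` and the example sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# feature vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper: non-standard characters are ignored.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def accuracy(predicted: list[bool], observed: list[bool]) -> float:
    """Fraction of proteins whose predicted solubility matches the observation."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Made-up four-residue sequence, used only to show the call.
features = amino_acid_composition("MKKA")
```

A feature vector of this kind would typically be passed to a classifier; which features and classifiers perform best is the kind of comparison the review addresses.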