BMC Bioinformatics. 2014 May 8;15:134.
doi: 10.1186/1471-2105-15-134.

A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

Narjeskhatoon Habibi et al. BMC Bioinformatics. 2014.

Abstract

Background: Over the last 20 years, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low production levels and frequent failure for unclear reasons. The problem is accentuated because no generic solution is available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental conditions, including temperature, expression host, etc., protein solubility is a feature ultimately determined by its sequence. To date, numerous machine learning methods have been proposed to predict the solubility of a protein from its amino acid sequence alone. Despite 20 years of research on the matter, no comprehensive review of the published methods is available.

Results: This paper presents an extensive review of the existing models for predicting protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared with respect to the datasets used, features, feature selection methods, machine learning techniques and prediction accuracy. A discussion of the models is provided at the end.

Conclusions: This study aims to investigate extensively the machine learning based methods for predicting recombinant protein solubility, so as to offer both a general and a detailed understanding for researchers in the field. Some of the models offer acceptable prediction performance and convenient user interfaces. These models can be considered valuable tools for predicting recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.
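As a rough illustration of the sequence-only setting described in this abstract, the sketch below computes amino acid composition, one simple sequence-derived feature, and a plain accuracy score for comparing predicted and observed solubility labels. It is not the method of any particular model in the review; the function names `composition` and `accuracy` and the example sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue in the sequence.

    Characters outside the 20 standard one-letter codes are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def accuracy(predicted: list[bool], observed: list[bool]) -> float:
    """Fraction of proteins whose predicted label matches the observed one."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


if __name__ == "__main__":
    # Hypothetical toy sequence; real inputs would be full protein sequences.
    features = composition("MKKLLA")
    print(dict(zip(AMINO_ACIDS, features)))
```

A feature vector like this would typically be passed to a classifier trained on proteins with known soluble or insoluble outcomes; the choice of classifier and of any additional features is left open here.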
