Learning to predict expression efficacy of vectors in recombinant protein production
- PMID: 20122193
- PMCID: PMC3009492
- DOI: 10.1186/1471-2105-11-S1-S21
Abstract
Background: Recombinant protein production is a useful biotechnology for producing large quantities of highly soluble proteins. Currently, the most widely used production system fuses a target protein into one of several vectors in Escherichia coli (E. coli). However, production efficacy varies across vectors and target proteins, and trial and error is still the common way to determine the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of the expressed proteins. In fact, many vector-protein pairings result in no expression.
Results: In this study, we applied machine learning to train models that predict whether a vector-protein pairing will express in E. coli. For expressed cases, the models further predict whether the expressed protein will be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used for both training and evaluation of our models. We evaluate three different models based on support vector machines (SVMs) and their ensembles. Unlike many previous works, these models use the sequence of the target protein as well as the sequence of the whole fusion vector as features. We show that a model that classifies a case directly into one of the three classes (no expression, inclusion body, and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. We also show that our best method achieves higher prediction accuracy than previous methods. Lastly, we briefly present two methods for using the trained model in the design of recombinant protein production systems to improve the chance of producing highly soluble protein.
Conclusion: In this paper, we show that a machine learning approach to predicting the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven study of the efficacy. We will release our program into the public domain to share with other labs when this paper is published.
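The nested decision described in the Results (first expression, then solubility) can be illustrated with a minimal sketch. The abstract does not specify how sequences are encoded, so the amino acid composition features, the representation of the fusion as a simple concatenation of a vector tag and the target, and the names `composition`, `pair_features`, and `nested_predict` are all assumptions made for illustration. The two classifier arguments are hypothetical stand-ins for trained binary SVMs.

```python
from __future__ import annotations

from typing import Callable, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> list[float]:
    """Fraction of each standard amino acid in seq (assumed encoding)."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def pair_features(tag_seq: str, target_seq: str) -> list[float]:
    """Features for one vector-protein pairing.

    Assumption: the fusion is modeled as tag + target. The vector is
    reduced to a tag sequence purely for illustration.
    """
    fusion_seq = tag_seq + target_seq
    return composition(target_seq) + composition(fusion_seq)


Classifier = Callable[[Sequence[float]], bool]  # hypothetical trained model


def nested_predict(features: Sequence[float],
                   expresses: Classifier,
                   is_soluble: Classifier) -> str:
    """Nested three-class decision: expression first, then solubility."""
    if not expresses(features):
        return "no expression"
    return "soluble" if is_soluble(features) else "inclusion body"


if __name__ == "__main__":
    feats = pair_features("MHHHHHH", "ACDEFGHIK")
    # Toy stand-ins for trained classifiers; real models would be fitted
    # on labeled vector-protein cases.
    label = nested_predict(feats, lambda f: True, lambda f: f[0] > 0.05)
    print(label)
```

A flat three-class model, which the abstract reports as the best performer, would instead map the same feature vector directly to one of the three labels with a single multi-class classifier.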
Similar articles
- A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli. BMC Bioinformatics. 2014 May 8;15:134. doi: 10.1186/1471-2105-15-134. PMID: 24885721. Free PMC article.
- Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Brief Bioinform. 2014 Nov;15(6):953-62. doi: 10.1093/bib/bbt057. Epub 2013 Aug 7. PMID: 23926206. Review.
- Escherichia coli as a versatile cell factory: Advances and challenges in recombinant protein production. Protein Expr Purif. 2024 Jul;219:106463. doi: 10.1016/j.pep.2024.106463. Epub 2024 Mar 12. PMID: 38479588. Review.
- A family of E. coli expression vectors for laboratory scale and high throughput soluble protein production. BMC Biotechnol. 2006 Mar 1;6:12. doi: 10.1186/1472-6750-6-12. PMID: 16509985. Free PMC article.
- Optimization of culture parameters and novel strategies to improve protein solubility. Methods Mol Biol. 2015;1258:45-63. doi: 10.1007/978-1-4939-2205-5_3. PMID: 25447858. Review.
Cited by
- Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S3. doi: 10.1186/1471-2105-13-S17-S3. Epub 2012 Dec 13. PMID: 23282103. Free PMC article.
- A Novel Strategy to Identify Endolysins with Lytic Activity against Methicillin-Resistant Staphylococcus aureus. Int J Mol Sci. 2023 Mar 17;24(6):5772. doi: 10.3390/ijms24065772. PMID: 36982851. Free PMC article.
- A comprehensive in silico characterization of bacterial signal peptides for the excretory production of Anabaena variabilis phenylalanine ammonia lyase in Escherichia coli. 3 Biotech. 2018 Dec;8(12):488. doi: 10.1007/s13205-018-1517-3. Epub 2018 Nov 16. PMID: 30498661. Free PMC article.
- Improving protein solubility and activity by introducing small peptide tags designed with machine learning models. Metab Eng Commun. 2020 Jun 22;11:e00138. doi: 10.1016/j.mec.2020.e00138. eCollection 2020 Dec. PMID: 32642423. Free PMC article.
- DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics. 2018 Aug 1;34(15):2605-2613. doi: 10.1093/bioinformatics/bty166. PMID: 29554211. Free PMC article.