BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S21.
doi: 10.1186/1471-2105-11-S1-S21.

Learning to predict expression efficacy of vectors in recombinant protein production

Wen-Ching Chan et al. BMC Bioinformatics.

Abstract

Background: Recombinant protein production is a useful biotechnology for producing large quantities of highly soluble proteins. Currently, the most widely used production system fuses a target protein to different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies across target proteins, and trial and error is still the common way to determine the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of the expressed proteins. In fact, many pairings of vectors and proteins result in no expression.

Results: In this study, we applied machine learning to train models that predict whether a vector-protein pairing will be expressed in E. coli. For expressed cases, the models further predict whether the expressed protein will be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used for both training and evaluation of our models. We evaluate three different models based on support vector machines (SVMs) and their ensembles. Unlike many previous works, these models use the sequence of the target protein as well as the sequence of the whole fusion vector as features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. Compared with previous works, our best method still achieves the highest prediction accuracy. Lastly, we briefly present two methods for using the trained model in the design of recombinant protein production systems to improve the chance of producing highly soluble protein.
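
As an illustration only, the two-stage decision implied by the nested class structure (expression first, then solubility) might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the two-element feature vectors, and the threshold rules are hypothetical stand-ins for trained SVM classifiers and sequence-derived features.

```python
from typing import Callable, Sequence

Features = Sequence[float]
BinaryModel = Callable[[Features], bool]


def predict_hierarchical(features: Features,
                         is_expressed: BinaryModel,
                         is_soluble: BinaryModel) -> str:
    """Assign one of three classes with two nested binary decisions."""
    # Stage 1: does the vector-protein pairing express at all?
    if not is_expressed(features):
        return "no expression"
    # Stage 2: only expressed cases are checked for solubility.
    return "soluble" if is_soluble(features) else "inclusion body"


# Hypothetical stand-ins for trained binary classifiers (NOT real SVMs):
# each one thresholds a single made-up feature value.
def toy_is_expressed(features: Features) -> bool:
    return features[0] > 0.5


def toy_is_soluble(features: Features) -> bool:
    return features[1] > 0.5


# Three made-up feature vectors, one per class.
cases = [(0.2, 0.9), (0.8, 0.1), (0.8, 0.9)]
labels = [predict_hierarchical(c, toy_is_expressed, toy_is_soluble)
          for c in cases]
print(labels)
```

A flat three-class model would instead map the features directly to one of the three labels in a single step; the abstract reports that this flat formulation performed best on the authors' data.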

Conclusion: In this paper, we show that a machine learning approach to predicting the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven studies of efficacy. We will release our program to share with other labs in the public domain when this paper is published.

Figures

Figure 1
Hierarchical structure comprising three expression levels in SDS-PAGE. The hierarchy consists of the soluble fraction, the inclusion fraction, and non-expression.
Figure 2
Six fusion vectors used in HTP systems. Cloning and expression regions of the six expression vectors used in this work, with the corresponding insertion locations of target proteins. The recombinant fusion vectors are named after their six fusion tags: (A) calmodulin-binding peptide (CBP), (B) glutathione S-transferase (GST), (C) N utilization substance A (NusA), (D) Histidine (His), (E) maltose-binding protein (MBP), and (F) thioredoxin (Trx).
Figure 3
Comparative analysis with PRC curves. PRC curves from the comparative analysis of solubility prediction methods.
Figure 4
Comparative analysis with ROC curves. ROC curves from the comparative analysis of solubility prediction methods.
Figure 5
Multiple sequence alignment for one mutation case. Result of the multiple sequence alignment (MSA) for one case that was predicted to be soluble after mutating 5 nucleotides.
