BMC Biotechnol. 2007 Mar 26:7:16.

doi: 10.1186/1472-6750-7-16.

Engineering proteinase K using machine learning and synthetic genes

Jun Liao¹, Manfred K Warmuth, Sridhar Govindarajan, Jon E Ness, Rebecca P Wang, Claes Gustafsson, Jeremy Minshull

PMID: 17386103
PMCID: PMC1847811
DOI: 10.1186/1472-6750-7-16

Jun Liao et al. BMC Biotechnol. 2007.

Affiliation

¹ Department of Computer Science, University of California, Santa Cruz, CA 95064, USA. liaojun@soe.ucsc.edu

Abstract

Background: Altering a protein's function by changing its sequence allows natural proteins to be converted into useful molecular tools. Current protein engineering methods are limited by a lack of high throughput physical or computational tests that can accurately predict protein activity under conditions relevant to its final application. Here we describe a new synthetic biology approach to protein engineering that avoids these limitations by combining high throughput gene synthesis with machine learning-based design algorithms.

Results: We selected 24 amino acid substitutions to make in proteinase K from alignments of homologous sequences. We then designed and synthesized 59 specific proteinase K variants containing different combinations of the selected substitutions. The 59 variants were tested for their ability to hydrolyze a tetrapeptide substrate after the enzyme was first heated to 68°C for 5 minutes. Sequence and activity data were analyzed using machine learning algorithms. This analysis was used to design a new set of variants predicted to have increased activity over the training set, which were then synthesized and tested. By performing two cycles of machine learning analysis and variant design, we obtained 20-fold improved proteinase K variants while testing a total of only 95 variant enzymes.

Conclusion: The number of protein variants that must be tested to obtain significant functional improvements determines the type of tests that can be performed. Protein engineers wishing to modify the property of a protein to shrink tumours or catalyze chemical reactions under industrial conditions have until now been forced to accept high throughput surrogate screens to measure protein properties that they hope will correlate with the functionalities that they intend to modify. By reducing the number of variants that must be tested to fewer than 100, machine learning algorithms make it possible to use more complex and expensive tests so that only protein properties that are directly relevant to the desired application need to be measured. Protein design algorithms that only require the testing of a small number of variants represent a significant step towards a generic, resource-optimized protein engineering process.

Figures

**Figure 2**
**Three cycles of proteinase K variant design and testing**. Mean activity measurements of the 3 sets of proteinase K variants are shown. Set 1 (diamonds) is the initial set of 59 variants. Set 2 (squares, 20 variants) was designed using the activities of Set 1. Set 3 (triangles, 16 variants) was designed based on Sets 1 and 2. Activities towards N-Succinyl-Ala-Ala-Pro-Leu p-nitroanilide were measured at 37°C following a 5 minute heat treatment of the enzyme at 68°C. Activities are expressed relative to the mean activity of 2 replicates of the wild-type proteinase K.

**Figure 3**
**Substitution weight mean and standard deviation values produced by the MR algorithm**. We created 1000 subsamples of the training set (the sequences and non-zero activities of variants from sets 1 and 2) by leaving out 5 randomly selected variants from each subsample. A: The MR (matching loss) algorithm was used to calculate substitution weights for each subsample. The mean values from the 1000 subsamples are indicated by horizontal notches. Error bars represent one standard deviation of the 1000 calculated substitution weights. Substitutions are indicated below the graph with the number of occurrences in the training set in parentheses. Each substitution is described by a single weight. Variant 3–4 was designed to include all substitutions with positive mean weight that occur at least 3 times in the training set (red and blue circles). Note that substitution Y194S (green circle) was not selected since it occurred fewer than 3 times in the training set. Variant 3–9 included all substitutions that occurred at least 3 times and whose mean weight was at least one standard deviation above zero (red circles only). Substitution weights calculated from the entire dataset instead of the mean of 1000 subsamples are shown as purple circles. B: The MR algorithm was used to calculate substitution weights as in A, except that models were tested by expanding each pair in turn into 4 terms and selecting the pair that most improved the model. In this example each substitution is described by a single weight except for the 3 pairs (132,208), (337,355), (267,293), which are modeled by 4 weights each. Red circles indicate the substitutions selected to design variant 3–14. Note that substitution combination I132V 208K was not selected since it occurred fewer than 3 times in the training set.
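The subsampling and selection rules described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the MR (matching loss) algorithm is not specified here, so a ridge-regularised least-squares fit on log relative activity stands in for it, and the names `fit_weights`, `subsample_weights` and `select_substitutions` are hypothetical. Each variant is encoded as a 0/1 vector marking which substitutions it carries.

```python
import random
import statistics


def fit_weights(X, y, ridge=1e-3):
    """Ridge-regularised least squares (an assumed stand-in for the MR algorithm).

    X: list of 0/1 rows (one row per variant, one column per substitution).
    y: log relative activities, so wild type (all zeros) maps to 0.
    """
    n = len(X[0])
    # Normal equations: (X^T X + ridge * I) w = X^T y
    A = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - tail) / A[i][i]
    return w


def subsample_weights(X, y, n_subsamples=1000, leave_out=5, seed=0):
    """Fit weights on subsamples that each omit `leave_out` random variants."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_subsamples):
        dropped = set(rng.sample(range(len(X)), leave_out))
        Xs = [row for i, row in enumerate(X) if i not in dropped]
        ys = [t for i, t in enumerate(y) if i not in dropped]
        runs.append(fit_weights(Xs, ys))
    n = len(X[0])
    means = [statistics.mean([r[j] for r in runs]) for j in range(n)]
    sds = [statistics.stdev([r[j] for r in runs]) for j in range(n)]
    return means, sds


def select_substitutions(X, means, sds, min_count=3, strict=True):
    """Pick substitutions seen at least `min_count` times with positive mean weight.

    strict=True additionally requires the mean to be at least one standard
    deviation above zero (the rule described for variant 3-9); strict=False
    requires only a positive mean (the rule described for variant 3-4).
    """
    chosen = []
    for j, (m, s) in enumerate(zip(means, sds)):
        count = sum(row[j] for row in X)
        if count >= min_count and m > 0 and (not strict or m >= s):
            chosen.append(j)
    return chosen
```

Calling `subsample_weights(X, y)` with the defaults mirrors the 1000 leave-5-out subsamples in the caption; the returned column indices from `select_substitutions` would then be mapped back to substitution names.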

**Figure 4**
**Activities of variants designed using substitution weights**. Activities towards N-Succinyl-Ala-Ala-Pro-Leu p-nitroanilide were measured at 37°C following a 5 minute heat treatment of the enzyme at 68°C. Activities are expressed relative to the mean activity of duplicates of wild-type proteinase K. Error bars represent one standard deviation of the activity measurements. Variants are grouped according to the machine learning algorithm used to calculate substitution weights (indicated below each group), and are compared with the best variants from the initial design set (variants 1–40 and 1–50, black bars, on the left). The first design (yellow bars, design method G in Additional file 2) of each group belongs to set 2. We included a substitution in the design if it occurred at least three times in the training set and its mean weight was at least one standard deviation above zero. All remaining designs in each group belong to set 3. The second in each group (green bars, design method J in Additional file 2) includes substitutions occurring at least three times and whose mean weights were merely positive (e.g. Figure 3A, red and blue circles). The third in each group (red bars, design method K in Additional file 2) contained all substitutions occurring at least three times and whose mean weight was at least one standard deviation above zero (e.g. Figure 3A, red circles). Note that this third design in each group is always better than the second. The last variant(s) in each group (blue bars, design method L in Additional file 2) were designed by modeling interdependent substitutions (e.g. Figure 3B, red circles).

**Figure 5**
**Machine learning design compared with random choices and "expert" designs**. Distributions of activities of 4 sets of variants designed using different methods are shown. Set A (white bars, variants 1–2, 1–6, 1–12, 1–13 and 1–34 to 1–49, total of 20 variants) contains arbitrarily selected combinations of 3, 5 or 6 substitutions. Set B (light shading, variants 1–50 to 1–59, total of 10 variants) was designed by manual analysis of the sequence and activity data from variants 1–1 through 1–49. Set C (dark shading, variants 2–1 to 2–20, total of 20 variants) was designed using machine learning algorithms based on the data from variants 1–1 through 1–59. Set D (black fill, variants 3–1 to 3–16, total of 16 variants) was designed using machine learning algorithms based on the data from variants 1–1 through 1–59 and 2–1 through 2–20.

**Figure 6**
**Increases in proteinase K activity with and without heating**. Proteinase K variants were tested from triplicate independent cultures for activity after heating at 68°C for different times: unheated (circles), 2.5 minutes (squares), 5 minutes (crosses), 7.5 minutes (triangles), 10 minutes (diamonds) and 15 minutes (open squares). A: absorbance at 405 nm of substrate incubated with wild type proteinase K, B: absorbance at 405 nm of substrate incubated with variant 3–9.

**Figure 7**
**Changes in activity and half-life in designed protein variants**. Activity (unheated) and half-life were calculated for 13 protein variants and wild-type proteinase K. The activity without heating was calculated from the initial slopes of the A₄₀₅ measurements without heating (white bars), examples shown in Figure 6. The half-life at 68°C (shaded bars) was calculated using the initial slopes after different heating times and fitting to an exponential curve. Error bars represent one standard deviation of the experimental measurements. The wild-type values are shown on the left. The substitutions of each variant are given in the column below the variant name. Only 10 of the 19 positions are shown. In the remaining 9 positions, all variants contained amino acids from the wild-type sequence.
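The half-life calculation mentioned in this caption can be illustrated with a short Python sketch. It assumes first-order inactivation, slope(t) = slope0 · exp(−k·t), and fits it by ordinary least squares on the logarithm of the initial slopes; the caption does not state which fitting procedure was actually used, and the function name `half_life` is hypothetical.

```python
import math


def half_life(heating_times_min, initial_slopes):
    """Estimate half-life from initial A405 slopes measured after heating.

    Assumes slope(t) = slope0 * exp(-k * t) and fits a straight line to
    ln(slope) versus heating time (log-linear least squares).
    Returns (half_life_in_minutes, fitted_slope0).
    """
    logs = [math.log(s) for s in initial_slopes]
    n = len(heating_times_min)
    t_mean = sum(heating_times_min) / n
    l_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in heating_times_min)
    sxy = sum((t - t_mean) * (l - l_mean)
              for t, l in zip(heating_times_min, logs))
    k = -sxy / sxx                       # decay constant, per minute
    slope0 = math.exp(l_mean + k * t_mean)  # fitted unheated activity
    return math.log(2) / k, slope0
```

As a usage example, passing the heating times from Figure 6 (0, 2.5, 5, 7.5, 10 and 15 minutes) together with the corresponding measured initial slopes would give one half-life estimate per variant. A log-linear fit weights low-activity points more heavily than a direct nonlinear fit would, so the two approaches can differ on noisy data.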

Cited by

Engineering genes for predictable protein expression.
Gustafsson C, Minshull J, Govindarajan S, Ness J, Villalobos A, Welch M. Protein Expr Purif. 2012 May;83(1):37-46. doi: 10.1016/j.pep.2012.02.013. Epub 2012 Mar 8. PMID: 22425659. Free PMC article. Review.
Selection of target-binding proteins from the information of weakly enriched phage display libraries by deep sequencing and machine learning.
Ito T, Nguyen TD, Saito Y, Kurumida Y, Nakazawa H, Kawada S, Nishi H, Tsuda K, Kameda T, Umetsu M. MAbs. 2023 Jan-Dec;15(1):2168470. doi: 10.1080/19420862.2023.2168470. PMID: 36683172. Free PMC article.
DisCoTune: versatile auxiliary plasmids for the production of disulphide-containing proteins and peptides in the E. coli T7 system.
Bertelsen AB, Hackney CM, Bayer CN, Kjelgaard LD, Rennig M, Christensen B, Sørensen ES, Safavi-Hemami H, Wulff T, Ellgaard L, Nørholm MHH. Microb Biotechnol. 2021 Nov;14(6):2566-2580. doi: 10.1111/1751-7915.13895. Epub 2021 Aug 18. PMID: 34405535. Free PMC article.
Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field.
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Front Bioeng Biotechnol. 2022 Jul 7;10:788300. doi: 10.3389/fbioe.2022.788300. eCollection 2022. PMID: 35875501. Free PMC article.
Inverse folding of protein complexes with a structure-informed language model enables unsupervised antibody evolution.
Shanker VR, Bruun TUJ, Hie BL, Kim PS. bioRxiv [Preprint]. 2023 Dec 21:2023.12.19.572475. doi: 10.1101/2023.12.19.572475. Update in: Science. 2024 Jul 5;385(6704):46-53. doi: 10.1126/science.adk8946. PMID: 38187780. Free PMC article. Updated. Preprint.
