Predicting novel substrates for enzymes with minimal experimental effort with active learning

Dante A Pertusi¹, Matthew E Moura¹, James G Jeffryes², Siddhant Prabhu¹, Bradley Walters Biggs¹, Keith E J Tyo³

Affiliations

¹ Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, United States.
² Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, United States; Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, United States.
³ Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, United States. Electronic address: k-tyo@northwestern.edu.

PMID: 29030274
PMCID: PMC7055960
DOI: 10.1016/j.ymben.2017.09.016

Dante A Pertusi et al. Metab Eng. 2017 Nov;44:171-181. doi: 10.1016/j.ymben.2017.09.016. Epub 2017 Oct 10.

Abstract

Enzymatic substrate promiscuity is more widespread than previously thought, with significant consequences for understanding metabolism and its application to biocatalysis. This realization has created a need for efficient characterization of enzyme promiscuity. Enzyme promiscuity is currently characterized with a limited number of human-selected compounds that may not be representative of the enzyme's versatility. While testing large numbers of compounds may be impractical, computational approaches can exploit existing data to determine the most informative substrates to test next, thereby more thoroughly exploring an enzyme's versatility. To demonstrate this, we used existing studies and tested compounds for four different enzymes, developed support vector machine (SVM) models using these datasets, and selected additional compounds for experiments using an active learning approach. SVMs trained on a chemically diverse set of compounds achieved maximum accuracies of ~80% using ~33% fewer compounds than datasets based on all compounds tested in existing studies. Compounds selected for testing by active learning resolved apparent conflicts in the existing training data while adding diversity to the dataset. Applying these algorithms to wide arrays of metabolic enzymes would yield a library of SVMs that can predict high-probability promiscuous enzymatic reactions and could prove a valuable resource for the design of novel metabolic pathways.
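The selection step described in the abstract can be illustrated with a minimal sketch. It assumes margin-based uncertainty sampling (test next the compound closest to the SVM decision surface), which is one common active learning strategy; the abstract does not say which strategy the authors used. The linear SVM trained by subgradient descent, the plain numeric feature vectors, and the names `train_linear_svm`, `decision`, `most_uncertain`, `active_learning`, and `oracle` are all hypothetical stand-ins, not the authors' models or chemical descriptors.

```python
"""Toy active learning loop: linear SVM + margin-based uncertainty sampling.

Illustrative assumptions only: features are plain numeric vectors and the
classifier is a simple linear SVM, not the models used in the paper.
"""


def decision(x, w, b):
    """Signed score of x; its magnitude grows with distance from the surface."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Fit (w, b) by subgradient descent on the L2-regularized hinge loss.

    X is a list of feature vectors; y holds labels +1 (active) / -1 (inactive).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * decision(xi, w, b)
            # Shrink the weights (regularization term of the subgradient).
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Inside the margin or misclassified: step toward the example.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def most_uncertain(pool, w, b, k=1):
    """Indices of the k untested compounds nearest the decision surface."""
    order = sorted(range(len(pool)), key=lambda i: abs(decision(pool[i], w, b)))
    return order[:k]


def active_learning(X_lab, y_lab, pool, oracle, rounds, batch=1):
    """Repeatedly train, pick the most uncertain compounds, and 'test' them.

    oracle(x) stands in for the wet-lab experiment and returns +1 or -1.
    """
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        w, b = train_linear_svm(X_lab, y_lab)
        picks = most_uncertain(pool, w, b, batch)
        # Pop from the highest index down so earlier indices stay valid.
        for i in sorted(picks, reverse=True):
            x = pool.pop(i)
            X_lab.append(x)
            y_lab.append(oracle(x))
    return train_linear_svm(X_lab, y_lab), X_lab, y_lab
```

Calling `active_learning` with a small labeled set, a pool of untested candidates, and a labeling function runs the train, select, and test cycle for the requested number of rounds. With a kernel SVM from a library, the same selection rule could rank candidates by the absolute value of the library's decision score instead of `decision`.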

Keywords: Active learning; Enzyme promiscuity; Machine learning.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

**Figure 1. Challenges in characterizing enzyme promiscuity.**
Existing datasets of active compounds (green squares) and inactive compounds (red triangles) describing substrate-level enzyme promiscuity often consist of (A) narrow distributions in chemical space or (B) a small number of compounds. (C) In the absence of negative data, randomly selected compounds used in its stead can be widely dispersed, leading to a high false positive rate when used to train SVMs due to high uncertainty in the position of the decision surface (dashed lines). (D) By comparison, confirmed inactive data near the decision surface allows for less uncertainty in calculating an optimal separating hyperplane.

**Figure 2. Representative reaction schemes for the enzymes analyzed in this study.**
Schemes for (A) MenD and (B) Car are conserved for each reaction known to be catalyzed by these enzymes. The reactions catalyzed by (C) AAEH and (D) HAPMO allow for more structural diversity. In particular, AAEH can also cleave at a peptide bond with similar local structure to the one in this figure, and HAPMO may oxygenate at sites in a cyclic aliphatic system adjacent to a carbonyl.

**Figure 3. Compounds selected by scientists are not necessarily diverse.**
t-SNE plots for (A) Car and (B) MenD. Within each set, there are multiple distinct portions of chemical space represented, yet the existing datasets do not capture the diversity inherent in biologically relevant chemical space. Active compounds in each set are represented by green +, inactive compounds by red −, and untested compounds by grey circles.

**Figure 4. Active learning improves model accuracy with significantly fewer compounds.**
Learning curves for SVM models of (A) MenD and (B) AAEH. In both cases, the maximum accuracy of the classifier is reached when selecting compounds using active learning. Error bars represent one standard deviation from the mean value of the accuracy score calculated across 1,000 iterations.
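The error bars in this caption (mean accuracy plus or minus one standard deviation over repeated iterations) could be computed along the lines of the sketch below. It assumes each iteration draws a random subset of compounds of a given size and scores a model trained on it; the function `learning_curve`, the callback `train_and_score`, and the use of the sample standard deviation are illustrative assumptions, since the caption does not give these details.

```python
import random
import statistics


def learning_curve(train_and_score, n_compounds, sizes, n_iter=1000, seed=0):
    """Mean and standard deviation of accuracy at each training-set size.

    train_and_score(indices) is a hypothetical callback that trains a model
    on the compounds at the given indices and returns its accuracy.
    Returns a list of (size, mean_accuracy, sd_accuracy) tuples.
    n_iter must be at least 2 for the sample standard deviation to exist.
    """
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    curve = []
    for size in sizes:
        accuracies = []
        for _ in range(n_iter):
            # Draw a random training subset without replacement.
            subset = rng.sample(range(n_compounds), size)
            accuracies.append(train_and_score(subset))
        curve.append(
            (size, statistics.mean(accuracies), statistics.stdev(accuracies))
        )
    return curve
```

Plotting the mean against the size, with the standard deviation as the error bar, gives a curve of the kind the caption describes. A curve for active learning would replace the random draw with the selection rule under study.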

**Figure 5. Active learning selects compounds that delineate how features impact activity.**
The MenD training set has large numbers of compounds that contain either an aldehyde group (pink) or a carbon-carbon double bond (blue), and a comparatively small number of compounds that contain both. ZINC aldehydes with carbon-carbon double bonds cannot be easily classified because there is insufficient training data to resolve the high correlation of aldehydes with active compounds and carbon-carbon double bonds with inactive compounds. Numbers indicate the number of compounds in each group (aldehyde, C-C double bond, or both).

Grants and funding

T32 GM008449/GM/NIGMS NIH HHS/United States
