Accurate Models of Substrate Preferences of Post-Translational Modification Enzymes from a Combination of mRNA Display and Deep Learning

Alexander A Vinogradov¹, Jun Shi Chang¹, Hiroyasu Onaka²,³, Yuki Goto¹, Hiroaki Suga¹

Affiliations

¹ Department of Chemistry, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan.
² Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan.
³ Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan.

PMID: 35756369
PMCID: PMC9228559
DOI: 10.1021/acscentsci.2c00223

Alexander A Vinogradov et al. ACS Cent Sci. 2022 Jun 22;8(6):814-824. doi: 10.1021/acscentsci.2c00223. Epub 2022 May 26.

Abstract

Promiscuous post-translational modification (PTM) enzymes often display nonobvious substrate preferences by acting on diverse yet well-defined sets of peptides and/or proteins. Understanding of substrate fitness landscapes for PTM enzymes is important in many areas of contemporary science, including natural product biosynthesis, molecular biology, and biotechnology. Here, we report an integrated platform for accurate profiling of substrate preferences for PTM enzymes. The platform features (i) a combination of mRNA display with next-generation sequencing as an ultrahigh throughput technique for data acquisition and (ii) deep learning for data analysis. The high accuracy (>0.99 in each of two studies) of the resulting deep learning models enables comprehensive analysis of enzymatic substrate preferences. The models can quantify fitness across sequence space, map modification sites, and identify important amino acids in the substrate. To benchmark the platform, we performed profiling of a Ser dehydratase (LazBF) and a Cys/Ser cyclodehydratase (LazDEF), two enzymes from the lactazole biosynthesis pathway. In both studies, our results point to complex enzymatic preferences, which, particularly for LazBF, cannot be reduced to a set of simple rules. The ability of the constructed models to dissect such complexity suggests that the developed platform can facilitate a wider study of PTM enzymes.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

**Figure 1**
An overview of the workflow for the profiling of LazBF substrate preferences. (a) Chemical reaction catalyzed by LazBF. (b) Schematic overview of mRNA display-based selection/antiselection setups. For the full protocol, see Supporting Information 2.3. Ⓟ refers to the puromycin linker used to display the peptides onto cognate mRNAs. Both selection and antiselection assays can be repeated several times to produce libraries of progressively increasing (or decreasing) substrate fitness. (c) Schematic overview of the data analysis pipeline. NGS selection and antiselection data sets are parsed, preprocessed, and labeled. Peptides are represented as positionally encoded matrices of ECFPs, and a supervised CNN classifier is trained on the resulting data to produce models of LazBF substrate preferences. For a complete description of the data analysis pipeline, see Supporting Information 2.5.
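
The labeling and encoding steps of panel (c) can be illustrated with a minimal sketch. It assumes that peptides from the selection data set are labeled 1 and peptides from the antiselection data set are labeled 0, and that sequences found in both sets are discarded as ambiguous; neither rule is confirmed by the caption. For simplicity it swaps the ECFP-based representation for a plain positional one-hot matrix, which is a different and cruder encoding than the one the authors describe. The names `label_peptides` and `one_hot` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic amino acids


def label_peptides(selection, antiselection):
    """Label selection peptides 1 and antiselection peptides 0.

    ASSUMPTION: peptides present in both data sets are dropped.
    """
    sel, anti = set(selection), set(antiselection)
    ambiguous = sel & anti
    labeled = [(pep, 1) for pep in sorted(sel - ambiguous)]
    labeled += [(pep, 0) for pep in sorted(anti - ambiguous)]
    return labeled


def one_hot(peptide):
    """Encode a peptide as a (length x 20) positional one-hot matrix.

    This stands in for the ECFP matrices named in the caption.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = []
    for aa in peptide:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        matrix.append(row)
    return matrix


dataset = label_peptides(["SAK", "GGG"], ["GGG", "AAA"])
features = [one_hot(pep) for pep, _ in dataset]
```

Each row of the matrix corresponds to one position in the peptide, so a classifier with convolutions along the first axis sees local sequence context.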

**Figure 2**
mRNA display profiling of LazBF leads to enriched peptide populations suitable for deep learning applications. (a, b) Summary of the selection (a) and antiselection (b) experiments. Plotted are respective DNA recovery and enrichment values measured by qPCR after every round of mRNA display. (c) Data set convergence at the amino acid level as measured by log₂Y* scores. Amino acid aa in position *pos* is enriched in the selection data set compared to the antiselection one if log₂Y*_{aa,pos} is greater than 0. See also the definitions in the figure header and Supporting Information 2.1; c_{aa,pos} is the number of NGS reads with amino acid aa in position *pos* in a data set. (d) CNN classifier accuracy as a function of the number of mRNA display rounds. The models were trained on 4.75 × 10⁵ samples from the respective data sets, using 0.25 × 10⁵ unseen samples for validation. Multiple rounds of mRNA display lead to cleaner data sets and, hence, more accurate models. (e) CNN classifier accuracy as a function of the training data set size. The models were trained on round 6 data. Model accuracy scales with the size of the training data set. (f) Validation of model predictions against experimental data. 65 validation peptides (bVP1–65; all encoded in library 5S5; see also Table S4) were expressed by the FIT system and treated with LazBF/GluRS/tRNA^Glu for 2 h. Reaction outcomes were analyzed by LC-MS as described in Supporting Information 2.8. Model predictions showed good agreement with the experiment.
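
The enrichment comparison in panel (c) can be sketched as follows. The exact definition of Y* is given in the figure header and Supporting Information 2.1, which are not reproduced here, so this sketch assumes a simple log₂ ratio of positional amino acid frequencies between the two data sets, with a pseudocount to avoid division by zero. The function names and the pseudocount are assumptions for illustration, and all peptides are assumed to have the same length.

```python
import math
from collections import Counter


def positional_counts(peptides):
    """Count occurrences of each amino acid at each position (c_aa,pos)."""
    length = len(peptides[0])
    counts = [Counter() for _ in range(length)]
    for pep in peptides:
        for pos, aa in enumerate(pep):
            counts[pos][aa] += 1
    return counts


def log2_enrichment(selection, antiselection, pseudocount=1.0):
    """Return {(aa, pos): log2 of the selection/antiselection frequency ratio}.

    ASSUMPTION: a pseudocount-smoothed frequency ratio; the paper's Y*
    score may be defined differently.
    """
    sel = positional_counts(selection)
    anti = positional_counts(antiselection)
    n_sel, n_anti = len(selection), len(antiselection)
    scores = {}
    for pos in range(len(sel)):
        for aa in set(sel[pos]) | set(anti[pos]):
            f_sel = (sel[pos][aa] + pseudocount) / (n_sel + pseudocount)
            f_anti = (anti[pos][aa] + pseudocount) / (n_anti + pseudocount)
            scores[(aa, pos)] = math.log2(f_sel / f_anti)
    return scores


scores = log2_enrichment(["SAK", "SAR", "SGK"], ["GAK", "GAR", "SGK"])
```

In this toy input, Ser at position 0 gets a positive score (enriched under selection), Gly at position 0 a negative one, and Ala at position 1 a score of zero.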

**Figure 3**
Model enables high-level analysis of LazBF substrate fitness landscapes. (a) Experimentally measured modification efficiencies of validation peptides (bVP1–65; Table S4) as a function of their S scores. S scores cannot be used to reliably predict fitness of bVP peptides. (b) Distribution of model predictions in the S-space. Substrate fitness of 5 × 10⁶ random 5S5 peptides was evaluated with the model. Plotted are binned statistics of model predictions in the S-space. The overall distribution of the peptides in the same space is displayed for reference. The analysis reveals that, at best, S scores can be reliably used as antideterminants of substrate fitness (when S < −5). (c) Pairwise epistasis between variable positions in the CP of 5S5 peptides. The model was utilized to compute abs(*epi*) scores using predictions for the 5 × 10⁶ sequences from panel (b). The resulting values can be used to estimate how strongly amino acids in the substrate affect each other’s fitness. Higher values correspond to stronger second-order effects. See Supporting Information 2.1 for computation details. (d) Analysis of epistatic interactions in bVP33. Average model calls were computed for 2 × 10⁴ partially random in silico generated peptides in each case; “x” denotes any amino acid except Ser. (e) Visualization of all pairwise epistatic interactions in bVP33. Strong epistasis inside the His4-Pro5-Ser6-Arg7-Trp8 motif contributes to the high fitness of the peptide.
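
The idea of a second-order (pairwise) effect in panels (c) to (e) can be illustrated with the standard double-mutant cycle. This is not the abs(*epi*) computation from Supporting Information 2.1, which averages model predictions over millions of sequences; it is a minimal sketch of the underlying quantity for a single pair of substitutions. The `score` argument stands in for any fitness predictor, and the two toy scoring functions are hypothetical.

```python
def mutate(peptide, pos, aa):
    """Return a copy of the peptide with position `pos` replaced by `aa`."""
    return peptide[:pos] + aa + peptide[pos + 1:]


def pairwise_epistasis(score, wild_type, pos_i, aa_i, pos_j, aa_j):
    """Double-mutant cycle: deviation of the double mutant from additivity.

    Zero means the two substitutions contribute independently; a nonzero
    value means they affect each other's contribution to fitness.
    """
    f_wt = score(wild_type)
    f_i = score(mutate(wild_type, pos_i, aa_i))
    f_j = score(mutate(wild_type, pos_j, aa_j))
    f_ij = score(mutate(mutate(wild_type, pos_i, aa_i), pos_j, aa_j))
    return f_ij - f_i - f_j + f_wt


def additive_score(peptide):
    """Toy predictor in which every Ser adds the same amount of fitness."""
    return 0.1 * peptide.count("S")


def coupled_score(peptide):
    """Toy predictor that rewards only the His-Pro pair at positions 0 and 1."""
    return 1.0 if peptide[0] == "H" and peptide[1] == "P" else 0.0


epi_additive = pairwise_epistasis(additive_score, "AAAA", 0, "S", 1, "S")
epi_coupled = pairwise_epistasis(coupled_score, "AASA", 0, "H", 1, "P")
```

The additive predictor gives zero epistasis, while the coupled predictor gives a value of 1, since neither single substitution changes the score on its own.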

**Figure 4**
Model-guided dissection of the substrate preferences of LazBF. (a) LC-MS analysis of bVP37 dehydration by LazBF [a broad extracted ion chromatogram (^brEIC) and a composite MS spectrum integrated over substrate-derived peaks showing the overall product distribution; see Supporting Information 2.8 for LC-MS details]. (b) Atom- and bond-wise accumulated IG attributions for bVP37. The model suggests that Ser10 is the primary determinant of the high modification efficiency. (c) A zoomed-in section of a charge-deconvoluted CID fragmentation spectrum for singly dehydrated bVP37; y-ion assignments and neutral molecule losses are omitted for clarity. The spectrum allows unambiguous assignment of the dehydration site to Ser10, consistent with the model’s suggestion. See Figures S10–12 for more examples. (d) Amino acid-wise IGs provide an intuition for relative amino acid contributions to the total substrate fitness. Experimentally measured increase in modification efficiency for three single-point mutants of bVP32, 36, and 58 underscores the model’s ability to identify amino acids critical for LazBF-mediated dehydration. See Figure S13 for more examples. (e) Substrate space traversal study for bVP29 (see also the accompanying text). The model was employed to find a sequence of bVP29 mutants which alter the substrate fitness at each step. The route identified by the model was validated experimentally. Collectively, this study points to the complex and unintuitive substrate preferences of LazBF.
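
The integrated gradients (IG) attributions in panels (b) and (d) follow a general technique that can be sketched independently of the authors' CNN. The example below approximates the IG path integral with a midpoint Riemann sum and uses numerical gradients, so it needs no deep learning library. The toy model, the all-zero baseline, and the step count are assumptions for illustration and are not taken from the paper.

```python
import math


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Approximate IG attributions of f at x relative to a baseline input."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_gradient(f, point)):
            avg_grad[i] += g / steps
    # Scale the path-averaged gradient by the input difference
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_model(x):
    """Hypothetical differentiable scorer with one interaction term."""
    z = 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]
    return 1.0 / (1.0 + math.exp(-z))


x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum approximately to `toy_model(x) - toy_model(baseline)`, which is what makes per-feature (or per-residue) contributions comparable.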

**Figure 5**
Substrate specificity profiling for LazDEF. (a) Chemical reactions catalyzed by LazDEF. (b) Design of the LazDEF substrate library, library 6C6. (c) Summary of the selection and antiselection experiments. Plotted are respective DNA recovery and enrichment values measured by qPCR after every round of mRNA display. (d) CNN classifier accuracy as a function of training data set size. The models were trained on round 5 data. (e) Validation of model predictions against experimental data. A total of 64 validation peptides (dVP1–64; Table S5) were expressed by the FIT system and treated with LazDEF for 5 h. Reaction outcomes were analyzed by LC-MS as described in Supporting Information 2.8. Model predictions show good agreement with the experiment. (f) Pairwise epistasis between variable positions in the CP of 6C6 peptides. The model was utilized to compute abs(*epi*) scores using predictions for the 5 × 10⁶ sequences from panel (h). The resulting values can be used to estimate how strongly amino acids in the substrate affect each other’s fitness. Higher values correspond to stronger second-order effects. Compared to the results for LazBF, LazDEF substrates are characterized by weaker pairwise epistatic interactions, which helps explain the results in panels (g) and (h). See Supporting Information 2.1 for computation details. (g) Experimentally measured modification efficiencies of validation peptides as a function of their S scores. Compared to the LazBF results (Figure 3a), the S scores for LazDEF substrates prove more informative. (h) Distribution of model predictions in the S-space. Substrate fitness of 5 × 10⁶ random 6C6 peptides was evaluated with the model. Plotted are binned statistics of model predictions in the S-space. The overall distribution of the peptides in the same space is displayed for reference. In the interval [−3, 2], which accounts for 46% of the total peptide space, S scores are an unreliable metric of substrate fitness.
