ACS Cent Sci. 2022 Jun 22;8(6):814-824.
doi: 10.1021/acscentsci.2c00223. Epub 2022 May 26.

Accurate Models of Substrate Preferences of Post-Translational Modification Enzymes from a Combination of mRNA Display and Deep Learning

Alexander A Vinogradov et al. ACS Cent Sci. 2022.

Abstract

Promiscuous post-translational modification (PTM) enzymes often display nonobvious substrate preferences by acting on diverse yet well-defined sets of peptides and/or proteins. Understanding of substrate fitness landscapes for PTM enzymes is important in many areas of contemporary science, including natural product biosynthesis, molecular biology, and biotechnology. Here, we report an integrated platform for accurate profiling of substrate preferences for PTM enzymes. The platform features (i) a combination of mRNA display with next-generation sequencing as an ultrahigh throughput technique for data acquisition and (ii) deep learning for data analysis. The high accuracy (>0.99 in each of two studies) of the resulting deep learning models enables comprehensive analysis of enzymatic substrate preferences. The models can quantify fitness across sequence space, map modification sites, and identify important amino acids in the substrate. To benchmark the platform, we performed profiling of a Ser dehydratase (LazBF) and a Cys/Ser cyclodehydratase (LazDEF), two enzymes from the lactazole biosynthesis pathway. In both studies, our results point to complex enzymatic preferences, which, particularly for LazBF, cannot be reduced to a set of simple rules. The ability of the constructed models to dissect such complexity suggests that the developed platform can facilitate a wider study of PTM enzymes.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
An overview of the workflow for the profiling of LazBF substrate preferences. (a) Chemical reaction catalyzed by LazBF. (b) Schematic overview of mRNA display-based selection/antiselection setups. For the full protocol, see Supporting Information 2.3. Ⓟ refers to the puromycin linker used to display the peptides onto cognate mRNAs. Both selection and antiselection assays can be repeated several times to produce libraries of progressively increasing (or decreasing) substrate fitness. (c) Schematic overview of the data analysis pipeline. NGS selection and antiselection data sets are parsed, preprocessed, and labeled. Peptides are represented as positionally encoded matrices of ECFPs, and a supervised CNN classifier is trained on the resulting data to produce models of LazBF substrate preferences. For a complete description of the data analysis pipeline, see Supporting Information 2.5.
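The idea of a positionally encoded peptide matrix in panel (c) can be illustrated with a minimal sketch. The caption states that each residue is represented by an ECFP; the example below swaps in plain one-hot vectors as a simpler stand-in, so it is not the authors' featurization. The names `AMINO_ACIDS`, `AA_INDEX`, and `one_hot_encode` are hypothetical, and the example peptide is simply the His-Pro-Ser-Arg-Trp motif mentioned in the Figure 3 caption.

```python
# Assumption: one-hot rows stand in for the per-residue ECFP vectors
# described in the caption; only the row-per-position layout is the same.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide):
    """Return a (length x 20) matrix with one row per sequence position."""
    matrix = []
    for aa in peptide:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1  # mark the residue found at this position
        matrix.append(row)
    return matrix


# Toy input: the HPSRW motif, used here only as an example string.
encoded = one_hot_encode("HPSRW")
```

Because row order follows sequence order, a convolutional classifier applied to such a matrix sees each residue together with its neighbours, which is the property the positional encoding is meant to preserve.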
Figure 2
mRNA display profiling of LazBF leads to enriched peptide populations suitable for deep learning applications. (a, b) Summary of the selection (a) and antiselection (b) experiments. Plotted are respective DNA recovery and enrichment values measured by qPCR after every round of mRNA display. (c) Data set convergence at the amino acid level as measured by log₂Y* scores. Amino acid aa in position pos is enriched in the selection data set compared to the antiselection one if log₂Y*(aa, pos) is greater than 0. See also the definitions in the figure header and Supporting Information 2.1; c(aa, pos) is the number of NGS reads with amino acid aa in position pos in a data set. (d) CNN classifier accuracy as a function of the number of mRNA display rounds. The models were trained on 4.75 × 10⁵ samples from the respective data sets, using 0.25 × 10⁵ unseen samples for validation. Multiple rounds of mRNA display lead to cleaner data sets and, hence, more accurate models. (e) CNN classifier accuracy as a function of the training data set size. The models were trained on round 6 data. Model accuracy scales with the size of the training data set. (f) Validation of model predictions against experimental data. 65 validation peptides (bVP1–65; all encoded in library 5S5; see also Table S4) were expressed by the FIT system and treated with LazBF/GluRS/tRNAGlu for 2 h. Reaction outcomes were analyzed by LC-MS as described in Supporting Information 2.8. Model predictions showed good agreement with the experiment.
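The per-position enrichment comparison in panel (c) can be sketched as follows. The exact definition of Y* is given in the figure header and Supporting Information 2.1, neither of which is reproduced here, so this example assumes a plain log2 ratio of per-position amino acid frequencies with a pseudocount. The function names, the `pseudocount` parameter, and the toy read lists are all hypothetical.

```python
import math
from collections import Counter


def position_counts(peptides):
    """Count reads carrying amino acid aa at position pos."""
    counts = Counter()
    for peptide in peptides:
        for pos, aa in enumerate(peptide):
            counts[(aa, pos)] += 1
    return counts


def log2_enrichment(selection, antiselection, pseudocount=1.0):
    """Assumed score: log2 of selection vs antiselection frequency.

    This is a generic frequency ratio, not the paper's Y* definition.
    """
    sel = position_counts(selection)
    anti = position_counts(antiselection)
    n_sel, n_anti = len(selection), len(antiselection)
    scores = {}
    for key in set(sel) | set(anti):
        # The pseudocount keeps the ratio finite when a count is zero.
        f_sel = (sel[key] + pseudocount) / (n_sel + pseudocount)
        f_anti = (anti[key] + pseudocount) / (n_anti + pseudocount)
        scores[key] = math.log2(f_sel / f_anti)
    return scores


# Toy reads, invented for illustration only.
selection_reads = ["SA", "SA", "TA"]
antiselection_reads = ["TA", "TA", "SA"]
scores = log2_enrichment(selection_reads, antiselection_reads)
```

Under this assumed definition, a positive score for `("S", 0)` means Ser at the first position is more frequent among selected reads than among antiselected ones, which matches the sign convention stated in the caption.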
Figure 3
Model enables high-level analysis of LazBF substrate fitness landscapes. (a) Experimentally measured modification efficiencies of validation peptides (bVP1–65; Table S4) as a function of their S scores. S scores cannot be used to reliably predict fitness of bVP peptides. (b) Distribution of model predictions in the S-space. Substrate fitness of 5 × 10⁶ random 5S5 peptides was evaluated with the model. Plotted are binned statistics of model predictions in the S-space. The overall distribution of the peptides in the same space is displayed for reference. The analysis reveals that, at best, S scores can be reliably used as antideterminants of substrate fitness (when S < −5). (c) Pairwise epistasis between variable positions in the CP of 5S5 peptides. The model was utilized to compute abs(epi) scores using predictions for 5 × 10⁶ sequences from panel (b). The resulting values can be used to estimate how strongly amino acids in the substrate affect each other’s fitness. Higher values correspond to stronger second-order effects. See Supporting Information 2.1 for computation details. (d) Analysis of epistatic interactions in bVP33. Average model calls were computed for 2 × 10⁴ partially random in silico generated peptides in each case; “x” denotes any amino acid except Ser. (e) Visualization of all pairwise epistatic interactions in bVP33. Strong epistasis inside the His4-Pro5-Ser6-Arg7-Trp8 motif contributes to the high fitness of the peptide.
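The notion of a second-order effect in panel (c) can be made concrete with a small sketch. The paper's abs(epi) computation is described in Supporting Information 2.1, which is not reproduced here; the example below instead assumes the textbook double-mutant-cycle form, epi = f(double) − f(single i) − f(single j) + f(reference), applied to two invented fitness functions. All function names and toy sequences are hypothetical.

```python
def mutate(peptide, pos, aa):
    """Return a copy of peptide with position pos replaced by aa."""
    return peptide[:pos] + aa + peptide[pos + 1:]


def pairwise_epistasis(fitness, reference, pos_i, aa_i, pos_j, aa_j):
    """Assumed double-mutant-cycle epistasis; zero if effects are additive."""
    single_i = mutate(reference, pos_i, aa_i)
    single_j = mutate(reference, pos_j, aa_j)
    double = mutate(single_i, pos_j, aa_j)
    return (fitness(double) - fitness(single_i)
            - fitness(single_j) + fitness(reference))


def additive_fitness(peptide):
    """Toy landscape: every Ser contributes independently."""
    return float(peptide.count("S"))


def cooperative_fitness(peptide):
    """Toy landscape: a Pro-Ser pair adds a bonus beyond the additive part."""
    return additive_fitness(peptide) + (2.0 if "PS" in peptide else 0.0)


epi_additive = pairwise_epistasis(additive_fitness, "AAAA", 1, "S", 2, "S")
epi_cooperative = pairwise_epistasis(cooperative_fitness, "AAAA", 1, "P", 2, "S")
```

In this toy setting the additive landscape gives an epistasis of zero, while the cooperative one gives a nonzero value equal to the pair bonus. In practice `fitness` would be replaced by a trained model's prediction, and taking the absolute value and averaging over many sequences would give a per-position-pair summary in the spirit of the abs(epi) scores.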
Figure 4
Model-guided dissection of the substrate preferences of LazBF. (a) LC-MS analysis of bVP37 dehydration by LazBF [a broad extracted ion chromatogram (brEIC) and a composite MS spectrum integrated over substrate-derived peaks showing the overall product distribution; see Supporting Information 2.8 for LC-MS details]. (b) Atom- and bond-wise accumulated IG attributions for bVP37. The model suggests that Ser10 is the primary determinant of the high modification efficiency. (c) A zoomed-in section of a charge-deconvoluted CID fragmentation spectrum for singly dehydrated bVP37; y-ion assignments and neutral molecule losses are omitted for clarity. The spectrum allows unambiguous assignment of the dehydration site to Ser10, consistent with the model’s suggestion. See Figures S10–12 for more examples. (d) Amino acid-wise IGs provide an intuition for relative amino acid contributions to the total substrate fitness. Experimentally measured increase in modification efficiency for three single-point mutants of bVP32, 36, and 58 underscores the model’s ability to identify amino acids critical for LazBF-mediated dehydration. See Figure S13 for more examples. (e) Substrate space traversal study for bVP29 (see also the accompanying text). The model was employed to find a sequence of bVP29 mutants which alter the substrate fitness at each step. The route identified by the model was validated experimentally. Collectively, this study points to the complex and unintuitive substrate preferences of LazBF.
Figure 5
Substrate specificity profiling for LazDEF. (a) Chemical reactions catalyzed by LazDEF. (b) Design of the LazDEF substrate library, library 6C6. (c) Summary of the selection and antiselection experiments. Plotted are respective DNA recovery and enrichment values measured by qPCR after every round of mRNA display. (d) CNN classifier accuracy as a function of training data set size. The models were trained on round 5 data. (e) Validation of model predictions against experimental data. A total of 64 validation peptides (dVP1–64; Table S5) were expressed by the FIT system and treated with LazDEF for 5 h. Reaction outcomes were analyzed by LC-MS as described in Supporting Information 2.8. Model predictions show good agreement with the experiment. (f) Pairwise epistasis between variable positions in the CP of 6C6 peptides. The model was utilized to compute abs(epi) scores using predictions for 5 × 10⁶ sequences from panel (h). The resulting values can be used to estimate how strongly amino acids in the substrate affect each other’s fitness. Higher values correspond to stronger second-order effects. Compared to the results for LazBF, LazDEF substrates are characterized by weaker pairwise epistatic interactions, which aids in explaining the results in panels (g) and (h). See Supporting Information 2.1 for computation details. (g) Experimentally measured modification efficiencies of validation peptides as a function of their S scores. Compared to the LazBF results (Figure 3a), the S scores for LazDEF substrates prove more informative. (h) Distribution of model predictions in the S-space. Substrate fitness of 5 × 10⁶ random 6C6 peptides was evaluated with the model. Plotted are binned statistics of model predictions in the S-space. The overall distribution of the peptides in the same space is displayed for reference. In the interval [−3, 2], which accounts for 46% of the total peptide space, S scores are an unreliable metric of substrate fitness.
