[Preprint]. 2024 Feb 23:arXiv:2402.15181v1.

Substrate Prediction for RiPP Biosynthetic Enzymes via Masked Language Modeling and Transfer Learning

Joseph D Clark et al. arXiv.

Abstract

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting such peptide fitness landscapes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream classification models of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved the performance of classification models of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations within the same biosynthetic pathway. Our transfer learning method improved performance and data efficiency in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries compatible with desired RiPP biosynthetic pathways.
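
The sketch below illustrates the masked language modeling step on a peptide library with Hugging Face transformers. It is a minimal illustration under stated assumptions, not the authors' code: the ESM-2 checkpoint size, the example peptides, and the hyperparameters are all placeholders.

    # Minimal sketch: continue masked language modeling of an ESM-2
    # checkpoint on a peptide library (cf. the LazBF/LazDEF substrate
    # sets). Checkpoint, peptides, and hyperparameters are illustrative.
    from datasets import Dataset
    from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                              EsmForMaskedLM, Trainer, TrainingArguments)

    name = "facebook/esm2_t12_35M_UR50D"       # any ESM-2 checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = EsmForMaskedLM.from_pretrained(name)

    peptides = ["SSWCTAWHSCS", "AHSPWCCSTWS"]  # placeholder sequences
    ds = Dataset.from_dict({"text": peptides}).map(
        lambda ex: tokenizer(ex["text"]), batched=True, remove_columns=["text"])

    # 15% random masking: the standard BERT-style MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    Trainer(model=model,
            args=TrainingArguments(output_dir="lazbf-esm", num_train_epochs=1),
            train_dataset=ds, data_collator=collator).train()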

Figures

Figure 1:
a) The generic biosynthesis pathway of RiPPs. RiPP precursor peptides contain a leader peptide and a core peptide. After post-translational modifications in the core peptide, the leader peptide is cleaved. b) The lactazole biosynthetic gene cluster contains six proteins. LazA is the precursor peptide. LazB (tRNA-dependent glutamylation enzyme) and the eliminase domain of LazF form a serine dehydratase while LazD (RRE-containing E1-like protein), LazE (YcaO cyclodehydratase), and the dehydrogenase domain of LazF comprise a thiazole synthetase. LazC is a pyridine synthase. c) Serine dehydration catalyzed by LazBF. d) Thiazole formation catalyzed by LazDEF.
Figure 2:
A schematic representation of the workflow for masked language modeling of LazBF and LazDEF substrate preferences. a) LazBF and LazDEF substrate/non-substrate embeddings were extracted from the protein language model ESM-2 (Vanilla-ESM), and the baseline performance of downstream classification models was assessed. b) Three copies of Vanilla-ESM were independently trained through masked language modeling of three peptide data sets; embeddings were extracted and the performance of downstream classification models was compared to baseline. c) Models were further trained to directly classify LazBF/DEF substrates, and the models' predictions were analyzed with interpretable machine learning techniques, including attention analysis (see Methods).
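
A minimal sketch of step a) under stated assumptions: mean-pooled final-layer embeddings and a logistic-regression head are illustrative choices (the paper's downstream classifiers may differ), and the peptides and labels are placeholders.

    # Sketch of step a): extract ESM-2 embeddings, then fit a downstream
    # classifier. Pooling strategy and classifier head are assumptions.
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoTokenizer, EsmModel

    name = "facebook/esm2_t12_35M_UR50D"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = EsmModel.from_pretrained(name).eval()

    def embed(seqs):
        # Mean-pool the last hidden layer over non-padding positions.
        batch = tokenizer(seqs, return_tensors="pt", padding=True)
        with torch.no_grad():
            h = model(**batch).last_hidden_state      # (n, length, dim)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((h * mask).sum(1) / mask.sum(1)).numpy()

    X = embed(["SSWCTAWHSCS", "AHSPWCCSTWS"])  # placeholder peptides
    y = [1, 0]                                 # 1 = substrate, 0 = non-substrate
    clf = LogisticRegression().fit(X, y)
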
Figure 3:
A schematic representation of our data preprocessing pipeline. a) LazA core sequences (n = 1.3 million) were selected from library 5S5, and a held-out data set of 50,000 peptides was set aside for downstream model training and evaluation. b) LazA core sequences (n = 1.3 million) were selected from library 6C6, and a held-out data set of 50,000 peptides was likewise set aside.
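
A minimal sketch of the held-out split, with a hypothetical stand-in for the sequence library:

    # Sketch of the Figure 3 split: reserve 50,000 peptides as a held-out
    # set. core_sequences stands in for the 1.3M-peptide library.
    from sklearn.model_selection import train_test_split

    core_sequences = [f"PEPTIDE{i}" for i in range(1_300_000)]  # placeholder
    mlm_pool, held_out = train_test_split(
        core_sequences, test_size=50_000, random_state=0)
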
Figure 4:
Accuracy of LazDEF substrate classification models trained on embeddings from Vanilla-ESM (green), ESM trained on a subset of PeptideAtlas (orange), ESM trained on LazBF substrates/non-substrates (blue), and ESM trained on LazDEF substrates/non-substrates (pink) in the a) high-N condition, b) medium-N condition, and c) low-N condition. A star indicates the top-performing model for each set of embeddings.
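
The N-condition comparison amounts to training the same classifier on nested subsets of the embedded held-out data. The sketch below uses synthetic stand-in arrays and assumed subset sizes; in practice the features would come from the embed() helper sketched under Figure 2.

    # Sketch of the high/medium/low-N comparison on synthetic stand-ins;
    # subset sizes are assumptions, not the paper's exact N values.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(10_000, 480)), rng.integers(0, 2, 10_000)
    X_test, y_test = rng.normal(size=(2_000, 480)), rng.integers(0, 2, 2_000)

    for n in (10_000, 1_000, 100):             # assumed high/medium/low N
        clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        print(n, accuracy_score(y_test, clf.predict(X_test)))
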
Figure 5:
t-SNE visualization of the LazDEF embedding space for a) Vanilla-ESM, b) ESM trained on LazBF substrates/non-substrates, and c) ESM trained on LazDEF substrates/non-substrates. t-SNE visualization of the LazBF embedding space for d) Vanilla-ESM, e) ESM trained on LazDEF substrates/non-substrates, and f) ESM trained on LazBF substrates/non-substrates. Substrates are shown in red and non-substrates in blue.
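
The panels can be reproduced in outline from any matrix of per-peptide embeddings; the sketch below uses random stand-in embeddings, with real features expected from the embed() helper sketched under Figure 2.

    # Sketch of the t-SNE panels: project embeddings to 2-D and color by
    # label (red = substrate, blue = non-substrate), as in the figure.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 480))            # stand-in embeddings
    y = rng.integers(0, 2, 500)                # stand-in labels

    coords = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(coords[:, 0], coords[:, 1], s=4,
                c=["red" if label else "blue" for label in y])
    plt.show()
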
Figure 6:
Accuracy of LazBF substrate classification models trained on embeddings from Vanilla-ESM (green), ESM trained on a subset of PeptideAtlas (orange), ESM trained on LazBF substrates/non-substrates (blue), and ESM trained on LazDEF substrates/non-substrates (pink) in the a) high-N condition, b) medium-N condition, and c) low-N condition. A star indicates the top-performing model for each set of embeddings.
Figure 7:
Fine-tuned LazBF-ESM and fine-tuned LazDEF-ESM produce correlated integrated gradients for LazBF substrates/non-substrates. a) The average contribution of each position to substrate fitness shows a Spearman correlation of 0.81 between the two models. b) The average contribution of each amino acid to substrate fitness shows a Spearman correlation of 0.80 between the two models.
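
Per-residue attributions of this kind can be computed with integrated gradients over the model's embedding layer, for example via Captum; the fine-tuned checkpoint path below is hypothetical.

    # Sketch: integrated gradients over the ESM embedding layer with
    # Captum. "lazbf-esm-finetuned" is a hypothetical local checkpoint.
    import torch
    from captum.attr import LayerIntegratedGradients
    from transformers import AutoTokenizer, EsmForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
    model = EsmForSequenceClassification.from_pretrained(
        "lazbf-esm-finetuned").eval()

    def substrate_logit(input_ids):
        return model(input_ids).logits[:, 1]   # score for the substrate class

    ids = tokenizer("SSWCTAWHSCS", return_tensors="pt")["input_ids"]
    lig = LayerIntegratedGradients(substrate_logit, model.esm.embeddings)
    attr = lig.attribute(
        ids, baselines=torch.full_like(ids, tokenizer.pad_token_id))
    per_residue = attr.sum(-1)                 # one attribution per position
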
Figure 8:
Attention maps from the fine-tuned LazBF-ESM. The [BOS] and [EOS] tokens mark the beginning and end of the sequence, respectively. a) Middle and later layers focus on specific residues and motifs. b) Attention heads from the penultimate layer highlight a motif with high pairwise epi-scores in a LazBF substrate. c) Attention heads from the final layer highlight a residue important for substrate fitness in a LazDEF substrate.
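
Attention maps like these can be read directly from the fine-tuned model by requesting attentions at inference time; the checkpoint path is again hypothetical.

    # Sketch: extract per-layer attention maps (BOS/EOS tokens included).
    # "lazbf-esm-finetuned" is a hypothetical local checkpoint.
    import torch
    from transformers import AutoTokenizer, EsmForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
    model = EsmForSequenceClassification.from_pretrained(
        "lazbf-esm-finetuned").eval()

    batch = tokenizer("SSWCTAWHSCS", return_tensors="pt")
    with torch.no_grad():
        out = model(**batch, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    penultimate = out.attentions[-2][0]        # maps for one peptide
    print(penultimate.mean(0))                 # head-averaged attention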
