Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning

Joseph D Clark et al.

Digit Discov. 2024 Dec 2;4(2):343-354. doi: 10.1039/d4dd00170b. eCollection 2025 Feb 12.

Abstract

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
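
The workflow summarized above, continued masked language modeling (MLM) of a protein language model on a peptide substrate library followed by reuse of the adapted embeddings for downstream substrate prediction, can be sketched as follows. This is a minimal illustration rather than the authors' code: the ESM-2 checkpoint size, training hyperparameters, and example peptides are placeholder assumptions.

```python
# Minimal sketch (not the authors' code): continue masked language modeling of a small
# ESM-2 checkpoint on peptide sequences using the Hugging Face transformers API.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "facebook/esm2_t12_35M_UR50D"   # assumed small ESM-2 model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder peptide core sequences; in practice these would be LazA core regions
# from the substrate/non-substrate libraries.
peptides = ["WSAWSGLTDCA", "AYSNPHTSCWG", "GGSSCPTAYWH", "LLNPSSTGWCA"]
dataset = Dataset.from_dict({"text": peptides}).map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="peptide-esm", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()                         # masked-token pretraining continues on the peptide domain
model.save_pretrained("peptide-esm")    # adapted encoder reused later for embedding extraction
```

Embeddings from the adapted encoder can then be compared against embeddings from the unmodified (vanilla) model for downstream substrate prediction, as outlined in Fig. 2.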


Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. (a) The generic biosynthetic pathway of RiPPs. RiPP precursor peptides contain a leader peptide and a core peptide; after post-translational modification of the core peptide, the leader peptide is cleaved. (b) The lactazole biosynthetic gene cluster encodes six proteins. LazA is the precursor peptide. LazB (a tRNA-dependent glutamylation enzyme) and the eliminase domain of LazF form a serine dehydratase, while LazD (an RRE-containing E1-like protein), LazE (a YcaO cyclodehydratase), and the dehydrogenase domain of LazF comprise a thiazole synthetase. LazC is a pyridine synthase. (c) Serine dehydration catalyzed by LazBF. (d) Thiazole formation catalyzed by LazDEF.
Fig. 2. A schematic representation of the workflow for masked language modeling (MLM) of LazBF and LazDEF substrate preferences. (a) LazBF and LazDEF substrate/non-substrate embeddings were extracted from the protein language model ESM-2 (Vanilla-ESM). The baseline performance of downstream classification models was assessed. (b) Peptide language models (Peptide-ESM, LazBF-ESM, LazDEF-ESM, LazBCDEF-ESM) were developed via masked language modeling of 4 peptide data sets. Embeddings were extracted and the performance of downstream substrate prediction models was compared to baseline. (c) Protein language models were further trained to directly classify LazBF/DEF substrates. The models' predictions were analyzed with interpretable machine learning techniques including attention analysis (see Methods).
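
As an illustrative sketch of steps (a) and (b), not the authors' exact pipeline, the snippet below mean-pools the final hidden states of an ESM-2 encoder into fixed-length peptide embeddings and fits a downstream substrate classifier; the checkpoint, pooling scheme, peptides, and classifier choice are assumptions.

```python
# Sketch of steps (a)/(b): mean-pool ESM-2 hidden states into fixed-length peptide
# embeddings and fit a downstream substrate classifier. Checkpoint, pooling, data,
# and classifier are assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, EsmModel

checkpoint = "facebook/esm2_t12_35M_UR50D"  # vanilla or peptide-adapted checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = EsmModel.from_pretrained(checkpoint).eval()

def embed(seqs):
    """Return one mean-pooled embedding vector per peptide."""
    batch = tokenizer(seqs, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, length, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Placeholder labels: 1 = substrate, 0 = non-substrate.
X = embed(["WSAWSGLTDCA", "AYSNPHTSCWG", "GGSSCPTAYWH", "LLNPSSTGWCA"])
y = [1, 0, 1, 0]
classifier = LogisticRegression(max_iter=1000).fit(X, y)
```
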
Fig. 3. A schematic representation of our data preprocessing pipeline. (a) LazA core sequences (n = 1.3 million) were selected from library 5S5 and used for masked language modeling (MLM) of LazBF substrate preferences. A ‘held-out’ data set of 50 000 peptides was set aside for downstream model training and evaluation. (b) LazA core sequences (n = 1.3 million) were selected from library 6C6 and used for masked language modeling (MLM) of LazDEF substrate preferences. A held-out data set of 50 000 peptides was set aside for downstream model training and evaluation.
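
A minimal sketch of the held-out split described above, assuming a scikit-learn style workflow; the toy library and variable names are placeholders (each actual library contains roughly 1.3 million LazA core sequences, with 50 000 held out).

```python
# Sketch of the held-out split: reserve labeled peptides for downstream training and
# evaluation, and use the remainder for masked language modeling. Illustrative only.
from sklearn.model_selection import train_test_split

sequences = ["WSAWSGLTDCA", "AYSNPHTSCWG", "GGSSCPTAYWH", "LLNPSSTGWCA"]  # placeholder library
labels    = [1, 0, 1, 0]                                                  # substrate / non-substrate

# In the paper the held-out set holds 50 000 peptides; a fraction is used here so the toy runs.
mlm_seqs, held_out_seqs, mlm_labels, held_out_labels = train_test_split(
    sequences, labels, test_size=0.5, stratify=labels, random_state=0)
```
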
Fig. 4. Accuracy of logistic regression (LR), random forest (RF), AdaBoost (AB), support vector classifier (SVC), and multi-layer perceptron (MLP) models trained to predict LazDEF substrates. Models are trained on embeddings from a protein language model (green), a peptide language model trained on diverse peptides (orange), a peptide language model trained on LazBF substrates/non-substrates (blue), a peptide language model trained on LazDEF substrates/non-substrates (pink), and a peptide language model trained on substrates/non-substrates for the entire lactazole biosynthetic pathway (lime) in the (a) low-N condition (n = 200), (b) medium-N condition (n = 500), and (c) high-N condition (n = 1000). A star indicates the top performing model for each set of embeddings.
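
The comparison in Fig. 4 can be sketched as follows: fit the five classifier families on precomputed embeddings at each training-set size and record test accuracy. This is an illustration with random placeholder embeddings and scikit-learn default hyperparameters, not the authors' evaluation code.

```python
# Sketch: train LR/RF/AB/SVC/MLP on precomputed peptide embeddings at n = 200/500/1000.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 480))     # placeholder embeddings (e.g. 480-dim ESM-2 vectors)
y = rng.integers(0, 2, size=2000)    # placeholder substrate / non-substrate labels

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "AB": AdaBoostClassifier(),
    "SVC": SVC(),
    "MLP": MLPClassifier(max_iter=500),
}

for n in (200, 500, 1000):           # low-, medium-, and high-N conditions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=n, test_size=500, stratify=y, random_state=0)
    for name, model in models.items():
        acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        print(f"n={n} {name}: {acc:.3f}")
```
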
Fig. 5. t-SNE visualization of the LazDEF embedding space for (a) a protein language model, (b) a peptide language model trained on LazBF substrates/non-substrates, and (c) a peptide language model trained on LazDEF substrates/non-substrates. t-SNE visualization of the LazBF embedding space for (d) a protein language model, (e) a peptide language model trained on LazDEF substrates/non-substrates, and (f) a peptide language model trained on LazBF substrates/non-substrates. Substrates are shown in red and non-substrates in blue.
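
A minimal sketch of how such a panel could be produced with scikit-learn's t-SNE, assuming precomputed peptide embeddings (random placeholders here) and the red/blue coloring described in the caption.

```python
# Sketch: project peptide embeddings to 2-D with t-SNE and color by substrate label.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 480))  # placeholder peptide embeddings
labels = rng.integers(0, 2, size=1000)     # 1 = substrate, 0 = non-substrate

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[labels == 1, 0], coords[labels == 1, 1], s=4, c="red", label="substrate")
plt.scatter(coords[labels == 0, 0], coords[labels == 0, 1], s=4, c="blue", label="non-substrate")
plt.legend()
plt.show()
```
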
Fig. 6. Accuracy of logistic regression (LR), random forest (RF), AdaBoost (AB), support vector classifier (SVC), and multi-layer perceptron (MLP) models trained to predict LazBF substrates. Models are trained on embeddings from a protein language model (green), a peptide language model trained on diverse peptides (orange), a peptide language model trained on LazBF substrates/non-substrates (blue), a peptide language model trained on LazDEF substrates/non-substrates (pink), and a peptide language model trained on substrates/non-substrates for the entire lactazole biosynthetic pathway (lime) in the (a) low-N condition (n = 200), (b) medium-N condition (n = 500), and (c) high-N condition (n = 1000). A star indicates the top performing model for each set of embeddings.
Fig. 7. A LazBF substrate prediction model and a LazDEF substrate prediction model produce correlated integrated gradients for LazBF substrates/non-substrates. (a) The average contribution of each position to substrate fitness shows a Spearman coefficient of 0.73 between the two models. Position 6 is excluded because it contains a fixed serine residue. (b) The average contribution of each amino acid to substrate fitness shows a Spearman coefficient of 0.78 between the two models. Serine is excluded because its importance for substrate fitness is already established.
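
An illustrative sketch of the attribution analysis, assuming Captum's LayerIntegratedGradients over the token-embedding layer of a fine-tuned ESM-2 classifier; the checkpoint, baseline choice, and aggregation step are assumptions rather than the authors' exact implementation.

```python
# Sketch: integrated-gradients attributions of the "substrate" logit to each residue,
# computed over the token embedding layer of a fine-tuned ESM-2 classifier (assumed setup).
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoTokenizer, EsmForSequenceClassification

checkpoint = "facebook/esm2_t12_35M_UR50D"  # stand-in for a fine-tuned substrate classifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForSequenceClassification.from_pretrained(checkpoint, num_labels=2).eval()

def position_attributions(sequence):
    """Per-residue integrated-gradients scores for the substrate class."""
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    baseline = torch.full_like(ids, tokenizer.pad_token_id)  # all-padding baseline (an assumption)

    def forward(input_ids):
        return model(input_ids).logits[:, 1]                 # logit of the substrate class

    lig = LayerIntegratedGradients(forward, model.esm.embeddings.word_embeddings)
    attr = lig.attribute(ids, baselines=baseline)            # (1, seq_len, hidden_dim)
    return attr.sum(dim=-1).squeeze(0)[1:-1]                 # drop [BOS]/[EOS] positions

per_residue = position_attributions("WSAWSGLTDCA")           # placeholder peptide

# Averaging such attributions over many peptides for two fine-tuned models (hypothetical
# arrays mean_attr_lazbf and mean_attr_lazdef) and comparing them, e.g. with
# scipy.stats.spearmanr(mean_attr_lazbf, mean_attr_lazdef), yields correlations like those in (a)/(b).
```
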
Fig. 8. Attention maps from a language model trained to predict LazBF substrates. [BOS] and [EOS] tokens mark the “beginning of sequence” and “end of sequence”, respectively. (a) Middle and later layers focus on specific residues and motifs. (b) Attention heads from the penultimate layer highlight a motif with high pairwise epi-scores in a LazBF substrate. (c) Attention heads from the final layer highlight a residue important for substrate fitness in a LazDEF substrate.
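
A minimal sketch of how per-layer, per-head attention maps such as these can be extracted from a fine-tuned ESM-2 classifier via output_attentions=True; the checkpoint and peptide are placeholders.

```python
# Sketch: inspect attention maps of a (fine-tuned) ESM-2 substrate classifier.
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

checkpoint = "facebook/esm2_t12_35M_UR50D"  # stand-in for a fine-tuned substrate classifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForSequenceClassification.from_pretrained(checkpoint, num_labels=2).eval()

inputs = tokenizer("WSAWSGLTDCA", return_tensors="pt")       # placeholder peptide
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one tensor per layer of shape (batch, heads, seq, seq);
# the first and last sequence positions correspond to the [BOS] and [EOS] tokens.
penultimate = outputs.attentions[-2][0]                      # (heads, seq_len, seq_len)
final = outputs.attentions[-1][0]
print(penultimate.shape, final.shape)
```
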

