Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning

Joseph D Clark et al.

Digit Discov. 2024 Dec 2;4(2):343-354. doi: 10.1039/d4dd00170b. eCollection 2025 Feb 12.

Abstract

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
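
The workflow summarized above, continued masked language modeling (MLM) of a protein language model on a peptide substrate library followed by reuse of the adapted embeddings for downstream substrate prediction, can be sketched as follows. This is a minimal illustration rather than the authors' code: the ESM-2 checkpoint size, training hyperparameters, and example peptides are placeholder assumptions.

```python
# Minimal sketch (not the authors' code): continue masked language modeling of a small
# ESM-2 checkpoint on peptide sequences using the Hugging Face transformers API.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "facebook/esm2_t12_35M_UR50D"   # assumed small ESM-2 model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder peptide core sequences; in practice these would be LazA core regions
# from the substrate/non-substrate libraries.
peptides = ["WSAWSGLTDCA", "AYSNPHTSCWG", "GGSSCPTAYWH", "LLNPSSTGWCA"]
dataset = Dataset.from_dict({"text": peptides}).map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="peptide-esm", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()                         # masked-token pretraining continues on the peptide domain
model.save_pretrained("peptide-esm")    # adapted encoder reused later for embedding extraction
```

Embeddings from the adapted encoder can then be compared against embeddings from the unmodified (vanilla) model for downstream substrate prediction, as outlined in Fig. 2.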


Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. (a) The generic biosynthetic pathway of RiPPs. RiPP precursor peptides contain a leader peptide and a core peptide; after post-translational modification of the core peptide, the leader peptide is cleaved. (b) The lactazole biosynthetic gene cluster encodes six proteins. LazA is the precursor peptide. LazB (a tRNA-dependent glutamylation enzyme) and the eliminase domain of LazF form a serine dehydratase, while LazD (an RRE-containing E1-like protein), LazE (a YcaO cyclodehydratase), and the dehydrogenase domain of LazF comprise a thiazole synthetase. LazC is a pyridine synthase. (c) Serine dehydration catalyzed by LazBF. (d) Thiazole formation catalyzed by LazDEF.
Fig. 2. A schematic representation of the workflow for masked language modeling (MLM) of LazBF and LazDEF substrate preferences. (a) LazBF and LazDEF substrate/non-substrate embeddings were extracted from the protein language model ESM-2 (Vanilla-ESM). The baseline performance of downstream classification models was assessed. (b) Peptide language models (Peptide-ESM, LazBF-ESM, LazDEF-ESM, LazBCDEF-ESM) were developed via masked language modeling of 4 peptide data sets. Embeddings were extracted and the performance of downstream substrate prediction models was compared to baseline. (c) Protein language models were further trained to directly classify LazBF/DEF substrates. The models' predictions were analyzed with interpretable machine learning techniques including attention analysis (see Methods).
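
As an illustrative sketch of steps (a) and (b), not the authors' exact pipeline, the snippet below mean-pools the final hidden states of an ESM-2 encoder into fixed-length peptide embeddings and fits a downstream substrate classifier; the checkpoint, pooling scheme, peptides, and classifier choice are assumptions.

```python
# Sketch of steps (a)/(b): mean-pool ESM-2 hidden states into fixed-length peptide
# embeddings and fit a downstream substrate classifier. Checkpoint, pooling, data,
# and classifier are assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, EsmModel

checkpoint = "facebook/esm2_t12_35M_UR50D"  # vanilla or peptide-adapted checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = EsmModel.from_pretrained(checkpoint).eval()

def embed(seqs):
    """Return one mean-pooled embedding vector per peptide."""
    batch = tokenizer(seqs, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, length, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Placeholder labels: 1 = substrate, 0 = non-substrate.
X = embed(["WSAWSGLTDCA", "AYSNPHTSCWG", "GGSSCPTAYWH", "LLNPSSTGWCA"])
y = [1, 0, 1, 0]
classifier = LogisticRegression(max_iter=1000).fit(X, y)
```
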
Fig. 3. A schematic representation of our data preprocessing pipeline. (a) LazA core sequences (n = 1.3 million) were selected from library 5S5 and used for masked language modeling (MLM) of LazBF substrate preferences. A ‘held-out’ data set of 50 000 peptides was set aside for downstream model training and evaluation. (b) LazA core sequences (n = 1.3 million) were selected from library 6C6 and used for masked language modeling (MLM) of LazDEF substrate preferences. A held-out data set of 50 000 peptides was set aside for downstream model training and evaluation.
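
A minimal sketch of the held-out split described above, assuming a scikit-learn style workflow; the toy library and variable names are placeholders (each actual library contains roughly 1.3 million LazA core sequences, with 50 000 held out).

```python
# Sketch of the held-out split: reserve labeled peptides for downstream training and
# evaluation, and use the remainder for masked language modeling. Illustrative only.
from sklearn.model_selection import train_test_split

sequences = ["WSAWSGLTDCA", "AYSNPHTSCWG", "GGSSCPTAYWH", "LLNPSSTGWCA"]  # placeholder library
labels    = [1, 0, 1, 0]                                                  # substrate / non-substrate

# In the paper the held-out set holds 50 000 peptides; a fraction is used here so the toy runs.
mlm_seqs, held_out_seqs, mlm_labels, held_out_labels = train_test_split(
    sequences, labels, test_size=0.5, stratify=labels, random_state=0)
```
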
Fig. 4. Accuracy of logistic regression (LR), random forest (RF), AdaBoost (AB), support vector classifier (SVC), and multi-layer perceptron (MLP) models trained to predict LazDEF substrates. Models are trained on embeddings from a protein language model (green), a peptide language model trained on diverse peptides (orange), a peptide language model trained on LazBF substrates/non-substrates (blue), a peptide language model trained on LazDEF substrates/non-substrates (pink), and a peptide language model trained on substrates/non-substrates for the entire lactazole biosynthetic pathway (lime) in the (a) low-N condition (n = 200), (b) medium-N condition (n = 500), and (c) high-N condition (n = 1000). A star indicates the top performing model for each set of embeddings.
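
The comparison in Fig. 4 can be sketched as follows: fit the five classifier families on precomputed embeddings at each training-set size and record test accuracy. This is an illustration with random placeholder embeddings and scikit-learn default hyperparameters, not the authors' evaluation code.

```python
# Sketch: train LR/RF/AB/SVC/MLP on precomputed peptide embeddings at n = 200/500/1000.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 480))     # placeholder embeddings (e.g. 480-dim ESM-2 vectors)
y = rng.integers(0, 2, size=2000)    # placeholder substrate / non-substrate labels

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "AB": AdaBoostClassifier(),
    "SVC": SVC(),
    "MLP": MLPClassifier(max_iter=500),
}

for n in (200, 500, 1000):           # low-, medium-, and high-N conditions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=n, test_size=500, stratify=y, random_state=0)
    for name, model in models.items():
        acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        print(f"n={n} {name}: {acc:.3f}")
```
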
Fig. 5. t-SNE visualization of the LazDEF embedding space for (a) a protein language model, (b) a peptide language model trained on LazBF substrates/non-substrates, and (c) a peptide language model trained on LazDEF substrates/non-substrates. t-SNE visualization of the LazBF embedding space for (d) a protein language model, (e) a peptide language model trained on LazDEF substrates/non-substrates, and (f) a peptide language model trained on LazBF substrates/non-substrates. Substrates are shown in red and non-substrates in blue.
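
A minimal sketch of how such a panel could be produced with scikit-learn's t-SNE, assuming precomputed peptide embeddings (random placeholders here) and the red/blue coloring described in the caption.

```python
# Sketch: project peptide embeddings to 2-D with t-SNE and color by substrate label.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 480))  # placeholder peptide embeddings
labels = rng.integers(0, 2, size=1000)     # 1 = substrate, 0 = non-substrate

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[labels == 1, 0], coords[labels == 1, 1], s=4, c="red", label="substrate")
plt.scatter(coords[labels == 0, 0], coords[labels == 0, 1], s=4, c="blue", label="non-substrate")
plt.legend()
plt.show()
```
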
Fig. 6. Accuracy of logistic regression (LR), random forest (RF), AdaBoost (AB), support vector classifier (SVC), and multi-layer perceptron (MLP) models trained to predict LazBF substrates. Models are trained on embeddings from a protein language model (green), a peptide language model trained on diverse peptides (orange), a peptide language model trained on LazBF substrates/non-substrates (blue), a peptide language model trained on LazDEF substrates/non-substrates (pink), and a peptide language model trained on substrates/non-substrates for the entire lactazole biosynthetic pathway (lime) in the (a) low-N condition (n = 200), (b) medium-N condition (n = 500), and (c) high-N condition (n = 1000). A star indicates the top performing model for each set of embeddings.
Fig. 7. A LazBF substrate prediction model and a LazDEF substrate prediction model produce correlated integrated gradients for LazBF substrates/non-substrates. (a) The average contribution of each position to substrate fitness shows a Spearman coefficient of 0.73 between the two models. Position 6 is excluded because it contains a fixed serine residue. (b) The average contribution of each amino acid to substrate fitness shows a Spearman coefficient of 0.78 between the two models. Serine is excluded because its importance for substrate fitness is already established.
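
An illustrative sketch of the attribution analysis, assuming Captum's LayerIntegratedGradients over the token-embedding layer of a fine-tuned ESM-2 classifier; the checkpoint, baseline choice, and aggregation step are assumptions rather than the authors' exact implementation.

```python
# Sketch: integrated-gradients attributions of the "substrate" logit to each residue,
# computed over the token embedding layer of a fine-tuned ESM-2 classifier (assumed setup).
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoTokenizer, EsmForSequenceClassification

checkpoint = "facebook/esm2_t12_35M_UR50D"  # stand-in for a fine-tuned substrate classifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForSequenceClassification.from_pretrained(checkpoint, num_labels=2).eval()

def position_attributions(sequence):
    """Per-residue integrated-gradients scores for the substrate class."""
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    baseline = torch.full_like(ids, tokenizer.pad_token_id)  # all-padding baseline (an assumption)

    def forward(input_ids):
        return model(input_ids).logits[:, 1]                 # logit of the substrate class

    lig = LayerIntegratedGradients(forward, model.esm.embeddings.word_embeddings)
    attr = lig.attribute(ids, baselines=baseline)            # (1, seq_len, hidden_dim)
    return attr.sum(dim=-1).squeeze(0)[1:-1]                 # drop [BOS]/[EOS] positions

per_residue = position_attributions("WSAWSGLTDCA")           # placeholder peptide

# Averaging such attributions over many peptides for two fine-tuned models (hypothetical
# arrays mean_attr_lazbf and mean_attr_lazdef) and comparing them, e.g. with
# scipy.stats.spearmanr(mean_attr_lazbf, mean_attr_lazdef), yields correlations like those in (a)/(b).
```
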
Fig. 8. Attention maps from a language model trained to predict LazBF substrates. [BOS] and [EOS] tokens mark the “beginning of sequence” and “end of sequence”, respectively. (a) Middle and later layers focus on specific residues and motifs. (b) Attention heads from the penultimate layer highlight a motif with high pairwise epi-scores in a LazBF substrate. (c) Attention heads from the final layer highlight a residue important for substrate fitness in a LazDEF substrate.
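
A minimal sketch of how per-layer, per-head attention maps such as these can be extracted from a fine-tuned ESM-2 classifier via output_attentions=True; the checkpoint and peptide are placeholders.

```python
# Sketch: inspect attention maps of a (fine-tuned) ESM-2 substrate classifier.
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

checkpoint = "facebook/esm2_t12_35M_UR50D"  # stand-in for a fine-tuned substrate classifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForSequenceClassification.from_pretrained(checkpoint, num_labels=2).eval()

inputs = tokenizer("WSAWSGLTDCA", return_tensors="pt")       # placeholder peptide
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one tensor per layer of shape (batch, heads, seq, seq);
# the first and last sequence positions correspond to the [BOS] and [EOS] tokens.
penultimate = outputs.attentions[-2][0]                      # (heads, seq_len, seq_len)
final = outputs.attentions[-1][0]
print(penultimate.shape, final.shape)
```
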

