[Preprint]. 2024 Feb 23:arXiv:2402.15181v1.

Substrate Prediction for RiPP Biosynthetic Enzymes via Masked Language Modeling and Transfer Learning

Joseph D Clark et al. arXiv.

Abstract

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting such peptide fitness landscapes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream classification models of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved the performance of classification models of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations within the same biosynthetic pathway. Our transfer learning method improved performance and data efficiency in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries compatible with desired RiPP biosynthetic pathways.
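
The sketch below illustrates the masked language modeling step on a peptide library with Hugging Face transformers. It is a minimal illustration under stated assumptions, not the authors' code: the ESM-2 checkpoint size, the example peptides, and the hyperparameters are all placeholders.

    # Minimal sketch: continue masked language modeling of an ESM-2
    # checkpoint on a peptide library (cf. the LazBF/LazDEF substrate
    # sets). Checkpoint, peptides, and hyperparameters are illustrative.
    from datasets import Dataset
    from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                              EsmForMaskedLM, Trainer, TrainingArguments)

    name = "facebook/esm2_t12_35M_UR50D"       # any ESM-2 checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = EsmForMaskedLM.from_pretrained(name)

    peptides = ["SSWCTAWHSCS", "AHSPWCCSTWS"]  # placeholder sequences
    ds = Dataset.from_dict({"text": peptides}).map(
        lambda ex: tokenizer(ex["text"]), batched=True, remove_columns=["text"])

    # 15% random masking: the standard BERT-style MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    Trainer(model=model,
            args=TrainingArguments(output_dir="lazbf-esm", num_train_epochs=1),
            train_dataset=ds, data_collator=collator).train()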

Figures

Figure 1:
a) The generic biosynthesis pathway of RiPPs. RiPP precursor peptides contain a leader peptide and a core peptide. After post-translational modifications in the core peptide, the leader peptide is cleaved. b) The lactazole biosynthetic gene cluster contains six proteins. LazA is the precursor peptide. LazB (tRNA-dependent glutamylation enzyme) and the eliminase domain of LazF form a serine dehydratase while LazD (RRE-containing E1-like protein), LazE (YcaO cyclodehydratase), and the dehydrogenase domain of LazF comprise a thiazole synthetase. LazC is a pyridine synthase. c) Serine dehydration catalyzed by LazBF. d) Thiazole formation catalyzed by LazDEF.
Figure 2:
A schematic representation of the workflow for masked language modeling of LazBF and LazDEF substrate preferences. a) LazBF and LazDEF substrate/non-substrate embeddings were extracted from the protein language model ESM-2 (Vanilla-ESM), and the baseline performance of downstream classification models was assessed. b) Three copies of Vanilla-ESM were independently trained through masked language modeling of three peptide data sets; embeddings were extracted and the performance of downstream classification models was compared to baseline. c) Models were further trained to directly classify LazBF/DEF substrates, and the models' predictions were analyzed with interpretable machine learning techniques, including attention analysis (see Methods).
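
A minimal sketch of step a) under stated assumptions: mean-pooled final-layer embeddings and a logistic-regression head are illustrative choices (the paper's downstream classifiers may differ), and the peptides and labels are placeholders.

    # Sketch of step a): extract ESM-2 embeddings, then fit a downstream
    # classifier. Pooling strategy and classifier head are assumptions.
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoTokenizer, EsmModel

    name = "facebook/esm2_t12_35M_UR50D"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = EsmModel.from_pretrained(name).eval()

    def embed(seqs):
        # Mean-pool the last hidden layer over non-padding positions.
        batch = tokenizer(seqs, return_tensors="pt", padding=True)
        with torch.no_grad():
            h = model(**batch).last_hidden_state      # (n, length, dim)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((h * mask).sum(1) / mask.sum(1)).numpy()

    X = embed(["SSWCTAWHSCS", "AHSPWCCSTWS"])  # placeholder peptides
    y = [1, 0]                                 # 1 = substrate, 0 = non-substrate
    clf = LogisticRegression().fit(X, y)
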
Figure 3:
A schematic representation of our data preprocessing pipeline. a) LazA core sequences (n = 1.3 million) were selected from library 5S5, and a held-out data set of 50,000 peptides was set aside for downstream model training and evaluation. b) LazA core sequences (n = 1.3 million) were selected from library 6C6, and a held-out data set of 50,000 peptides was likewise set aside.
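
A minimal sketch of the held-out split, with a hypothetical stand-in for the sequence library:

    # Sketch of the Figure 3 split: reserve 50,000 peptides as a held-out
    # set. core_sequences stands in for the 1.3M-peptide library.
    from sklearn.model_selection import train_test_split

    core_sequences = [f"PEPTIDE{i}" for i in range(1_300_000)]  # placeholder
    mlm_pool, held_out = train_test_split(
        core_sequences, test_size=50_000, random_state=0)
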
Figure 4:
Accuracy of LazDEF substrate classification models trained on embeddings from Vanilla-ESM (green), ESM trained on a subset of PeptideAtlas (orange), ESM trained on LazBF substrates/non-substrates (blue), and ESM trained on LazDEF substrates/non-substrates (pink) in the a) high-N condition, b) medium-N condition, and c) low-N condition. A star indicates the top-performing model for each set of embeddings.
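
The N-condition comparison amounts to training the same classifier on nested subsets of the embedded held-out data. The sketch below uses synthetic stand-in arrays and assumed subset sizes; in practice the features would come from the embed() helper sketched under Figure 2.

    # Sketch of the high/medium/low-N comparison on synthetic stand-ins;
    # subset sizes are assumptions, not the paper's exact N values.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(10_000, 480)), rng.integers(0, 2, 10_000)
    X_test, y_test = rng.normal(size=(2_000, 480)), rng.integers(0, 2, 2_000)

    for n in (10_000, 1_000, 100):             # assumed high/medium/low N
        clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        print(n, accuracy_score(y_test, clf.predict(X_test)))
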
Figure 5:
t-SNE visualization of the LazDEF embedding space for a) Vanilla-ESM, b) ESM trained on LazBF substrates/non-substrates, and c) ESM trained on LazDEF substrates/non-substrates. t-SNE visualization of the LazBF embedding space for d) Vanilla-ESM, e) ESM trained on LazDEF substrates/non-substrates, and f) ESM trained on LazBF substrates/non-substrates. Substrates are shown in red and non-substrates in blue.
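
The panels can be reproduced in outline from any matrix of per-peptide embeddings; the sketch below uses random stand-in embeddings, with real features expected from the embed() helper sketched under Figure 2.

    # Sketch of the t-SNE panels: project embeddings to 2-D and color by
    # label (red = substrate, blue = non-substrate), as in the figure.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 480))            # stand-in embeddings
    y = rng.integers(0, 2, 500)                # stand-in labels

    coords = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(coords[:, 0], coords[:, 1], s=4,
                c=["red" if label else "blue" for label in y])
    plt.show()
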
Figure 6:
Accuracy of LazBF substrate classification models trained on embeddings from Vanilla-ESM (green), ESM trained on a subset of PeptideAtlas (orange), ESM trained on LazBF substrates/non-substrates (blue), and ESM trained on LazDEF substrates/non-substrates (pink) in the a) high-N condition, b) medium-N condition, and c) low-N condition. A star indicates the top-performing model for each set of embeddings.
Figure 7:
Fine-tuned LazBF-ESM and fine-tuned LazDEF-ESM produce correlated integrated gradients for LazBF substrates/non-substrates. a) The average contribution of each position to substrate fitness shows a Spearman correlation of 0.81 between the two models. b) The average contribution of each amino acid to substrate fitness shows a Spearman correlation of 0.80 between the two models.
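
Per-residue attributions of this kind can be computed with integrated gradients over the model's embedding layer, for example via Captum; the fine-tuned checkpoint path below is hypothetical.

    # Sketch: integrated gradients over the ESM embedding layer with
    # Captum. "lazbf-esm-finetuned" is a hypothetical local checkpoint.
    import torch
    from captum.attr import LayerIntegratedGradients
    from transformers import AutoTokenizer, EsmForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
    model = EsmForSequenceClassification.from_pretrained(
        "lazbf-esm-finetuned").eval()

    def substrate_logit(input_ids):
        return model(input_ids).logits[:, 1]   # score for the substrate class

    ids = tokenizer("SSWCTAWHSCS", return_tensors="pt")["input_ids"]
    lig = LayerIntegratedGradients(substrate_logit, model.esm.embeddings)
    attr = lig.attribute(
        ids, baselines=torch.full_like(ids, tokenizer.pad_token_id))
    per_residue = attr.sum(-1)                 # one attribution per position
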
Figure 8:
Attention maps from the fine-tuned LazBF-ESM. The [BOS] and [EOS] tokens mark the beginning and end of the sequence, respectively. a) Middle and later layers focus on specific residues and motifs. b) Attention heads from the penultimate layer highlight a motif with high pairwise epi-scores in a LazBF substrate. c) Attention heads from the final layer highlight a residue important for substrate fitness in a LazDEF substrate.
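
Attention maps like these can be read directly from the fine-tuned model by requesting attentions at inference time; the checkpoint path is again hypothetical.

    # Sketch: extract per-layer attention maps (BOS/EOS tokens included).
    # "lazbf-esm-finetuned" is a hypothetical local checkpoint.
    import torch
    from transformers import AutoTokenizer, EsmForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
    model = EsmForSequenceClassification.from_pretrained(
        "lazbf-esm-finetuned").eval()

    batch = tokenizer("SSWCTAWHSCS", return_tensors="pt")
    with torch.no_grad():
        out = model(**batch, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    penultimate = out.attentions[-2][0]        # maps for one peptide
    print(penultimate.mean(0))                 # head-averaged attention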
