J Cheminform. 2025 Feb 17;17(1):21.
doi: 10.1186/s13321-025-00969-7.

Predictive modeling of biodegradation pathways using transformer architectures

Liam Brydon et al. J Cheminform.

Abstract

In recent years, the integration of machine learning techniques into chemical reaction product prediction has opened new avenues for understanding and predicting the behaviour of chemical substances. The need for such predictive methods stems from growing regulatory and social awareness of the environmental consequences of persistent and accumulating chemical residues. Traditional biodegradation prediction methods rely on expert knowledge. However, encoding this knowledge is becoming prohibitively expensive as newer datasets grow more complex and diverse, leaving existing methods unable to make predictions on them. We formulate the product prediction problem as a sequence-to-sequence generation task and take inspiration from natural language processing and other reaction prediction tasks. In doing so, we reduce the need for the expensive manual creation of expert-based rules.
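
As a rough illustration of the sequence-to-sequence framing, the sketch below turns a reaction SMILES into a (source tokens, target tokens) pair. The `reactants>agents>products` layout is the standard reaction SMILES convention; the function names, the simplified token pattern, and the ethanol example are assumptions made for illustration and are not taken from the paper.

```python
import re

# Simplified SMILES token pattern (an assumption, not the paper's tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures, then any
# remaining single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def split_reaction(rxn_smiles):
    """Split 'reactants>agents>products' into (reactants, products)."""
    parts = rxn_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"not a reaction SMILES: {rxn_smiles!r}")
    reactants, _agents, products = parts
    return reactants, products


def tokenize(smiles):
    """Break a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def to_seq2seq_pair(rxn_smiles):
    """Return (source_tokens, target_tokens) for one reaction."""
    source, target = split_reaction(rxn_smiles)
    return tokenize(source), tokenize(target)


# Illustrative reaction only: ethanol oxidised to acetaldehyde.
print(to_seq2seq_pair("CCO>>CC=O"))
```

A model trained on such pairs learns to generate the product token sequence from the reactant token sequence, which is what removes the need for hand-written transformation rules.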

Keywords: Biodegradation; Cheminformatics; Product prediction; Transfer-learning; Transformer.

Conflict of interest statement

Declarations. Competing interests: JSW is one of the founders of enviPath UG & Co. KG, a scientific software development company that develops and maintains enviPath. The authors declare no competing interests.

Figures

Fig. 1
How pathway prediction is performed. A root molecule is given to a prediction method which recursively predicts products from reactants until an arbitrary stopping condition is reached
Fig. 2
An example reaction SMILES where 1,2,4-triazole degrades into Triazole-alanine
Fig. 3
An overview of the enviFormer method. During training, reactants and products are extracted from the train-set reactions and fed to the transformer. The transformer's output is then compared with the true product to calculate the loss from which the model learns. During inference, a reactant SMILES is given to the transformer, which generates a list of potential products with associated probabilities. Depending on the threshold used, different products will form the final set of output products
Fig. 4
Different reactions that 1,2,4-triazole undergoes, taken from the Soil dataset. There are two cases of multiple products: separate reactions represented by two different SMILES, and one reaction producing multiple products represented by one SMILES
Fig. 5
A beam decoding example using a simple ABC set of tokens to show how beam decoding is performed, starting with the SOS token. In this example, the beam width is two, and the two output sequences are ABC and CAC, with probabilities of 0.38 and 0.09, respectively
Fig. 6
Multi-generation evaluation example. Note that compound F does not contribute to the score as it is an intermediate product, and compound D gets its score adjusted as if it were an immediate product of compound B. With a threshold of 0.3, we get Precision=(B+D)/(B+D)=1, and Recall=(B+D)/(B+C+D)=0.75
Fig. 7
enviFormer's performance on different training sets. Comparing the cyan USPTO line to the others shows the benefits of transfer learning
Fig. 8
A comparison to the models trained with rules extracted by experts and extracted by enviRule on the BBD dataset
Fig. 9
A comparison to the rules extracted by experts and extracted by enviRule on the soil dataset
Fig. 10
A comparison to the models trained with rules extracted by experts and extracted by enviRule from BBD and Soil and then evaluated on the Sludge dataset
Fig. 11
The runtime of enviFormer with (green) and without (orange) GPU acceleration compared to the hybrid rule-based method (blue). Pathway represents the prediction time from the root. For the batch sizes, the time reported is the time per batch. * We used a Python implementation of the Ensemble of Classifier Chains method utilised by enviPath [8] with 317 rules extracted by enviRule
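
The captions for Fig. 1 and Fig. 3 describe pathway prediction as recursive single-step prediction with a probability threshold. A minimal sketch of that loop is given below, assuming a breadth-first expansion and a depth limit as the stopping condition. The function `predict_pathway`, the `max_depth` parameter, and the toy lookup table are hypothetical and stand in for any single-step model; they are not enviFormer's implementation.

```python
from collections import deque


def predict_pathway(root, predict_products, threshold=0.3, max_depth=3):
    """Expand a root compound into a pathway, breadth first.

    predict_products(compound) must return a list of
    (product, probability) pairs; it stands in for any single-step model.
    Returns a list of (reactant, product, probability) edges.
    """
    edges = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        reactant, depth = queue.popleft()
        if depth >= max_depth:  # stopping condition (an assumed choice)
            continue
        for product, prob in predict_products(reactant):
            if prob < threshold:  # drop low-probability products
                continue
            edges.append((reactant, product, prob))
            if product not in seen:  # expand each compound only once
                seen.add(product)
                queue.append((product, depth + 1))
    return edges


# Toy stand-in for a trained model: letters instead of SMILES strings.
TOY_MODEL = {
    "A": [("B", 0.9), ("C", 0.2)],
    "B": [("D", 0.5)],
    "D": [("A", 0.8)],
}


def toy_predict(compound):
    return TOY_MODEL.get(compound, [])


for edge in predict_pathway("A", toy_predict):
    print(edge)
```

Raising `threshold` prunes more candidate products and so shrinks the predicted pathway, which is the trade-off the Fig. 3 caption refers to.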

References

    1. European Union (2020) Regulation (EC) No 1907/2006 of the European Parliament and of the Council of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, amending Directive 1999/45/EC and repealing Council Regulation (EEC) No 793/93 and Commission Regulation (EC) No 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC (Text with EEA relevance). Legislative Body: OP_DATPRO. http://data.europa.eu/eli/reg/2006/1907/2020-08-24/eng. Accessed 12 Mar 2024
    2. European Union (2012) Regulation (EU) No 528/2012 of the European Parliament and of the Council of 22 May 2012 concerning the making available on the market and use of biocidal products (Text with EEA relevance). Legislative Body: CONSIL, EP. http://data.europa.eu/eli/reg/2012/528/oj/eng. Accessed 2 Dec 2024
    3. Ellis LBM, Roe D, Wackett LP (2006) The University of Minnesota Biocatalysis/Biodegradation Database: the first decade. Nucleic Acids Res 34(Database issue):517–521. doi: 10.1093/nar/gkj076
    4. Wicker J, Fenner K, Ellis L, Wackett L, Kramer S (2010) Predicting biodegradation products and pathways: a hybrid knowledge- and machine learning-based approach. Bioinformatics 26(6):814–821. doi: 10.1093/bioinformatics/btq024
    5. Fenner K, Gao J, Kramer S, Ellis L, Wackett L (2008) Data-driven extraction of relative reasoning rules to limit combinatorial explosion in biodegradation pathway prediction. Bioinformatics 24(18):2079–2085. doi: 10.1093/bioinformatics/btn378
