Nat Commun. 2022 Jul 5;13(1):3876.

doi: 10.1038/s41467-022-31245-z.

A versatile active learning workflow for optimization of genetic and metabolic networks

Amir Pandi^#¹, Christoph Diehl^#², Ali Yazdizadeh Kharrazi³, Scott A Scholz², Elizaveta Bobkova², Léon Faure⁴, Maren Nattermann², David Adam², Nils Chapin², Yeganeh Foroughijabbari², Charles Moritz², Nicole Paczia⁵, Niña Socorro Cortina^{2,6}, Jean-Loup Faulon^{4,7,8}, Tobias J Erb^{9,10}

Affiliations

¹ Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany. amir.pandi@mpi-marburg.mpg.de.
² Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany.
³ DataChef, Amsterdam, The Netherlands.
⁴ Micalis Institute, INRAE, AgroParisTech, University of Paris-Saclay, Jouy-en-Josas, France.
⁵ Core Facility for Metabolomics and Small Molecule Mass Spectrometry, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany.
⁶ LiVeritas Biosciences, Inc., 432N Canal St.; Ste. 20, South San Francisco, CA, 94080, USA.
⁷ Genomique Metabolique, Genoscope, Institut Francois Jacob, CEA, CNRS, Univ Evry, University of Paris-Saclay, Evry, France.
⁸ Manchester Institute of Biotechnology, SYNBIOCHEM center, School of Chemistry, The University of Manchester, Manchester, UK.
⁹ Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany. toerb@mpi-marburg.mpg.de.
¹⁰ SYNMIKRO Center of Synthetic Microbiology, Marburg, Germany. toerb@mpi-marburg.mpg.de.

^# Contributed equally.

PMID: 35790733
PMCID: PMC9256728
DOI: 10.1038/s41467-022-31245-z

Amir Pandi et al. Nat Commun. 2022.

Abstract

Optimization of biological networks is often limited by wet lab labor and cost, and by the lack of convenient computational tools. Here, we describe METIS, a versatile active machine learning workflow with a simple online interface for the data-driven optimization of biological targets with minimal experiments. We demonstrate our workflow for various applications, including cell-free transcription and translation, genetic circuits, and a 27-variable synthetic CO₂-fixation cycle (CETCH cycle), improving these systems by one to two orders of magnitude. For the CETCH cycle, we explore 10²⁵ conditions with only 1,000 experiments to yield the most efficient CO₂-fixation cascade described to date. Beyond optimization, our workflow also quantifies the relative importance of individual factors to the performance of a system, identifying unknown interactions and bottlenecks. Overall, our workflow opens the way for convenient optimization and prototyping of genetic and metabolic networks with customizable adjustments according to user experience, experimental setup, and laboratory facilities.
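To give a sense of scale for the CETCH figures quoted above, the short Python sketch below does the arithmetic. The three constants come from the abstract; the assumption that all 27 factors have the same number of levels is ours, made only to express the space size as an equivalent per-factor resolution, and is not how the factors were actually discretized.

```python
import math

N_FACTORS = 27        # from the abstract: 27-variable CETCH cycle
SPACE_SIZE = 1e25     # from the abstract: 10^25 possible conditions
N_EXPERIMENTS = 1000  # from the abstract: experiments actually run

# Assumption (ours): every factor has the same number of levels k,
# so that k ** N_FACTORS == SPACE_SIZE.
levels_per_factor = SPACE_SIZE ** (1 / N_FACTORS)

# Fraction of the combinatorial space that was measured directly.
fraction_tested = N_EXPERIMENTS / SPACE_SIZE

if __name__ == "__main__":
    print(f"equivalent levels per factor: {levels_per_factor:.1f}")
    print(f"fraction of space tested: {fraction_tested:.0e}")
```

Under that uniform-levels assumption the space corresponds to roughly eight levels per factor, and the 1,000 experiments sample about one part in 10²² of it.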

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Assessing the performance of different algorithms and testing the active learning workflow with minimal data points.**
a An existing dataset of cell-free gene expression compositions comprising 1000 data points was used to build a gold standard regressor and assess the performance of different machine learning algorithms in 10 rounds of active learning. b Top panel: performance of 4 algorithms, multilayer perceptrons (MLP), deep neural networks (DNN), linear regressors, and XGBoost gradient boosting, in 10 rounds of active learning (100 data points per round). Bottom panel: performance of the XGBoost gradient boosting algorithm, as the selected algorithm, with different sample sizes. The boxplots, with a whisker length of 1.5, represent the minimum, 25th percentile (bottom bound of box), median (center of box), 75th percentile (upper bound of box), and maximum. c An in vitro or cell-free transcription-translation (TXTL) system (based on *E. coli* lysate) used to test the workflow with 20 data points per round. A plasmid expressing *sfGfp* was added to the TXTL reaction mix along with 13 components of the reaction buffer and energy mix. d Overview of the active learning cycle. 13 components are varied, starting with random compositions; over 10 rounds, results are imported into the model, which learns and suggests new compositions to improve the objective function. e Average of triplicates (n = 3 independent experiments) of the objective function (yield) for compositions in 10 rounds (days) of active learning. The gray lines show the median. f Feature importance percentages show the effect of each factor on the model’s decision to calculate yields for the suggested compositions. g Distribution of different concentrations of each factor within the measured yields. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS.
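The cycle in panel d (random start, measure, retrain, suggest, repeat) can be illustrated with a minimal, standard-library-only Python sketch. This is not the METIS notebook: the nearest-neighbour surrogate stands in for the XGBoost regressor, `simulated_yield` is a hypothetical replacement for the wet-lab measurement, the five concentration levels are invented, and the half-exploit, half-explore batch rule is our assumption. Only the 13 factors, 20 compositions per round, and 10 rounds are taken from the caption.

```python
import random

LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical normalized concentrations


def simulated_yield(x):
    """Hypothetical stand-in for a measured yield (peaks near 0.7 per factor)."""
    return sum(1.0 - (v - 0.7) ** 2 for v in x)


def knn_predict(x, X, y, k=3):
    """Tiny surrogate model: mean yield of the k nearest measured compositions."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, xi)), yi) for xi, yi in zip(X, y)
    )[:k]
    return sum(yi for _, yi in nearest) / len(nearest)


def active_learning(n_factors=13, per_round=20, rounds=10, n_candidates=200, seed=0):
    rng = random.Random(seed)

    def random_composition():
        return tuple(rng.choice(LEVELS) for _ in range(n_factors))

    # Round 1: random compositions, then "measure" them.
    X = [random_composition() for _ in range(per_round)]
    y = [simulated_yield(x) for x in X]
    best_per_round = [max(y)]

    for _ in range(rounds - 1):
        # The surrogate ranks a random pool of untested candidates.
        pool = [random_composition() for _ in range(n_candidates)]
        ranked = sorted(pool, key=lambda c: knn_predict(c, X, y), reverse=True)
        n_exploit = per_round // 2
        # Half the batch exploits the prediction, half explores at random.
        batch = ranked[:n_exploit] + [
            random_composition() for _ in range(per_round - n_exploit)
        ]
        X.extend(batch)
        y.extend(simulated_yield(x) for x in batch)
        best_per_round.append(max(y))

    return X, y, best_per_round


if __name__ == "__main__":
    _, _, best = active_learning()
    print([round(b, 2) for b in best])
```

Swapping `knn_predict` for a trained gradient-boosting regressor and `simulated_yield` for imported plate-reader data would bring the sketch closer to the workflow described in the caption, at the cost of an external dependency.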

**Fig. 2. A representation of METIS, a modular active machine learning workflow for biological systems.**
a The first step is choosing an objective function (an output/target that depends on multiple factors), then continuing with the Google Colab Python notebook, performing experiments, and visualizing and analyzing results. b Users should define active learning parameters depending on the application, equipment, and the size of the combinatorial space. Factors’ ranges/categories are conditions that are varied to explore the behavior of the objective function. In each round of active learning, while the users perform experiments and label the suggested combinations with measured objective function values (parameters and factors’ conditions can be readjusted at any round), the data can be analyzed and visualized using the workflow’s modules. See Supplementary Note 2 and Supplementary Figs. 4–6 for a detailed explanation and guide for each step.

**Fig. 3. Application of METIS for optimization of a *LacI* gene circuit.**
a *LacI* gene circuits characterized by dynamic range (DR) and fold-change (FC) of the output (Gfp fluorescence) between 0 and 10 mM IPTG. b Active learning by varying components of *E. coli* TXTL, 4 *lacI* circuit plasmids as alternatives, T7 RNA polymerase, and a T7-*lacI* plasmid. c The objective function (FC × DR) and fold-change (FC) values, average of triplicates (n = 3 independent experiments), in 10 rounds of active learning. The gray lines show the median. d The distribution of yield values within the range of each factor. e Feature importance percentages showing the effect of each factor on the objective function. f Titration of P_T7-*LacI* plasmid and T7 RNA polymerase with the optimal composition (from the active learning, achieved with the pTHS circuit). The heatmaps show FC × DR (left) and FC (right) values (average of triplicates, n = 3 independent experiments) of the titration. g Fluorescence values (average of triplicates, n = 3 independent experiments) of a titration similar to that in f, but with a *Gfp*-expressing plasmid used instead of the pTHS circuit. h Titration of *LacI* plasmids with constitutive/T7 promoter in combination with a *Gfp* plasmid with constitutive/T7 promoter. i RT-qPCR results of the relative levels of *LacI* and *Gfp* mRNAs after 10 h. The relative log2 resource share between *LacI* and *Gfp* mRNA in each sample is reported to account for variability in RNA purification efficiency. In h and i, bars are the average of triplicates (n = 3 independent experiments) and error bars are the standard deviation. j Use of the METIS module "K most informative combinations" for further *LacI* circuit optimization. k Objective function FC × DR and FC (average of triplicates, n = 3 independent experiments) of the 20 most informative combinations with purified LacI (Day 0), followed by Day 1 experiments suggested by METIS. The gray lines show the median. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS. Source data for f–i are provided as a Source Data file.
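As a small illustration of the composite objective in panel c, the Python sketch below computes FC × DR from two fluorescence readings. The caption does not spell out the formulas, so the definitions used here are assumptions: fold-change as the ratio of fluorescence at 10 mM IPTG to fluorescence at 0 mM, and dynamic range as their difference. The function names and example values are hypothetical.

```python
def fold_change(f_uninduced, f_induced):
    """Assumed definition: ratio of induced to uninduced fluorescence."""
    if f_uninduced <= 0:
        raise ValueError("uninduced fluorescence must be positive")
    return f_induced / f_uninduced


def dynamic_range(f_uninduced, f_induced):
    """Assumed definition: induced minus uninduced fluorescence."""
    return f_induced - f_uninduced


def circuit_objective(f_uninduced, f_induced):
    """Composite objective FC x DR, as named in the caption."""
    return fold_change(f_uninduced, f_induced) * dynamic_range(f_uninduced, f_induced)


if __name__ == "__main__":
    # Hypothetical readings (arbitrary units) at 0 mM and 10 mM IPTG.
    print(circuit_objective(100.0, 1000.0))
```

Multiplying the two terms rewards circuits that are both tightly repressed (high FC) and strongly expressed when induced (high DR); a circuit that scores well on only one of them receives a modest product.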

**Fig. 4. Application of METIS for optimization of a transcription & translation unit.**
a The cell-free expression of *sfGfp* (super-folder *Gfp*) using plasmid, linear DNA (PCR) and linear DNA plus GamS protein, a nuclease inhibitor that protects linear DNA from degradation. The bars and the error bars are the average and standard deviation of triplicates (n = 3 independent experiments), respectively. b Design of a transcription & translation unit controlled by variants of a T7 promoter, ribosome binding site (RBS), N-terminal amino acids 3, 4, and 5, and the last two C-terminal amino acids. The combinatorial transcription & translation units are expressed from linear DNA in the TXTL system consisting of the *E. coli* lysate, buffer and energy mix, as well as purified GamS and T7 RNA polymerase. c The plot representing the average of triplicates (n = 3 independent experiments) as the result of 4 rounds of active learning, with 50 transcription & translation units tested per round. The yield is the Gfp fluorescence readout after 6 hours at 30 °C normalized by the same value from the reference constructs commonly used in the lab (Methods). The gray lines show the median. d A list of 20 most informative combinations of 4-day active learning performed in the cell-free system (c) was downloaded and the combinations were cloned in a vector and transformed into *E. coli* DH10β harboring a plasmid expressing auto-regulated T7 RNA polymerase (Methods). e Cell-free versus in vivo yields (average and standard deviation of triplicates, n = 3 independent experiments) for the 20 most informative combinations. f In vivo yield results (average of triplicates, n = 3 independent experiments) of Day 0 (20 most informative combinations) and Day 1 (suggested by the workflow). The gray lines show the median. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS. Source data for a, e are provided as a Source Data file.

**Fig. 5. Application of METIS for optimization of an in vitro CO₂-fixation pathway (CETCH cycle).**
a Reaction sequence of the CETCH cycle (see Methods for enzyme names and information). b Active learning with 125 conditions tested in each round. An ECHO® liquid handler pipetted the combinations, and the reactions were started with 100 µM propionyl-CoA and stopped after 3 h. The glycolate content was measured by LC-MS. c Optimization of the CETCH cycle with glycolate yield as the objective function. d Summary of the optimization and the switch of the objective function. e Transformed data of c (glycolate yield divided by the total amount of enzymes = efficiency) for rounds 1–5 (shaded region), and the data of three additional rounds of optimization with efficiency as the objective function (rounds 6–8). The yields in c and e are the average of triplicates (n = 3 independent experiments), and the gray lines show the median. f, g Feature importance of factors for active learning in c and e, respectively. h–l Manually pipetted experiments for seven conditions: the three highest glycolate yields (blue, orange, and red), a control (black), and three randomly picked underperforming conditions (green, lavender, burgundy), color-coded identically in h–l and circled in c and/or e. These plots show glycolate production over 8 h (h) and its first 15 min with slopes (i), initial production rate versus the final glycolate yield (j), total amount of measured CoA esters after 8 h versus the final glycolate yield (k), and quantified CoA esters over 8 h (l). The plotted values in h–k are the average of triplicates (n = 3 independent experiments), and the error bars represent the standard deviation. In l, bars are the average of triplicates (n = 3 independent experiments); each compound is plotted with error bars in Supplementary Fig. 17. In l, the amount of propionyl-CoA in the zero samples is the amount added (100 µM) to start the reaction and was not measured by LC-MS. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS. Source data for h–l are provided as a Source Data file.
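The transformation described for panel e (glycolate yield divided by the total amount of enzymes) is simple enough to sketch in Python. The function name, the example numbers, and their units are hypothetical; only the ratio itself comes from the caption.

```python
def efficiency(glycolate_yield, enzyme_amounts):
    """Glycolate yield divided by the summed enzyme amounts (per the caption)."""
    total_enzyme = sum(enzyme_amounts)
    if total_enzyme <= 0:
        raise ValueError("total enzyme amount must be positive")
    return glycolate_yield / total_enzyme


if __name__ == "__main__":
    # Two hypothetical conditions: (glycolate yield, per-enzyme amounts).
    conditions = {
        "high_yield": (300.0, [4.0, 4.0, 4.0]),
        "lean": (200.0, [1.0, 2.0, 1.0]),
    }
    for name, (glycolate, enzymes) in conditions.items():
        print(name, efficiency(glycolate, enzymes))
```

In this made-up example the condition with the lower absolute yield has the higher efficiency, which illustrates why switching the objective function between rounds 5 and 6 can change which conditions the model favors.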
