Nat Commun. 2022 Jul 5;13(1):3876.
doi: 10.1038/s41467-022-31245-z.

A versatile active learning workflow for optimization of genetic and metabolic networks

Amir Pandi et al. Nat Commun. 2022.

Abstract

Optimization of biological networks is often limited by wet-lab labor and cost, and by the lack of convenient computational tools. Here, we describe METIS, a versatile active machine learning workflow with a simple online interface for the data-driven optimization of biological targets with minimal experiments. We demonstrate our workflow for various applications, including cell-free transcription and translation, genetic circuits, and a 27-variable synthetic CO2-fixation cycle (CETCH cycle), improving these systems by one to two orders of magnitude. For the CETCH cycle, we explore 10^25 conditions with only 1,000 experiments to yield the most efficient CO2-fixation cascade described to date. Beyond optimization, our workflow also quantifies the relative importance of individual factors to the performance of a system, identifying unknown interactions and bottlenecks. Overall, our workflow opens the way for convenient optimization and prototyping of genetic and metabolic networks with customizable adjustments according to user experience, experimental setup, and laboratory facilities.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Assessing the performance of different algorithms and testing the active learning workflow with minimal data points.
a An existing dataset of 1000 cell-free gene expression compositions was used to build a gold standard regressor and assess the performance of different machine learning algorithms in 10 rounds of active learning. b Top panel: performance of 4 algorithms (multilayer perceptrons (MLP), deep neural networks (DNN), linear regressors, and XGBoost gradient boosting) in 10 rounds of active learning (100 data points per round). Bottom panel: performance of XGBoost gradient boosting, the selected algorithm, with different sample sizes. The boxplots, with a whisker length of 1.5, represent the minimum, 25th percentile (bottom bound of box), median (center of box), 75th percentile (upper bound of box), and maximum. c An in vitro or cell-free transcription-translation (TXTL) system (based on E. coli lysate) used to test the workflow with 20 data points per round. A plasmid expressing sfGfp was added to the TXTL reaction mix along with 13 components of reaction buffer and energy mix. d Overview of the active learning cycle. The 13 components are varied, starting with random compositions; over 10 rounds, the results are imported into the model, which learns and suggests new compositions to improve the objective function. e Average of triplicates (n = 3 independent experiments) of the objective function (yield) for compositions in 10 rounds (days) of active learning. The gray lines show the median. f Feature importance percentages show the effect of each factor on the model’s decision when calculating yields for the suggested compositions. g Distribution of the concentrations of each factor across the measured yields. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS.
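The cycle in panel d (random start, measure, retrain, suggest) can be illustrated with a minimal sketch. This is not the METIS notebook: the caption names XGBoost as the selected regressor, whereas the sketch below swaps in a bootstrap ensemble of nearest-neighbour regressors so that it runs on the Python standard library alone. The selection rule (ensemble mean plus ensemble spread), the normalized concentration levels, and every function name are assumptions made for illustration.

```python
import random
import statistics


def knn_predict(train, x, k=3):
    """Mean yield of the k labeled points closest to x (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    return statistics.fmean([y for _, y in nearest])


def suggest(labeled, n_factors, levels, batch, rng,
            n_models=10, n_candidates=200, explore=1.0):
    """Pick the next batch from random candidates (assumed rule: mean + spread)."""
    candidates = [tuple(rng.choice(levels) for _ in range(n_factors))
                  for _ in range(n_candidates)]
    candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    # Bootstrap resamples of the labeled data act as an ensemble of surrogates.
    ensemble = [[rng.choice(labeled) for _ in labeled] for _ in range(n_models)]

    def score(x):
        preds = [knn_predict(boot, x) for boot in ensemble]
        # High predicted yield (exploitation) plus disagreement (exploration).
        return statistics.fmean(preds) + explore * statistics.pstdev(preds)

    return sorted(candidates, key=score, reverse=True)[:batch]


def active_learning(measure, n_factors=13, levels=(0.0, 0.25, 0.5, 0.75, 1.0),
                    rounds=10, batch=20, seed=0):
    """Run the loop; `measure` stands in for the wet-lab experiment."""
    rng = random.Random(seed)
    labeled, history = [], []
    for round_index in range(rounds):
        if round_index == 0:
            # First round: random compositions, as in the caption.
            batch_x = [tuple(rng.choice(levels) for _ in range(n_factors))
                       for _ in range(batch)]
        else:
            batch_x = suggest(labeled, n_factors, levels, batch, rng)
        results = [(x, measure(x)) for x in batch_x]
        labeled.extend(results)
        history.append([y for _, y in results])
    return labeled, history


def toy_yield(x):
    """Synthetic objective (hypothetical): best when every factor is at 0.75."""
    return -sum((v - 0.75) ** 2 for v in x)
```

Calling `active_learning(toy_yield, n_factors=4, rounds=3, batch=5)` returns the labeled compositions and the per-round yields; in a real setting `measure` would be replaced by importing measured values after each day of experiments.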
Fig. 2. A representation of METIS, a modular active machine learning workflow for biological systems.
a The first step is choosing an objective function (an output/target that depends on multiple factors), then continuing with the Google Colab Python notebook, performing experiments, and visualizing and analyzing results. b Users should define active learning parameters depending on the application, equipment, and the size of the combinatorial space. Factors’ ranges/categories are conditions that are varied to explore the behavior of the objective function. In each round of active learning, while the users perform experiments and label the suggested combinations with measured objective function values (parameters and factors’ conditions can be readjusted at any round), the data can be analyzed and visualized using the workflow’s modules. See Supplementary Note 2 and Supplementary Figs. 4–6 for a detailed explanation and guide for each step.
Fig. 3. Application of METIS for optimization of a LacI gene circuit.
a LacI gene circuits characterized by dynamic range (DR) and fold-change (FC) of the output (Gfp fluorescence) between 0 and 10 mM IPTG. b Active learning by varying components of E. coli TXTL, 4 lacI circuit plasmids as alternatives, T7 RNA polymerase, and a T7-lacI plasmid. c The objective function (FC × DR) and fold-change (FC) values, average of triplicates (n = 3 independent experiments), in 10 rounds of active learning. The gray lines show the median. d The distribution of yield values within the range of each factor. e Feature importance percentages showing the effect of each factor on the objective function. f Titration of PT7-LacI plasmid and T7 RNA polymerase with the optimal composition (from active learning, achieved with the pTHS circuit). The heatmaps show FC × DR (left) and FC (right) values (average of triplicates, n = 3 independent experiments) of the titration. g Fluorescence values (average of triplicates, n = 3 independent experiments) of the same titration as in f, but with a Gfp-expressing plasmid used instead of the pTHS circuit. h Titration of LacI plasmids with constitutive/T7 promoter in combination with a Gfp plasmid with constitutive/T7 promoter. i The RT-qPCR results of the relative level of LacI and Gfp mRNAs after 10 h. The relative log2 resource share between LacI and Gfp mRNA in each sample is reported to account for variability in RNA purification efficiency. In h and i, bars are the average of triplicates (n = 3 independent experiments) and error bars are the standard deviation. j Usage of the METIS module "K most informative combinations" for further LacI circuit optimization. k Objective function FC × DR and FC (average of triplicates, n = 3 independent experiments) of the 20 most informative combinations with purified LacI (Day 0), followed by Day 1 experiments suggested by METIS. The gray lines show the median. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS. Source data for f–i are provided as a Source Data file.
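As a rough illustration of the FC × DR objective in panel c, the sketch below computes it from triplicate fluorescence readings. The caption names the two quantities but does not define them here; the sketch assumes the common conventions FC = induced / uninduced and DR = induced − uninduced, and all function names are hypothetical.

```python
import statistics


def fold_change(induced, uninduced):
    """Assumed definition: ratio of induced to uninduced fluorescence."""
    if uninduced <= 0:
        raise ValueError("uninduced fluorescence must be positive")
    return induced / uninduced


def dynamic_range(induced, uninduced):
    """Assumed definition: difference between induced and uninduced fluorescence."""
    return induced - uninduced


def objective_from_triplicates(induced_reps, uninduced_reps):
    """FC x DR computed on the means of the replicate readings."""
    induced = statistics.fmean(induced_reps)
    uninduced = statistics.fmean(uninduced_reps)
    return fold_change(induced, uninduced) * dynamic_range(induced, uninduced)
```

Multiplying the two terms rewards circuits that are both tightly repressed (high FC) and strongly expressed when induced (high DR), which is presumably why the product rather than either term alone serves as the objective.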
Fig. 4. Application of METIS for optimization of a transcription & translation unit.
a The cell-free expression of sfGfp (super-folder Gfp) using plasmid, linear DNA (PCR) and linear DNA plus GamS protein, a nuclease inhibitor that protects linear DNA from degradation. The bars and the error bars are the average and standard deviation of triplicates (n = 3 independent experiments), respectively. b Design of a transcription & translation unit controlled by variants of a T7 promoter, ribosome binding site (RBS), N-terminal amino acids 3, 4, and 5, and the last two C-terminal amino acids. The combinatorial transcription & translation units are expressed from linear DNA in the TXTL system consisting of the E. coli lysate, buffer and energy mix, as well as purified GamS and T7 RNA polymerase. c The plot representing the average of triplicates (n = 3 independent experiments) as the result of 4 rounds of active learning, with 50 transcription & translation units tested per round. The yield is the Gfp fluorescence readout after 6 hours at 30 °C normalized by the same value from the reference constructs commonly used in the lab (Methods). The gray lines show the median. d A list of 20 most informative combinations of 4-day active learning performed in the cell-free system (c) was downloaded and the combinations were cloned in a vector and transformed into E. coli DH10β harboring a plasmid expressing auto-regulated T7 RNA polymerase (Methods). e Cell-free versus in vivo yields (average and standard deviation of triplicates, n = 3 independent experiments) for the 20 most informative combinations. f In vivo yield results (average of triplicates, n = 3 independent experiments) of Day 0 (20 most informative combinations) and Day 1 (suggested by the workflow). The gray lines show the median. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS. Source data for a, e are provided as a Source Data file.
Fig. 5. Application of METIS for optimization of an in vitro CO2-fixation pathway (CETCH cycle).
a Reaction sequence of the CETCH cycle (see Methods for enzyme names and information). b Active learning with 125 conditions tested in each round. An ECHO® liquid handler pipetted the combinations, and the reactions were started with 100 µM propionyl-CoA and stopped after 3 h. The glycolate content was measured by LC-MS. c Optimization of the CETCH cycle with glycolate yield as the objective function. d Summary of the optimization and the switch of the objective function. e Transformed data of c (efficiency = glycolate yield divided by the total amount of enzymes) for rounds 1–5 (shaded region), and the data of three additional rounds of optimization with efficiency as the objective function (rounds 6–8). The yields in c and e are the average of triplicates (n = 3 independent experiments), and the gray lines show the median. f, g Feature importance of factors for active learning in c and e, respectively. h–l Manually pipetted experiments for seven conditions: the three highest glycolate yields (blue, orange, and red), a control (black), and three randomly picked underperforming conditions (green, lavender, burgundy), color coded the same in h–l and circled in c and/or e. These plots show glycolate production over 8 h (h) and its first 15 min with slopes (i), initial production rate versus the final glycolate yield (j), total amount of measured CoA esters after 8 h versus the final glycolate yield (k), and quantified CoA esters over 8 h (l). The plotted values in h–k are the average of triplicates (n = 3 independent experiments), and the error bars represent the standard deviation. In l, bars are the average of triplicates (n = 3 independent experiments); each compound is plotted with error bars in Supplementary Fig. 17. In l, the amount of propionyl-CoA in the zero samples is the amount added to start the reaction (100 µM) and was not measured by LC-MS. The Google Colab Python notebook and all active learning data (combinations and yields) in this figure are available at https://github.com/amirpandi/METIS. Source data for h–l are provided as a Source Data file.
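The switch of objective described in panels d and e, from glycolate yield to efficiency, amounts to dividing each measured yield by the total enzyme amount of its condition. A minimal sketch follows; the function name, the dictionary layout, the placeholder enzyme labels, and the units are assumptions, not taken from the METIS notebook.

```python
def efficiency(glycolate_yield, enzyme_amounts):
    """Glycolate yield divided by the total amount of enzymes in the condition.

    `enzyme_amounts` maps an enzyme label to its amount; labels and units
    are placeholders and must simply be consistent across conditions.
    """
    total = sum(enzyme_amounts.values())
    if total <= 0:
        raise ValueError("total enzyme amount must be positive")
    return glycolate_yield / total


# Re-scoring already measured conditions under the new objective
# (illustrative numbers only, not data from the study).
conditions = [
    {"yield": 500.0, "enzymes": {"enzyme_A": 2.0, "enzyme_B": 3.0}},
    {"yield": 450.0, "enzymes": {"enzyme_A": 1.0, "enzyme_B": 2.0}},
]
scores = [efficiency(c["yield"], c["enzymes"]) for c in conditions]
```

In this made-up example the second condition has the lower yield but the higher efficiency, which shows how changing the objective can reorder conditions without any new measurements.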
