Nat Commun. 2025 Jan 16;16(1):714.
doi: 10.1038/s41467-025-55987-8.

Active learning-assisted directed evolution

Jason Yang et al. Nat Commun. 2025.

Abstract

Directed evolution (DE) is a powerful tool to optimize protein fitness for a specific application. However, DE can be inefficient when mutations exhibit non-additive, or epistatic, behavior. Here, we present Active Learning-assisted Directed Evolution (ALDE), an iterative machine learning-assisted DE workflow that leverages uncertainty quantification to explore the search space of proteins more efficiently than current DE methods. We apply ALDE to an engineering landscape that is challenging for DE: optimization of five epistatic residues in the active site of an enzyme. In three rounds of wet-lab experimentation, we improve the yield of a desired product of a non-native cyclopropanation reaction from 12% to 93%. We also perform computational simulations on existing protein sequence-fitness datasets to support our argument that ALDE can be more effective than DE. Overall, ALDE is a practical and broadly applicable strategy to unlock improved protein engineering outcomes.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Conceptual differences between DE and ALDE.
A A common workflow for DE, where a starting protein is mutated and fitnesses of variants are measured (screened). The best variant is used as the starting point for the next round of mutation and screening, until desired fitness is achieved. B Conceptualization of DE as greedy hill climbing optimization on a hypothetical protein fitness landscape. C Workflow for ALDE. An initial training library is generated, where k residues are mutated simultaneously (for example k = 5). A small subset of this library is randomly picked, after which the variants are sequenced and their fitnesses are screened. A supervised ML model with uncertainty quantification is trained to learn a mapping from sequence to fitness. An acquisition function is used to propose new variants to test, balancing exploration (high uncertainty) and exploitation (high predicted fitness). The process is repeated until desired fitness is achieved. D Conceptualization of active learning on a hypothetical protein fitness landscape. Active learning is often more effective than DE for finding optimal combinations of mutations. In these conceptualizations, a single sequence is queried in each round, but in practical settings, active learning operates in batch where multiple sequences are tested in each round.
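The ALDE workflow in panel C can be sketched as a small batch active learning loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy landscape (`toy_fitness`), the four-letter alphabet, the bootstrap ensemble of per-site additive models, and the upper-confidence-bound weight `beta` are all hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import itertools
import random
import statistics

ALPHABET = "ACDE"  # toy 4-letter alphabet (assumption; real proteins use 20 amino acids)
K = 3              # number of sites mutated simultaneously (toy value)

def toy_fitness(seq):
    """Hypothetical landscape with one epistatic bonus; stands in for a wet-lab screen."""
    additive = sum(0.1 * ALPHABET.index(aa) for aa in seq)
    bonus = 1.0 if seq[0] == "D" and seq[2] == "A" else 0.0
    return additive + bonus

def fit_additive(train):
    """Fit one deliberately simple model: mean fitness per (site, residue)."""
    table = {}
    for seq, y in train:
        for site, aa in enumerate(seq):
            table.setdefault((site, aa), []).append(y)
    means = {key: statistics.fmean(vals) for key, vals in table.items()}
    default = statistics.fmean([y for _, y in train])
    return lambda seq: statistics.fmean(
        [means.get((i, aa), default) for i, aa in enumerate(seq)]
    )

def predict_with_uncertainty(train, candidates, rng, n_models=10):
    """Bootstrap ensemble: the mean is the prediction, the spread is the uncertainty."""
    models = [fit_additive(rng.choices(train, k=len(train))) for _ in range(n_models)]
    stats = {}
    for seq in candidates:
        preds = [model(seq) for model in models]
        stats[seq] = (statistics.fmean(preds), statistics.pstdev(preds))
    return stats

def alde(n_init=8, batch=8, rounds=3, beta=2.0, seed=0):
    rng = random.Random(seed)
    space = ["".join(p) for p in itertools.product(ALPHABET, repeat=K)]
    # Initial library: a random subset is "sequenced and screened".
    measured = {s: toy_fitness(s) for s in rng.sample(space, n_init)}
    history = [max(measured.values())]
    for _ in range(rounds):
        pool = [s for s in space if s not in measured]
        stats = predict_with_uncertainty(list(measured.items()), pool, rng)
        # Acquisition: predicted fitness (exploitation) plus beta * uncertainty (exploration).
        ranked = sorted(pool, key=lambda s: stats[s][0] + beta * stats[s][1], reverse=True)
        for s in ranked[:batch]:
            measured[s] = toy_fitness(s)  # screen the proposed batch
        history.append(max(measured.values()))
    return measured, history

measured, history = alde()
print(len(measured), history)
```

Setting `beta=0` reduces the acquisition to purely greedy selection on predicted fitness, which makes the exploration term easy to switch off for comparison.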
Fig. 2. A challenging, epistatic protein design space: optimization of five active site residues in ParPgb.
A Our objective was to optimize an enzyme to catalyze the formation of the cis product of a cyclopropanation reaction with high yield and high selectivity, which we quantify in a single value as cis − trans Yield. B The parent protein ParLQ is two mutations (W59L and V60Q) away from the wild-type ParPgb sequence. Five residues in the active site of ParLQ which were likely to exhibit epistasis were targeted: W56, Y57, L59, Q60, and F89. C The single mutations from parent at the five targeted sites do not offer significant improvements to the objective of cis − trans Yield. Very few single-mutation variants have the desired selectivity (positive cis − trans Yield), and it would not be obvious which variant to take forward in a DE campaign. Parent yields vary between runs but consistently show moderate yield and selectivity for the trans product. D Various recombinations of ideal single mutations are not effective proteins for the desired objective (cis − trans Yield) or for related metrics such as cis Yield and cis/trans Selectivity. DAYFW, DGMDW, and DGMVW are the ideal combinations of single mutations naively predicted to have the highest cis Yield, cis − trans Yield (objective), and cis/trans Selectivity, respectively. Yields were measured in biological triplicate. Overall, these results suggest an optimization problem that is challenging for standard DE methods. Source data are provided as a Source Data file.
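The objective in panel A and the naive recombination in panel D can be written out as a short sketch. It assumes the objective is the cis yield minus the trans yield, and that a "naive" recombination takes, at each site, the residue whose single mutant scored best. The function names and all yield values below are made up for illustration and are not data from the paper.

```python
def cis_minus_trans(cis_yield, trans_yield):
    """Objective: positive when the cis product dominates, negative otherwise."""
    return cis_yield - trans_yield

def naive_recombination(parent, parent_score, single_scores):
    """Combine the best single mutation at each site, assuming additivity.

    single_scores maps (site_index, residue) -> objective of that single mutant.
    A site keeps the parent residue unless some single mutant beats the parent.
    """
    combo = list(parent)
    best = [parent_score] * len(parent)
    for (site, aa), score in single_scores.items():
        if score > best[site]:
            best[site] = score
            combo[site] = aa
    return "".join(combo)

# Hypothetical yields (percent); the parent favors the trans product.
parent = "WYLQF"  # residues at the five targeted sites of ParLQ
parent_score = cis_minus_trans(10.0, 30.0)
singles = {
    (0, "D"): cis_minus_trans(20.0, 15.0),
    (1, "G"): cis_minus_trans(12.0, 14.0),
    (1, "A"): cis_minus_trans(5.0, 35.0),
    (2, "M"): cis_minus_trans(15.0, 20.0),
}
print(naive_recombination(parent, parent_score, singles))
```

The additivity assumption is exactly what epistasis breaks: the combination picked this way need not score well once the mutations are actually combined.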
Fig. 3. ALDE optimization trajectory on the ParPgb active site.
The optimization campaign started with (A) constructing an initial library with mutations at all five sites under study using NNK degenerate codons, randomly selecting 384 variants to screen for product formation, and mapping them to sequences using LevSeq. This was followed by two rounds of ALDE: (B) Round 1 and (C) Round 2. In Round 1 and Round 2, exact genes were ordered as ENFINIA DNA produced by Elegen Corp. and screened for product formation. For each round, we present the distribution of amino acids sampled at each site and the distribution of yields for the cis and trans products, with a few of the top-performing variants labeled. Overall improvement in (D) cis − trans Yield, (E) Total Yield, and (F) cis/trans Selectivity over several rounds of ALDE for the best variant in each round and the mean across variants in each round. The best variant in each round, defined by the objective of cis − trans Yield, is labeled. Error bars indicate the standard deviation across variants in the round. Yields were measured in biological triplicate. Source data are provided as a Source Data file.
Fig. 4. Performance of simulated ALDE campaigns on two combinatorially complete protein datasets, GB1 and TrpB.
A Each DE simulation is a greedy single-step walk on four residues, where each residue in turn is fixed to its optimal mutation until all four residues have been iterated across. DE simulations start from every variant that has some measurable function, with all 24 possible orderings of four residues simulated. B Each ALDE simulation starts from a random sample of 96 variants on the 4-site landscape, with four rounds of learning and proposing new sequences to test, each with 96 protein variants. C Hypothetical visualization of the three acquisition functions explored in this work: greedy, upper confidence bound (UCB), and Thompson sampling (TS). D ALDE for four encodings, four models, and three acquisition functions generally outperforms the average DE simulation and random sampling on the GB1 and TrpB datasets. Performance is quantified as the normalized maximum fitness achieved by the end of the ALDE campaign. Error bars indicate standard deviation across 70 random initializations. Source data are provided as a Source Data file.
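The greedy single-step walk in panel A can be sketched as follows. The two-letter alphabet and the `toy_fitness` landscape are hypothetical, chosen so that one mutation is beneficial only after another has been fixed; the real simulations use the GB1 and TrpB datasets, which are not reproduced here.

```python
import itertools

def greedy_walk(start, order, fitness, alphabet):
    """Visit sites in the given order, fixing each to its best residue in turn."""
    current = start
    for site in order:
        variants = [current[:site] + aa + current[site + 1:] for aa in alphabet]
        current = max(variants, key=fitness)  # includes the unchanged sequence
    return current

def simulate_de(start, fitness, alphabet):
    """Run the greedy walk for every ordering of the sites (24 orderings for 4 sites)."""
    sites = range(len(start))
    return {
        order: greedy_walk(start, order, fitness, alphabet)
        for order in itertools.permutations(sites)
    }

def toy_fitness(seq):
    """Hypothetical epistatic landscape: B at site 0 helps only if site 1 is already B."""
    base = 0.1 * seq.count("B")
    if seq[0] == "B":
        return base + 1.0 if seq[1] == "B" else base - 0.3
    return base

endpoints = simulate_de("AAAA", toy_fitness, "AB")
reached_optimum = sum(1 for seq in endpoints.values() if seq == "BBBB")
print(f"{reached_optimum} of {len(endpoints)} orderings reach BBBB")
```

On this toy landscape, only the orderings that visit site 1 before site 0 reach the global optimum; the rest stop at ABBB, which illustrates how the outcome of greedy DE can depend on the order in which sites are optimized.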
Fig. 5. Analysis of uncertainty quantification on simulated ALDE campaigns.
A Metrics used to evaluate how well calibrated each of the four models is for four encodings. Metrics for evaluation are the mean absolute error (MAE), the miscalibration area for the calibration curve, and the Spearman correlation between uncertainty and error. All metrics are calculated based on all measured points in the combinatorial design space. All results are based on the campaigns using UCB as the acquisition function, during the final round of the campaign. Error bars indicate standard deviation across 70 random initializations. B Visualizations of three hypothetical models with underconfident, calibrated, and overconfident uncertainties, and their respective calibration curves. C Visualization of how the Spearman correlation between uncertainty and error is calculated. Source data are provided as a Source Data file.
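The three metrics in panel A can be computed along the following lines. This sketch assumes Gaussian predictive distributions and defines the calibration curve through central prediction intervals, which is one common convention and may differ from the definition used in the paper. The function names are illustrative.

```python
from statistics import NormalDist, fmean

def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted fitness."""
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])

def miscalibration_area(y_true, mu, sigma, n_levels=99):
    """Area between the calibration curve and the diagonal (0 = perfectly calibrated).

    For each expected coverage p, count how often the true value falls inside
    the central p-interval of a Gaussian N(mu, sigma^2) (assumed form).
    """
    step = 1 / (n_levels + 1)
    levels = [(i + 1) * step for i in range(n_levels)]
    gaps = [0.0]  # at p = 0, expected and observed coverage are both 0
    for p in levels:
        z = NormalDist().inv_cdf(0.5 + p / 2)
        inside = [abs(t - m) <= z * s for t, m, s in zip(y_true, mu, sigma)]
        gaps.append(abs(sum(inside) / len(inside) - p))
    gaps.append(0.0)  # at p = 1, both are 1
    return sum((a + b) / 2 * step for a, b in zip(gaps, gaps[1:]))  # trapezoid rule

def ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = fmean(rx), fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Correlate each point's predicted uncertainty with its absolute error.
y_true, mu, sigma = [1.0, 2.0, 3.0, 4.0], [1.1, 1.7, 3.6, 5.0], [0.1, 0.2, 0.5, 0.9]
errors = [abs(t - m) for t, m in zip(y_true, mu)]
print(mae(y_true, mu), miscalibration_area(y_true, mu, sigma), spearman(sigma, errors))
```

A Spearman correlation near 1 means the model is most uncertain where it is most wrong, which is the property an uncertainty-driven acquisition function relies on.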
