Algorithms Mol Biol. 2021 Jul 1;16(1):13.
doi: 10.1186/s13015-021-00195-4.

Bayesian optimization with evolutionary and structure-based regularization for directed protein evolution

Trevor S Frisby et al. Algorithms Mol Biol.

Abstract

Background: Directed evolution (DE) is a technique for protein engineering that involves iterative rounds of mutagenesis and screening to search for sequences that optimize a given property, such as binding affinity to a specified target. Unfortunately, the underlying optimization problem is under-determined, so mutations introduced to improve the specified property may come at the expense of unmeasured but nevertheless important properties (e.g., solubility and thermostability). We address this issue by formulating DE as a regularized Bayesian optimization problem in which the regularization term reflects evolutionary or structure-based constraints.

Results: We applied our approach to three representative proteins, GB1, BRCA1, and SARS-CoV-2 Spike, and evaluated both evolutionary and structure-based regularization terms. The results of these experiments demonstrate that: (i) structure-based regularization usually leads to better designs (and never hurts), compared to the unregularized setting; (ii) evolutionary-based regularization tends to be least effective; and (iii) regularization leads to better designs because it effectively focuses the search in certain areas of sequence space, making better use of the experimental budget. Additionally, like previous work in machine learning-assisted DE, we find that our approach significantly reduces the experimental burden of DE, relative to model-free methods.

Conclusion: Introducing regularization into a Bayesian ML-assisted DE framework alters the exploratory patterns of the underlying optimization routine, and can shift variant selections towards those with a range of targeted and desirable properties. In particular, we find that structure-based regularization often improves variant selection compared to unregularized approaches, and never hurts.

Keywords: Active learning; Bayesian optimization; Directed evolution; Gaussian process regression; Protein design; Protein language model; Rational design; Regularization.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Traditional, model-free approaches to directed evolution: (Top) The ‘single mutation walk’ approach to directed evolution. The library of variants is the union of k libraries, each created by performing saturation mutagenesis at a single location. The resulting library, therefore, has 20 × k variants. The library is screened to find the single variant that optimizes the measured trait. That variant is fixed and the procedure is repeated for the remaining k − 1 positions. (Bottom) The library of variants is created by performing saturation mutagenesis at k positions. The top variants are identified through screening. Those variants are randomly recombined to generate a second library, which is then screened to find the top design.
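The single mutation walk in the top panel can be sketched as a greedy loop. This is only an illustrative sketch: the function names, the toy target "WYKC", and the additive toy fitness are assumptions made for the example, not taken from the paper. It also shows why library size matters: a round of single-site libraries over k open positions screens 20 × k variants, whereas simultaneous saturation at k positions spans 20^k combinations.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def single_mutation_walk(start, positions, fitness):
    """Greedy walk: screen every single-site substitution at the open
    positions, fix the best one, and repeat on the remaining positions."""
    current = list(start)
    remaining = list(positions)
    screened = 0  # number of simulated assay measurements
    while remaining:
        best = None
        for pos in remaining:
            for aa in AMINO_ACIDS:
                candidate = current.copy()
                candidate[pos] = aa
                score = fitness("".join(candidate))
                screened += 1
                if best is None or score > best[0]:
                    best = (score, pos, aa)
        _, pos, aa = best
        current[pos] = aa       # fix the winning substitution
        remaining.remove(pos)   # that position is now closed
    return "".join(current), screened

# Hypothetical toy landscape: fitness counts matches to an arbitrary target.
TARGET = "WYKC"

def toy_fitness(seq):
    return sum(a == b for a, b in zip(seq, TARGET))

k = 4
best_seq, n_screened = single_mutation_walk("AAAA", range(k), toy_fitness)
print(best_seq, n_screened)   # walk result and total variants screened
print(20 * k, 20 ** k)        # first-round library vs. full combinatorial space
```

On an additive landscape like this toy one the greedy walk finds the optimum; with epistasis between positions it can stall at a local optimum, which is one motivation for model-guided selection.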
Fig. 2
Machine learning-assisted directed evolution: The first step in ML-assisted DE is the same as for traditional DE (see Fig. 1). A library of variants is created via mutagenesis. Existing data, $S=\{s_k, y\}_{i=1:n}$, are used to train a classifier or regression model, $f(s_k) \rightarrow y$, which is then used to rank variants via an in silico screen. The top variants are then synthesized/cloned and screened using in vitro or in vivo assays. The data from the ith round are added to S and used in subsequent DE rounds.
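The train, rank, screen, and augment cycle described in this caption could look roughly like the sketch below. Everything here is an assumption for illustration: the additive per-position model is a deliberately simple stand-in for the classifier or regression model (it is not a Gaussian process), `toy_assay` simulates the wet-lab screen, and the number of rounds is arbitrary. The initial set of 20 sequences and batch size of 19 are borrowed from the setup reported in Fig. 4.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fit_additive_model(data):
    """Stand-in regression model: average observed score of every
    (position, residue) pair seen so far."""
    sums, counts = {}, {}
    for seq, y in data.items():
        for key in enumerate(seq):          # key = (position, residue)
            sums[key] = sums.get(key, 0.0) + y
            counts[key] = counts.get(key, 0) + 1
    overall = sum(data.values()) / len(data)

    def predict(seq):
        # Unseen (position, residue) pairs fall back to the overall mean.
        return sum(sums[k] / counts[k] if k in counts else overall
                   for k in enumerate(seq))
    return predict

def ml_assisted_de(library, assay, n_init=20, batch_size=19, n_rounds=3, seed=0):
    rng = random.Random(seed)
    # Initial screen of randomly chosen variants.
    data = {s: assay(s) for s in rng.sample(library, n_init)}
    for _ in range(n_rounds):
        predict = fit_additive_model(data)                    # train f(s) -> y
        untested = [s for s in library if s not in data]
        ranked = sorted(untested, key=predict, reverse=True)  # in silico screen
        for s in ranked[:batch_size]:                         # simulated assay
            data[s] = assay(s)                                # add to S
    return data

# Hypothetical toy assay over a 3-position combinatorial library.
def toy_assay(seq):
    return sum(a == b for a, b in zip(seq, "WYK"))

library = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=3)]
data = ml_assisted_de(library, toy_assay)
print(len(data), max(data.values()))
```

Ranking purely by predicted score, as done here, is greedy exploitation; a Bayesian approach would instead rank by an acquisition function that also accounts for predictive uncertainty.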
Fig. 3
ML-assisted directed evolution techniques identify high-fitness GB1 variants more frequently than simulated traditional DE approaches. Shown is the fraction of trials (y-axis) that reach a fitness less than or equal to a specified value (x-axis), where variants were selected either by a simulated traditional DE approach or with standard or regularized EI, PI, or UCB as the acquisition function. (Left) Expected improvement: The cumulative-weighted average fitness values are 7.25 for GP + EI + TPLM, 7.24 for GP + EI, and 7.16 for GP + EI + FoldX. (Middle) Probability of improvement: The cumulative-weighted average fitness values are 7.62 for GP + PI + TPLM, 7.17 for GP + PI, and 7.03 for GP + PI + FoldX. (Right) Upper confidence bound: The cumulative-weighted average fitness values are 7.76 for GP + UCB + TPLM, 7.10 for GP + UCB, and 6.38 for GP + UCB + FoldX. (All) The traditional single step and recombination approaches select variants with cumulative-weighted average fitness values of 5.22 and 4.71, respectively.
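For readers unfamiliar with the three acquisition functions named in this caption, the sketch below gives their textbook forms for a maximization problem, computed from a GP posterior mean `mu` and standard deviation `sigma` at a candidate sequence. The `regularized_score` helper is purely hypothetical: this abstract does not state how the regularization term enters the acquisition function, so the additive weighting shown is an assumption, not the authors' formulation.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and std 1

def expected_improvement(mu, sigma, best):
    """EI: expected amount by which a candidate exceeds the best score so far."""
    if sigma == 0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * _STD_NORMAL.cdf(z) + sigma * _STD_NORMAL.pdf(z)

def probability_of_improvement(mu, sigma, best):
    """PI: probability that a candidate exceeds the best score so far."""
    if sigma == 0:
        return float(mu > best)
    return _STD_NORMAL.cdf((mu - best) / sigma)

def upper_confidence_bound(mu, sigma, beta=2.0):
    """UCB: optimistic estimate; beta trades exploration against exploitation."""
    return mu + beta * sigma

def regularized_score(acquisition, regularizer, weight=1.0):
    """Hypothetical combination (an assumption, not the paper's form):
    add a weighted regularizer in which higher values are more desirable,
    e.g. a language-model log-odds or a negated predicted ddG."""
    return acquisition + weight * regularizer

# Example: a candidate whose posterior mean equals the incumbent best.
print(probability_of_improvement(1.0, 1.0, best=1.0))
print(expected_improvement(1.0, 1.0, best=1.0))
print(upper_confidence_bound(1.0, 0.5, beta=2.0))
```

The default `beta=2.0` is an arbitrary illustrative choice; in practice it is a tunable parameter.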
Fig. 4
Regularization leads to better designs. Shown are the cumulative per-batch scores for each protein averaged (± 1 SEM) over 100 trials. GP models were initialized with 20 randomly chosen sequences, and each batch consisted of 19 selected variants. (Left) GP + UCB + TPLM selected the GB1 variant with highest average fitness (7.76). (Middle) GP + EI + FoldX selected the BRCA1 variant with highest average E3 ubiquitin ligase activity (2.65). (Right) GP + UCB + FoldX selected the Spike variant with highest average ACE2 binding affinity (0.98).
Fig. 5
Evolutionary and structure-based regularization biases variant selections towards those that score favorably under multiple criteria. Shown are the regularization scores for variants selected for GB1 (Left), BRCA1 (Middle), and Spike (Right) under each selection criterion. As expected, variants selected by TPLM-regularized methods have higher log-odds under the TPLM than those selected by non-TPLM-regularized methods (Top). Similarly, variants selected by FoldX-regularized methods have lower ΔΔG values than those selected by non-FoldX methods (Bottom). The figures also show that TPLM-regularized methods tend to improve FoldX scores, and that FoldX-regularized methods tend to improve log-odds, indicating that there is some correlation between log-odds and thermodynamic stability.
Fig. 6
Bayesian selection techniques quickly identify informative sequence patterns. Shown is the per-batch average position-specific entropy of variant selections under the top scoring model for each protein. These include (Top) GP + UCB + TPLM for GB1, (Middle) GP + EI + FoldX for BRCA1, and (Bottom) GP + UCB + FoldX for Spike. Lighter squares denote low entropy decisions, meaning the model selects among fewer residue types at that position in that batch.
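Position-specific entropy for one batch of selected variants can be computed as below. This is a minimal sketch assuming Shannon entropy in bits over the residues observed at each position; the log base and any averaging across trials used for the figure are not stated here, and the example batch is made up.

```python
import math
from collections import Counter

def position_entropies(sequences):
    """Shannon entropy (in bits) of residue usage at each position
    of a batch of equal-length sequences."""
    n = len(sequences)
    entropies = []
    for column in zip(*sequences):          # iterate over alignment columns
        counts = Counter(column)
        entropies.append(sum((c / n) * math.log2(n / c)
                             for c in counts.values()))
    return entropies

# Hypothetical batch of four 4-residue variants.
batch = ["VDGV", "VDGA", "VEGL", "VEGI"]
print(position_entropies(batch))  # [0.0, 1.0, 0.0, 2.0]
```

An entropy of 0 means every variant in the batch carries the same residue at that position, matching the low entropy decisions that the caption describes.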
Fig. 7
Evolutionary and structure-based regularization biases variant selection towards sequences with desirable properties. Shown are sequence logos for the best performing variant selection method along with its unregularized counterpart. All four residues are shown for the GB1 protein (Left), whereas the positions that correspond to variants with the top five true activity/binding affinity scores are shown for BRCA1 (Middle) and Spike (Right). Highlighted residues denote notable distinctions between the regularized and unregularized sequence selections.
