Nat Commun. 2024 Jul 30;15(1):6405.
doi: 10.1038/s41467-024-50712-3

Neural network extrapolation to distant regions of the protein fitness landscape

Chase R Freschlin et al.
Abstract

Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust.
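The abstract's ensembling idea can be illustrated with a minimal sketch. The function below aggregates predictions from many independently trained models; a median aggregate behaves like the paper's EnsM and a low percentile like the conservative EnsC. All function and variable names here are illustrative stand-ins, not the authors' code.

```python
def ensemble_predict(models, sequence, percentile=50):
    """Aggregate fitness predictions from an ensemble of models.

    percentile=50 gives a median ensemble (analogous to EnsM);
    a low percentile such as 5 gives a conservative ensemble
    (analogous to EnsC). Names are illustrative, not the authors' API.
    """
    preds = sorted(m(sequence) for m in models)
    # nearest-rank percentile over the sorted predictions
    k = max(0, min(len(preds) - 1, round(percentile / 100 * (len(preds) - 1))))
    return preds[k]

# toy ensemble: constant predictors standing in for trained CNNs
models = [lambda s, v=v: v for v in [0.1, 0.4, 0.5, 0.7, 0.9]]
print(ensemble_predict(models, "MTYKL"))       # median -> 0.5
print(ensemble_predict(models, "MTYKL", 5))    # conservative -> 0.1
```

A conservative percentile downweights sequences on which the ensemble disagrees, which is one simple way to make design robust to individual-model overconfidence.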


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Extrapolation of sequence-function models.
a Supervised sequence-function models are trained on experimental data and can make predictions across the fitness landscape. ML-guided protein design seeks to identify high-fitness sequences and often involves model extrapolation beyond the training regime. b We tested five model architectures that capture distinct aspects of the underlying sequence-function landscape. c A collection of 100 CNN models and their divergence when predicting deep into sequence space along a mutational trajectory. The ensemble predictor EnsM represents the median of the 100 CNNs, while EnsC is the 5th percentile. d We trained models on GB1 single and double mutants and predicted the fitness of 1-, 2-, 3-, and 4-mutants. Spearman's rank correlation was calculated between each model's predicted fitness and the experimental fitness. e Model recall of the top 100 protein variants within a design budget. Recall represents the number of the true top 100 4-mutants that are present in a model's top N predictions, where N is the design budget. Optimal represents a theoretical model that always predicts the top N proteins. Shading represents 95% confidence intervals across 100 individually trained models, excluding EnsM and EnsC. The confidence interval is centered on the mean recall. Source data are provided as a Source Data file.
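The recall metric from panel e can be sketched as follows: count how many of the true top-k variants appear in the model's top-N predictions. Input names and the toy data are illustrative.

```python
def recall_at_budget(predicted_scores, true_scores, top_k=100, budget=1000):
    """Count of the true top-`top_k` variants recovered in the model's
    top-`budget` predictions (the recall metric shown in Fig. 1e).
    Inputs are dicts mapping variant -> score; names are illustrative."""
    true_top = set(sorted(true_scores, key=true_scores.get, reverse=True)[:top_k])
    model_top = set(sorted(predicted_scores, key=predicted_scores.get,
                           reverse=True)[:budget])
    return len(true_top & model_top)

# toy example with 5 variants
true = {"A": 3.0, "B": 2.5, "C": 1.0, "D": 0.2, "E": 0.1}
pred = {"A": 2.0, "B": 0.5, "C": 2.5, "D": 1.0, "E": 0.0}
print(recall_at_budget(pred, true, top_k=2, budget=2))  # only "A" is in both -> 1
```

The "Optimal" curve in the figure corresponds to a model whose `predicted_scores` ranking matches `true_scores` exactly, so recall equals min(top_k, budget).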
Fig. 2. ML-guided fitness landscape exploration.
a Supervised models infer the fitness landscape from sequence-function examples. We use simulated annealing (SA) to search through sequence space for designs with high predicted fitness. We perform hundreds of independent SA runs to broadly search sequence space, cluster designs to map distinct fitness peaks, and select the most fit sequence from each cluster (shown as a star). b We visualized all designs using multidimensional scaling (MDS) and found the designs occupy concentric rings emanating from wild-type GB1 with increasing numbers of mutations. c We colored the MDS visualization by model architecture and found individual models design sequences that occupy distinct regions of sequence space. d We calculated the sequence diversity across GB1 positions for the 10-mutant designs. We used Shannon entropy to quantify amino acid diversity at each position; low entropy indicates few amino acid options at a given site, while high entropy indicates many amino acid options. The low entropy for the LR and FCN indicates that each model repeatedly proposes the same mutations at the same positions. The convolutional models propose sequences with more diversity spread across more positions, especially in regions of positive epistasis. Source data are provided as a Source Data file.
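The SA-based design loop described in panel a can be sketched in a few lines: propose random point mutations, always accept improvements in predicted fitness, and accept worse moves with a temperature-dependent probability. The cooling schedule, parameter names, and toy fitness function below are illustrative assumptions, not the authors' exact settings.

```python
import math
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"

def simulated_annealing(seq, fitness, steps=1000, t0=1.0, t_min=0.01, seed=0):
    """Search sequence space for high predicted fitness via SA.
    `fitness` is any scoring function (a stand-in for a trained model)."""
    rng = random.Random(seed)
    current = list(seq)
    best, best_f = seq, fitness(seq)
    cur_f = best_f
    for step in range(steps):
        t = t0 * (t_min / t0) ** (step / steps)  # geometric cooling schedule
        cand = current[:]
        cand[rng.randrange(len(cand))] = rng.choice(AAS)  # random point mutation
        cand_f = fitness("".join(cand))
        # accept improvements always, worse moves with Boltzmann probability
        if cand_f >= cur_f or rng.random() < math.exp((cand_f - cur_f) / t):
            current, cur_f = cand, cand_f
            if cur_f > best_f:
                best, best_f = "".join(current), cur_f
    return best, best_f

# toy fitness: number of residues matching a hidden target sequence
target = "MKTAYIA"
score = lambda s: sum(a == b for a, b in zip(s, target))
print(simulated_annealing("AAAAAAA", score, steps=5000))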
Fig. 3. Experimental characterization of ML-designed GB1s.
a An overview of our yeast surface display method to measure IgG-binding of GB1 designs. b FACS scatter plots and sorting gates for the library sorting experiments. In our first, qualitative experiment, we sorted variants into bind (blue) and display-only (green) populations. Events outside of these gates (red) were not sorted. In our second, quantitative experiment, the library was sorted into display-only (green), low-bind (purple), wt-equal binding (pink), or high-bind (tan) categories. The negative population (red) and events falling outside of gates (gray) were not sorted. c The binding and display scores for each model as a function of the Hamming distance from wild-type GB1. Hamming distance reports the number of positions at which two aligned sequences differ. In the plot, each point corresponds to a single design and the shaded region specifies the threshold between functional/nonfunctional and displaying/non-displaying. d A scatter plot of display and binding scores for each design. The gray-shaded regions specify the thresholds between functional/nonfunctional and displaying/non-displaying. The percentage of designs falling within each quadrant is specified in each quadrant's corner. e The distribution of 5-mutant and 10-mutant designs categorized as high-bind, wt-equal, low-bind, or inactive from the quantitative experiment. Most designs beyond 10 mutations were inactive. Source data are provided as a Source Data file.
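The Hamming distance used as the x-axis in panel c is simply a positionwise mismatch count over aligned, equal-length sequences; a minimal implementation:

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length aligned sequences
    differ (the distance-from-wild-type axis in Fig. 3c)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

print(hamming_distance("MTYKLILNGK", "MTYALILNGQ"))  # differs at 2 positions -> 2
```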
Fig. 4. AlphaFold predictions of ML-designed GB1s.
a Predicted structures across models and Hamming distances. Each model-distance combination shows 41 overlaid structures, one per design. b UMAP visualization of predicted structures showing clustering and organization with functional status, Hamming distance, and design model. Source data are provided as a Source Data file.
Fig. 5. Validation of high-throughput yeast display screen.
a Clonal yeast display assay to verify designs’ display and IgG binding properties. The display and binding signals were normalized to wild-type GB1. b IgG binding curves for wild-type GB1 and designs FCN-5 (a 5-mutant designed by FCN), EnsC-5 (a 5-mutant designed by EnsC), CNN-10 (a 10-mutant designed by a CNN) and EnsC-20. We analyzed cells using flow cytometry and the normalized binding signal is the ratio of IgG binding to display level. We estimated the KD and max binding signal parameters by fitting the data to the Hill equation. c AlphaFold2 predicted structure of the binding 20-mutant variant EnsC-20. The mutated residues are shown as sticks and highlighted in teal. IgG was included from the GB1 crystal structure (PDB: 1FCC). Source data are provided as a Source Data file.
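The binding curves in panel b are fit to the Hill equation. A minimal sketch of that model, with generic parameter names (not the authors' exact fit variables):

```python
def hill_binding(ligand_conc, kd, n_hill, bmax):
    """Normalized binding signal versus ligand (IgG) concentration under
    the Hill equation, the model fit to the curves in Fig. 5b.
    kd: dissociation constant; n_hill: Hill coefficient; bmax: max signal."""
    return bmax * ligand_conc ** n_hill / (kd ** n_hill + ligand_conc ** n_hill)

# at [ligand] = KD the signal is half of bmax, for any Hill coefficient
print(hill_binding(10.0, kd=10.0, n_hill=1.0, bmax=1.0))  # -> 0.5
```

Fitting KD and bmax from titration data then amounts to least-squares optimization of this function against the measured binding signals.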

