Nat Commun. 2024 Jul 30;15(1):6405.
doi: 10.1038/s41467-024-50712-3

Neural network extrapolation to distant regions of the protein fitness landscape

Chase R Freschlin et al.
Abstract

Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust.
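The abstract's ensembling idea can be illustrated with a minimal sketch. The function below aggregates predictions from many independently trained models; a median aggregate behaves like the paper's EnsM and a low percentile like the conservative EnsC. All function and variable names here are illustrative stand-ins, not the authors' code.

```python
def ensemble_predict(models, sequence, percentile=50):
    """Aggregate fitness predictions from an ensemble of models.

    percentile=50 gives a median ensemble (analogous to EnsM);
    a low percentile such as 5 gives a conservative ensemble
    (analogous to EnsC). Names are illustrative, not the authors' API.
    """
    preds = sorted(m(sequence) for m in models)
    # nearest-rank percentile over the sorted predictions
    k = max(0, min(len(preds) - 1, round(percentile / 100 * (len(preds) - 1))))
    return preds[k]

# toy ensemble: constant predictors standing in for trained CNNs
models = [lambda s, v=v: v for v in [0.1, 0.4, 0.5, 0.7, 0.9]]
print(ensemble_predict(models, "MTYKL"))       # median -> 0.5
print(ensemble_predict(models, "MTYKL", 5))    # conservative -> 0.1
```

A conservative percentile downweights sequences on which the ensemble disagrees, which is one simple way to make design robust to individual-model overconfidence.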


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Extrapolation of sequence-function models.
a Supervised sequence-function models are trained on experimental data and can make predictions across the fitness landscape. ML-guided protein design seeks to identify high-fitness sequences and often involves model extrapolation beyond the training regime. b We tested five model architectures that capture distinct aspects of the underlying sequence-function landscape. c A collection of 100 CNN models and their divergence when predicting deep into sequence space along a mutational trajectory. The ensemble predictor EnsM represents the median of the 100 CNNs, while EnsC is the 5th percentile. d We trained models on GB1 single and double mutants and predicted the fitness of 1-, 2-, 3-, and 4-mutants. Spearman's rank correlation was calculated between each model's predicted fitness and the experimental fitness. e Model recall of the top 100 protein variants within a design budget. Recall represents the number of the true top 100 4-mutants that are present in a model's top N predictions, where N is the design budget. Optimal represents a theoretical model that always predicts the top N proteins. Shading represents 95% confidence intervals across 100 individually trained models, excluding EnsM and EnsC. The confidence interval is centered on the mean recall. Source data are provided as a Source Data file.
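The recall metric from panel e can be sketched as follows: count how many of the true top-k variants appear in the model's top-N predictions. Input names and the toy data are illustrative.

```python
def recall_at_budget(predicted_scores, true_scores, top_k=100, budget=1000):
    """Count of the true top-`top_k` variants recovered in the model's
    top-`budget` predictions (the recall metric shown in Fig. 1e).
    Inputs are dicts mapping variant -> score; names are illustrative."""
    true_top = set(sorted(true_scores, key=true_scores.get, reverse=True)[:top_k])
    model_top = set(sorted(predicted_scores, key=predicted_scores.get,
                           reverse=True)[:budget])
    return len(true_top & model_top)

# toy example with 5 variants
true = {"A": 3.0, "B": 2.5, "C": 1.0, "D": 0.2, "E": 0.1}
pred = {"A": 2.0, "B": 0.5, "C": 2.5, "D": 1.0, "E": 0.0}
print(recall_at_budget(pred, true, top_k=2, budget=2))  # only "A" is in both -> 1
```

The "Optimal" curve in the figure corresponds to a model whose `predicted_scores` ranking matches `true_scores` exactly, so recall equals min(top_k, budget).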
Fig. 2. ML-guided fitness landscape exploration.
a Supervised models infer the fitness landscape from sequence-function examples. We use simulated annealing (SA) to search through sequence space for designs with high predicted fitness. We perform hundreds of independent SA runs to broadly search sequence space, cluster designs to map distinct fitness peaks, and select the most fit sequence from each cluster (shown as a star). b We visualized all designs using multidimensional scaling (MDS) and found the designs occupy concentric rings emanating from wild-type GB1 with increasing numbers of mutations. c We colored the MDS visualization by model architecture and found individual models design sequences that occupy distinct regions of sequence space. d We calculated the sequence diversity across GB1 positions for the 10-mutant designs. We used Shannon entropy to quantify amino acid diversity at each position; low entropy indicates few amino acid options at a given site, while high entropy indicates many amino acid options. The low entropy for the LR and FCN indicates that each model repeatedly proposes the same mutations at the same positions. The convolutional models propose sequences with more diversity spread across more positions, especially in regions of positive epistasis. Source data are provided as a Source Data file.
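The SA-based design loop described in panel a can be sketched in a few lines: propose random point mutations, always accept improvements in predicted fitness, and accept worse moves with a temperature-dependent probability. The cooling schedule, parameter names, and toy fitness function below are illustrative assumptions, not the authors' exact settings.

```python
import math
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"

def simulated_annealing(seq, fitness, steps=1000, t0=1.0, t_min=0.01, seed=0):
    """Search sequence space for high predicted fitness via SA.
    `fitness` is any scoring function (a stand-in for a trained model)."""
    rng = random.Random(seed)
    current = list(seq)
    best, best_f = seq, fitness(seq)
    cur_f = best_f
    for step in range(steps):
        t = t0 * (t_min / t0) ** (step / steps)  # geometric cooling schedule
        cand = current[:]
        cand[rng.randrange(len(cand))] = rng.choice(AAS)  # random point mutation
        cand_f = fitness("".join(cand))
        # accept improvements always, worse moves with Boltzmann probability
        if cand_f >= cur_f or rng.random() < math.exp((cand_f - cur_f) / t):
            current, cur_f = cand, cand_f
            if cur_f > best_f:
                best, best_f = "".join(current), cur_f
    return best, best_f

# toy fitness: number of residues matching a hidden target sequence
target = "MKTAYIA"
score = lambda s: sum(a == b for a, b in zip(s, target))
print(simulated_annealing("AAAAAAA", score, steps=5000))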
Fig. 3. Experimental characterization of ML-designed GB1s.
a An overview of our yeast surface display method to measure IgG-binding of GB1 designs. b FACS scatter plots and sorting gates for the library sorting experiments. In our first, qualitative experiment, we sorted variants into bind (blue) and display-only (green) populations. Events outside of these gates (red) were not sorted. In our second, quantitative experiment, the library was sorted into display-only (green), low-bind (purple), wt-equal binding (pink), or high-bind (tan) categories. The negative population (red) and events falling outside of gates (gray) were not sorted. c The binding and display scores for each model as a function of the Hamming distance from wild-type GB1. Hamming distance reports the number of positions at which two aligned sequences differ. In the plot, each point corresponds to a single design and the shaded region specifies the threshold between functional/nonfunctional and displaying/non-displaying. d A scatter plot of display and binding scores for each design. The gray-shaded regions specify the thresholds between functional/nonfunctional and displaying/non-displaying. The percentage of designs falling within each quadrant is specified in each quadrant's corner. e The distribution of 5-mutant and 10-mutant designs categorized as high-bind, wt-equal, low-bind, or inactive from the quantitative experiment. Most designs beyond 10 mutations were inactive. Source data are provided as a Source Data file.
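The Hamming distance used as the x-axis in panel c is simply a positionwise mismatch count over aligned, equal-length sequences; a minimal implementation:

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length aligned sequences
    differ (the distance-from-wild-type axis in Fig. 3c)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

print(hamming_distance("MTYKLILNGK", "MTYALILNGQ"))  # differs at 2 positions -> 2
```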
Fig. 4. AlphaFold predictions of ML-designed GB1s.
a Predicted structures across models and Hamming distances. Each model-distance combination shows 41 overlaid structures, one per design. b UMAP visualization of predicted structures showing clustering and organization with functional status, Hamming distance, and design model. Source data are provided as a Source Data file.
Fig. 5. Validation of high-throughput yeast display screen.
a Clonal yeast display assay to verify designs’ display and IgG binding properties. The display and binding signals were normalized to wild-type GB1. b IgG binding curves for wild-type GB1 and designs FCN-5 (a 5-mutant designed by FCN), EnsC-5 (a 5-mutant designed by EnsC), CNN-10 (a 10-mutant designed by a CNN) and EnsC-20. We analyzed cells using flow cytometry and the normalized binding signal is the ratio of IgG binding to display level. We estimated the KD and max binding signal parameters by fitting the data to the Hill equation. c AlphaFold2 predicted structure of the binding 20-mutant variant EnsC-20. The mutated residues are shown as sticks and highlighted in teal. IgG was included from the GB1 crystal structure (PDB: 1FCC). Source data are provided as a Source Data file.
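The binding curves in panel b are fit to the Hill equation. A minimal sketch of that model, with generic parameter names (not the authors' exact fit variables):

```python
def hill_binding(ligand_conc, kd, n_hill, bmax):
    """Normalized binding signal versus ligand (IgG) concentration under
    the Hill equation, the model fit to the curves in Fig. 5b.
    kd: dissociation constant; n_hill: Hill coefficient; bmax: max signal."""
    return bmax * ligand_conc ** n_hill / (kd ** n_hill + ligand_conc ** n_hill)

# at [ligand] = KD the signal is half of bmax, for any Hill coefficient
print(hill_binding(10.0, kd=10.0, n_hill=1.0, bmax=1.0))  # -> 0.5
```

Fitting KD and bmax from titration data then amounts to least-squares optimization of this function against the measured binding signals.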

