ACS Synth Biol. 2023 Sep 15;12(9):2600-2615.

doi: 10.1021/acssynbio.3c00196. Epub 2023 Aug 29.

Predicting and Interpreting Protein Developability Via Transfer of Convolutional Sequence Representation

Alexander W Golinski¹, Zachary D Schmitz¹, Gregory H Nielsen¹, Bryce Johnson¹, Diya Saha¹, Sandhya Appiah¹, Benjamin J Hackel¹, Stefano Martiniani¹,²,³,⁴

Affiliations

¹ Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, Minnesota 55455, United States.
² Center for Soft Matter Research, Department of Physics, New York University, New York, New York 10003, United States.
³ Simons Center for Computational Physical Chemistry, Departments of Chemistry, New York University, New York, New York 10003, United States.
⁴ Courant Institute of Mathematical Sciences, New York University, New York, New York 10003, United States.

PMID: 37642646
PMCID: PMC10829850
DOI: 10.1021/acssynbio.3c00196

Alexander W Golinski et al. ACS Synth Biol. 2023.

Abstract

Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability (quantified by expression, solubility, and stability) hinders utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 10⁵ of 10²⁰ possible variants of protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from an HT dataset and transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased recombinant expression through nonlinear dimensionality reduction, and we explore the inferred expression landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space, giving rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold expression from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.

Keywords: developability; landscape; model; predictive; protein; sequence.

Figures

**Figure 1. Prediction of protein developability via transfer learning.**
A sequence-based model to predict developability is trained in two steps. Task 1 (blue, top): The large database of protein assay scores is used to train a mapping (Model A1) from amino acid sequence to HT assay scores through a learned developability latent space representation (DevRep). Task 2 (orange, bottom): By transferring the representation, the expression yield (a traditional metric of developability) can be predicted when training a top model with a smaller dataset.

**Figure 2. Protein embedding strategies based on interacting amino acid properties predict HT developability assay scores.**
a) The Gp2 paratope residues are embedded as learned amino acid properties and are combined via three different strategies into a developability representation, identified via a red outline. b) Embedded and non-embedded (OH) architectures were trained to predict assay scores via cross-validation (CV) and evaluated on an independent test set of sequences (independent 2-way Student’s t-test for embeddings vs. non-embeddings p<0.05). c) The convolutional architecture’s predictions are compared to the true assay scores (Prot: protease resistance, GFP: soluble expression in split-GFP system, βLac: modularity in split-β-lactamase) as a kernel density plot. The number of unique Gp2 variants and the Spearman’s rank correlation are displayed.
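The convolutional strategy in this caption can be illustrated with a minimal forward pass: each residue is looked up in a table of learned properties, a 1D convolution combines neighboring residues, and pooling yields a fixed-length representation. This is only a sketch under stated assumptions, not the authors' architecture. The 19-dimensional embedding size is taken from Figure 6; the filter count, kernel width, random (untrained) weights, example sequence, and all function names are hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def init_embedding(dim, rng):
    """Random stand-in for a learned amino acid property table."""
    return {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)] for aa in ALPHABET}


def init_kernels(n_filters, width, channels, rng):
    """kernels[f][k][c]: weight of filter f at offset k for property channel c."""
    return [[[rng.gauss(0.0, 0.1) for _ in range(channels)]
             for _ in range(width)]
            for _ in range(n_filters)]


def embed(seq, table):
    """Replace each residue by its property vector."""
    return [table[aa] for aa in seq]


def conv1d_relu(features, kernels, bias):
    """Valid 1D convolution along the sequence, followed by ReLU."""
    width = len(kernels[0])
    out = []
    for start in range(len(features) - width + 1):
        row = []
        for f, kernel in enumerate(kernels):
            total = bias[f]
            for k in range(width):
                for c, value in enumerate(features[start + k]):
                    total += kernel[k][c] * value
            row.append(max(0.0, total))
        out.append(row)
    return out


def global_max_pool(conv_out):
    """Keep the strongest response of each filter over all positions."""
    return [max(column) for column in zip(*conv_out)]


def developability_representation(seq, table, kernels, bias):
    """Sequence -> fixed-length vector (hypothetical analogue of a latent representation)."""
    return global_max_pool(conv1d_relu(embed(seq, table), kernels, bias))


rng = random.Random(0)
EMBED_DIM, N_FILTERS, WIDTH = 19, 4, 3   # only EMBED_DIM comes from the source
table = init_embedding(EMBED_DIM, rng)
kernels = init_kernels(N_FILTERS, WIDTH, EMBED_DIM, rng)
bias = [0.0] * N_FILTERS

example = "ACDKRWYG"  # hypothetical sequence, for illustration only
representation = developability_representation(example, table, kernels, bias)
```

In a trained model the table and kernels would be fitted to the HT assay scores; a one-hot (OH) control would feed 20-channel indicator vectors into the same pipeline instead of the learned table.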

**Figure 3. Transferred convolutional embedding predicts yield more accurately than traditional embedding strategy.**
a) Cross-validation and test performance for yield prediction, comparing a traditional OH embedding with protein-inspired embeddings trained on HT assay scores. b) Yields predicted by the convolutional embedding with a support vector machine top model versus experimentally measured yields across *E. coli* strains I^q and SH.

**Figure 4. On-yeast protease assay is most informative and transfer learning enables discovery of true signal from imperfect HT assay proxies.**
a) A developability representation and top model to yield were trained with combinations of HT assays. The prediction error of sequence yield is grouped by assay combination and colored by embedding architecture. Error bars represent standard deviation of loss from N = 10 stochastically trained embeddings and top models. b) Yield predictions from assay scores and from the most accurate trained embeddings for each combination of HT assays suggest that transfer learning is more accurate than models that take the experimental assay scores as input.

**Figure 5. Alternative model cross-validation and test performance.**
a) DevRep controls (first outlined in Figure 3a). b) Predicted high-throughput assays are used to predict yield. c) Sequence-to-yield model trained on yields predicted from experimental HT data.

**Figure 6. Analysis of trained embeddings reveals properties related to developability.**
a) Principal components (PC) of the 19-dimensional amino acid embedding, colored by category of residue. EV = explained variance. b) Inter- and intra-category residue distances, highlighting the uniqueness of cysteine and the lack of difference between aromatic and aliphatic residues. c) Clusters of sequences were identified via UMAP and HDBSCAN of the 45,433 sequences used for training. d) Developability, as predicted by yield, varies between clusters trained on HT assay scores.

**Figure 7. HT assay trained embedding contains more developability information than alternative embeddings.**
a) Comparison of protein representations’ ability to predict yield as represented by the loss on an independent set of sequences. b) Variants were plotted using UMAP for each embedding. *(top)* Color represents experimentally measured developability. *(bottom)* Sequences were clustered by UMAP coordinates. Color represents unique clusters. c) Variance in predicted yield across sequences within a given cluster. d) The correlation between the intracluster yield variance and the corresponding models’ (trained using the same embedding) predictive performance confirms that models that cluster sequences with similar yield also achieve better predictive performance, indicating that the embedding is informative about the predicted quantity (yield).

**Figure 8. Nested sampling characterizes the developability-sequence landscape.**
a) Nested sampling was performed using 100 evolving sequences while accepting mutations with yields above the threshold per iteration. The threshold yield and corresponding sequences were determined by the lowest yield of the evolving sequences. b) The density of states for each level of developability was determined and used to estimate the expected developability, heat capacity, and entropy at various inverse temperatures (selective pressure in this context). Two main phase transitions are identified with a dashed line. c,d) The UMAP representation displays the landscape splitting into distinct clusters of DevRep space above the transition. Recorded sequences’ predicted developabilities increase from red to purple. e) The disconnectivity plot for the sequence space displays a landscape with competing developability peaks (when β grows large enough that a lower peak becomes depleted and a higher one enriched, we observe a phase transition).
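The procedure in panels a and b can be sketched in a few lines: the lowest-yield live sequence sets the threshold, it is replaced by a mutated copy of a surviving sequence that accepts only mutations above that threshold, and the recorded thresholds are then reweighted to obtain averages at inverse temperature β. This is a minimal sketch, not the authors' implementation. `toy_yield` is a hypothetical scoring function standing in for the trained yield predictor, and the sequence length, walk length, and the standard shrink factor K/(K+1) per iteration are assumptions. Ties between discrete yield levels are not treated specially, which biases the volume estimate on a landscape this degenerate.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_yield(seq):
    """Hypothetical stand-in for a yield predictor (not the paper's model)."""
    score = 0.0
    for aa in seq:
        if aa == "C":
            score -= 1.0
        elif aa in "DEKR":
            score += 0.5
    return score


def mutate(seq, rng):
    """Substitute one randomly chosen position."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


def nested_sampling(n_live=100, seq_len=8, n_iter=300, n_steps=20, seed=0):
    """Return the sequence of threshold yields, one per iteration."""
    rng = random.Random(seed)
    live = ["".join(rng.choice(ALPHABET) for _ in range(seq_len))
            for _ in range(n_live)]
    levels = []
    for _ in range(n_iter):
        # The lowest-yield live sequence defines the current threshold.
        worst = min(range(n_live), key=lambda i: toy_yield(live[i]))
        threshold = toy_yield(live[worst])
        levels.append(threshold)
        # Replace it with a copy of a surviving sequence, then random-walk,
        # accepting only mutations whose yield exceeds the threshold.
        donor = rng.choice([i for i in range(n_live) if i != worst])
        walker = live[donor]
        for _ in range(n_steps):
            candidate = mutate(walker, rng)
            if toy_yield(candidate) > threshold:
                walker = candidate
        live[worst] = walker
    return levels


def canonical_averages(levels, n_live, beta):
    """Expected yield and heat capacity at inverse temperature beta (k_B = 1).

    Level i carries the sequence-space volume shrink**i * (1 - shrink) and
    the Boltzmann-like factor exp(beta * yield), so larger beta favors
    higher yield.
    """
    shrink = n_live / (n_live + 1.0)
    log_terms = [i * math.log(shrink) + math.log(1.0 - shrink) + beta * y
                 for i, y in enumerate(levels)]
    top = max(log_terms)  # subtract the maximum for numerical stability
    weights = [math.exp(t - top) for t in log_terms]
    z = sum(weights)
    mean = sum(w * y for w, y in zip(weights, levels)) / z
    mean_sq = sum(w * y * y for w, y in zip(weights, levels)) / z
    return mean, beta ** 2 * (mean_sq - mean ** 2)
```

Scanning `beta` over a grid and looking for peaks in the returned heat capacity is one way to locate the kind of transitions marked by the dashed lines in panel b.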

**Figure 9. Assessment of DevRep-suggested high developability variants.**
a) Sequence embeddings identified through either nested sampling (left) or simulated annealing (right) strategies were clustered via UMAP (top) (Note: we only show the DevRep embedding here). The highest predicted yield variants in each cluster were equally sampled to determine 100 sequences. These variants represent a diverse set of sequences for experimental testing (bottom). b) Predicted developability distributions according to DevRep using equal inter-cluster sampling techniques across the sequence variants using different embeddings as in (a). c) UMAP visualization of top developability variants according to DevRep. Note that the UMAP visualizations of suggested top developability variants for nested sampling and simulated annealing in a) are shown in aggregate in c).

**Figure 10. DevRep enables design of developable protein variants.**
a) The predicted versus actual developability of 280 I^q and 269 SH variants identified via sampling strategies (see Figures 9, S6, and S7). b) Sequences generated by each embedding and sampling strategy are compared to each other and to a selection of randomly generated sequences. c) An additional set of sequences identified via nested sampling of DevRep and UniRep were also compared. These sequences were designed to be more developable and more similar in embedding space. d) Each sequence in (c) was compared to the set of sequences with measured yield that was used during model training. The distribution shown is broken down by the model used to generate the sequences.

Cited by

Sequence-developability mapping of affibody and fibronectin paratopes via library-scale variant characterization.
Nielsen GH, Schmitz ZD, Hackel BJ. Protein Eng Des Sel. 2024 Jan 29;37:gzae010. doi: 10.1093/protein/gzae010. PMID: 38836499.
Engineering Affibody Binders to Death Receptor 5 and Tumor Necrosis Factor Receptor 1 With Improved Stability.
Nielsen GH, Sachs JN, Hackel BJ. Biotechnol Bioeng. 2025 Jun;122(6):1386-1396. doi: 10.1002/bit.28954. Epub 2025 Mar 5. PMID: 40045532.
Multi-Objective Design of DNA-Stabilized Nanoclusters Using Variational Autoencoders With Automatic Feature Extraction.
Sadeghi E, Mastracco P, Gonzàlez-Rosell A, Copp SM, Bogdanov P. ACS Nano. 2024 Oct 1;18(39):26997-27008. doi: 10.1021/acsnano.4c09640. Epub 2024 Sep 17. PMID: 39288200.
