ACS Synth Biol. 2023 Sep 15;12(9):2600-2615.
doi: 10.1021/acssynbio.3c00196. Epub 2023 Aug 29.

Predicting and Interpreting Protein Developability Via Transfer of Convolutional Sequence Representation

Alexander W Golinski et al. ACS Synth Biol.

Abstract

Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability (quantified by expression, solubility, and stability) hinders utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 10^5 of 10^20 possible variants of protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from an HT dataset and transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased recombinant expression through nonlinear dimensionality reduction, and we explore the inferred expression landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space, giving rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold expression from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.
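To make "convolves learned amino acid properties" concrete, here is a minimal sketch of embedding a sequence and sliding convolution kernels over it. It is an illustration under stated assumptions, not the authors' architecture: the 19-dimensional embedding size is taken from the Figure 6 caption, while the kernel count, kernel width, max-pooling step, random (untrained) weights, example sequence, and all function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def embed(seq, table):
    """Look up one embedding row per residue; returns (len(seq), emb_dim)."""
    return table[[AA_INDEX[aa] for aa in seq]]


def conv1d_valid(x, kernels):
    """Slide each kernel along the sequence axis, then apply ReLU.

    x:       (L, D) embedded sequence
    kernels: (K, W, D) K kernels spanning W adjacent residues
    returns: (L - W + 1, K) feature map
    """
    n_kernels, width, _ = kernels.shape
    n_pos = x.shape[0] - width + 1
    out = np.empty((n_pos, n_kernels))
    for p in range(n_pos):
        window = x[p:p + width]  # (W, D) slice of adjacent residues
        out[p] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)


def represent(seq, table, kernels):
    """Max-pool over positions to get a fixed-length representation (assumed)."""
    return conv1d_valid(embed(seq, table), kernels).max(axis=0)


# Hypothetical, untrained parameters; in practice these would be learned
# from the HT assay scores.
rng = np.random.default_rng(0)
table = rng.normal(size=(20, 19))      # 20 amino acids x 19 learned properties
kernels = rng.normal(size=(8, 3, 19))  # 8 kernels of width 3 (assumed)
rep = represent("ACDEFGHIKL", table, kernels)
print(rep.shape)
```

In a transfer setting of the kind the abstract describes, a vector like `rep` would be fed to a separate top model trained on the smaller yield dataset.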

Keywords: developability; landscape; model; predictive; protein; sequence.

Figures

Figure 1. Prediction of protein developability via transfer learning.
A sequence-based model to predict developability is trained in two steps. Task 1 (blue, top): The large database of protein assay scores is used to train a mapping (Model A1) from amino acid sequence to HT assay scores through a learned developability latent space representation (DevRep). Task 2 (orange, bottom): By transferring the representation, the expression yield (a traditional metric of developability) can be predicted when training a top model with a smaller dataset.
Figure 2. Protein embedding strategies based on interacting amino acid properties predict HT developability assay scores.
a) The Gp2 paratope residues are embedded as learned amino acid properties and are combined via three different strategies into a developability representation, identified via a red outline. b) Embedded and non-embedded (OH) architectures were trained to predict assay scores via cross-validation (CV) and evaluated on an independent test set of sequences (independent 2-way Student’s t-test for embeddings vs. non-embeddings p<0.05). c) The convolutional architecture’s predictions are compared to the true assay scores (Prot: protease resistance, GFP: soluble expression in split-GFP system, βLac: modularity in split-β-lactamase) as a kernel density plot. The number of unique Gp2 variants and the Spearman’s rank correlation are displayed.
Figure 3. Transferred convolutional embedding predicts yield more accurately than traditional embedding strategy.
a) Cross-validation and test performance for yield prediction, comparing a traditional OH embedding to protein-inspired embeddings trained on HT assay scores. b) Yields predicted by the convolutional embedding with a support vector machine top model versus experimentally measured yields across E. coli strains Iq and SH.
Figure 4. On-yeast protease assay is most informative and transfer learning enables discovery of true signal from imperfect HT assay proxies.
a) A developability representation and top model to yield were trained with combinations of HT assays. The prediction error of sequence yield is grouped by assay combination and colored by embedding architecture. Error bars represent the standard deviation of loss from N = 10 stochastically trained embeddings and top models. b) Yield predictions from assay scores and from the most accurate trained embeddings for each combination of HT assays suggest that transfer learning is more accurate than models that take the experimental assay scores as input.
Figure 5. Alternative model cross-validation and test performance.
a) DevRep controls (first outlined in Figure 3a). b) Predicted high-throughput assays are used to predict yield. c) Sequence-to-yield model trained on yields predicted from experimental HT data.
Figure 6. Analysis of trained embeddings reveals properties related to developability.
a) Principal components (PC) of the 19-dimensional amino acid embedding, colored by category of residue. EV = explained variance. b) Inter- and intra- residue category distances highlighting the uniqueness of cysteine and lack of difference between aromatic and aliphatic residues. c) Clusters of sequences were identified via UMAP and hdbscan of the 45,433 sequences used for training. d) Developability, as predicted by yield, varies between clusters trained on HT assay scores.
Figure 7. HT assay trained embedding contains more developability information than alternative embeddings.
a) Comparison of protein representations’ ability to predict yield as represented by the loss on an independent set of sequences. b) Variants were plotted using UMAP for each embedding. (top) Color represents experimentally measured developability. (bottom) Sequences were clustered by UMAP coordinates. Color represents unique clusters. c) Variance in predicted yield across sequences within a given cluster. d) The correlation between the intracluster yield variance and the corresponding models’ (trained using the same embedding) predictive performance confirms that models that cluster sequences with similar yield also achieve better predictive performance, indicating that the embedding is informative about the predicted quantity (yield).
Figure 8. Nested sampling characterizes the developability-sequence landscape.
a) Nested sampling was performed using 100 evolving sequences while accepting mutations with yields above the threshold per iteration. The threshold yield and corresponding sequences were determined by the lowest yield of the evolving sequences. b) The density of states for each level of developability was determined and used to estimate the expected developability, heat capacity, and entropy at various inverse temperatures (selective pressure in this context). Two main phase transitions are identified with a dashed line. c,d) The UMAP representation displays the landscape splitting into distinct clusters of DevRep space above the transition. Recorded sequences’ predicted developabilities increase from red to purple. e) The disconnectivity plot for the sequence space displays a landscape with competing developability peaks. A phase transition is observed when β grows large enough that a lower peak becomes depleted and a higher one enriched.
Figure 9. Assessment of DevRep-suggested high developability variants.
a) Sequence embeddings identified through either nested sampling (left) or simulated annealing (right) strategies were clustered via UMAP (top) (Note: we only show the DevRep embedding here). The highest predicted yield variants in each cluster were equally sampled to determine 100 sequences. These variants represent a diverse set of sequences for experimental testing (bottom). b) Predicted developability distributions according to DevRep using equal inter-cluster sampling techniques across the sequence variants using different embeddings as in (a). c) UMAP visualization of top developability variants according to DevRep. Note that the UMAP visualizations of suggested top developability variants for nested sampling and simulated annealing in a) are shown in aggregate in c).
Figure 10. DevRep enables design of developable protein variants.
a) The predicted versus actual developability of 280 Iq and 269 SH variants identified via sampling strategies (see Figures 9, S6, and S7). b) Sequences generated by each embedding and sampling strategy are compared to each other and to a selection of randomly generated sequences. c) An additional set of sequences identified via nested sampling of DevRep and UniRep were also compared. These sequences were designed to be more developable and more similar in embedding space. d) Each sequence in (c) was compared to the set of sequences with measured yield that was used during model training. The distribution shown is broken down by the model used to generate the sequences.
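The quantities named in the Figure 8b caption (expected developability, heat capacity, and entropy as functions of inverse temperature β) can be estimated from nested sampling output roughly as follows. This is a hedged sketch using the textbook nested sampling estimate, in which the enclosed sequence-space volume shrinks by a factor K/(K+1) per iteration for K live sequences; whether the paper uses exactly this estimator is an assumption. The walker count of 100 comes from the Figure 8a caption; the threshold yields below are synthetic placeholders, and the function names are hypothetical.

```python
import numpy as np


def nested_sampling_weights(n_iter, n_live):
    """Volume of sequence space discarded at each iteration.

    Assumes the standard estimate X_i = (K / (K + 1))**i for K live points.
    """
    i = np.arange(n_iter + 1)
    volumes = (n_live / (n_live + 1.0)) ** i
    return volumes[:-1] - volumes[1:]


def thermodynamics(thresholds, weights, beta):
    """Expected yield, heat capacity, and entropy at inverse temperature beta.

    Higher yield is favored, so yield plays the role of negative energy:
    Z(beta) = sum_i w_i * exp(beta * y_i). Units with k_B = 1; entropy is
    defined up to an additive constant.
    """
    log_terms = np.log(weights) + beta * thresholds
    log_z = np.logaddexp.reduce(log_terms)  # numerically stable log of Z
    p = np.exp(log_terms - log_z)           # Boltzmann-like probabilities
    mean_y = np.sum(p * thresholds)
    var_y = np.sum(p * (thresholds - mean_y) ** 2)
    heat_capacity = beta ** 2 * var_y
    entropy = log_z - beta * mean_y
    return mean_y, heat_capacity, entropy


# Synthetic placeholder data: threshold yields that rise with each iteration.
n_iter, n_live = 500, 100
weights = nested_sampling_weights(n_iter, n_live)
thresholds = np.linspace(0.0, 3.0, n_iter)

for beta in (0.0, 1.0, 5.0):
    mean_y, c, s = thermodynamics(thresholds, weights, beta)
    print(f"beta={beta:.1f}  <yield>={mean_y:.3f}  C={c:.3f}  S={s:.3f}")
```

Scanning β on a fine grid and looking for peaks in the heat capacity is one common way to locate phase transitions of the kind marked in Figure 8b.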
