This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Dec 21:2023.12.19.572475.

doi: 10.1101/2023.12.19.572475.

Inverse folding of protein complexes with a structure-informed language model enables unsupervised antibody evolution

Varun R Shanker¹,²,³, Theodora U J Bruun²,³,⁴, Brian L Hie³,⁴, Peter S Kim³,⁴,⁵

Affiliations

¹ Stanford Biophysics Program, Stanford University School of Medicine, Stanford, CA 94305, USA.
² Stanford Medical Scientist Training Program, Stanford University School of Medicine, Stanford, CA 94305, USA.
³ Sarafan ChEM-H, Stanford University, Stanford, CA 94305, USA.
⁴ Department of Biochemistry, Stanford University School of Medicine, Stanford, CA 94305, USA.
⁵ Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.

PMID: 38187780
PMCID: PMC10769282
DOI: 10.1101/2023.12.19.572475

Varun R Shanker et al. bioRxiv. 2023.

Update in

Unsupervised evolution of protein and antibody complexes with a structure-informed language model.
Shanker VR, Bruun TUJ, Hie BL, Kim PS. Science. 2024 Jul 5;385(6704):46-53. doi: 10.1126/science.adk8946. Epub 2024 Jul 4. PMID: 38963838.

Abstract

Large language models trained on sequence information alone are capable of learning high-level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here we show that a general protein language model augmented with protein structure backbone coordinates and trained on the inverse folding problem can guide evolution for diverse proteins without needing to explicitly model individual functional tasks. We demonstrate inverse folding to be an effective unsupervised, structure-based sequence optimization strategy that also generalizes to multimeric complexes by implicitly learning features of binding and amino acid epistasis. Using this approach, we screened ~30 variants of two therapeutic clinical antibodies used to treat SARS-CoV-2 infection and achieved up to 26-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants-of-concern BQ.1.1 and XBB.1.5, respectively. In addition to substantial overall improvements in protein function, we find inverse folding performs with leading experimental success rates among other reported machine learning-guided directed evolution methods, without requiring any task-specific training data.

Conflict of interest statement

Competing interests: V.R.S., B.L.H., and P.S.K. are named as inventors on a patent application applied for by Stanford University and the Chan Zuckerberg Biohub entitled “Antibody Compositions and Optimization Methods”.

Figures

**Figure 1. Guiding evolution of diverse proteins via inverse folding**
**(A)** The inverse folding problem refers to the prediction of a protein’s native amino acid sequence, given its three-dimensional backbone structure, which is conceptually analogous to the opposite problem solved by structure prediction tools like AlphaFold. **(B)** A hybrid autoregressive model integrates amino acid values and backbone structural information to evaluate the joint likelihood over all positions in a sequence. Amino acids from the protein sequence are tokenized (red), combined with geometric features extracted from a structural encoder (green), and modeled with an encoder-decoder transformer (purple). Sequences assigned high likelihoods by the model represent high confidence in folding into the input backbone structure. **(C)** Our structure-guided framework for protein design indirectly explores the underlying fitness landscape, without modeling a specific definition of fitness or requiring any task-specific training data, by constraining the search space to regions where the backbone fold is preserved. **(D)** High fitness sensitivity analysis reveals that multimodal input improves language model performance compared to sequence-only input across 10 proteins from diverse protein families (left). ‘Fraction High fitness’ is the fraction of the top ten single amino acid substitutions recommended by each model that are ranked in the top indicated percentile of all experimentally screened variants. A representative plot (right) demonstrates this metric for assessing enrichment of high-fitness MAPK1 mutations, with successfully predicted mutations highlighted (blue) on the empirical cumulative density function (ECDF) of the experimental data (black). The three different thresholds, as defined by percentiles, are also shown as dashed lines. Inverse folding predictions are more enriched, on average, for high fitness variants across various tested thresholds for high fitness classification.
bla, Beta-lactamase TEM; CALM1, Calmodulin-1; haeIIIM, Type II methyltransferase M.HaeIII; HRAS, GTPase HRas; MAPK1, Mitogen-activated protein kinase; TPMT, Thiopurine S-methyltransferase; TPK1, Thiamin pyrophosphokinase 1; UBI4, Polyubiquitin; UBE2I, SUMO-conjugating enzyme UBC9.
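
The ‘Fraction High fitness’ metric described in panel D can be illustrated with a minimal Python sketch. It assumes that model scores and experimental fitness values are available as dictionaries keyed by mutation name, and that "high fitness" means ranking within a chosen top fraction of all screened variants. The function name, the argument names, and the mutation labels in the demo are hypothetical and are not taken from the authors' code.

```python
import math


def fraction_high_fitness(model_scores, measured_fitness, top_k=10, top_fraction=0.10):
    """Fraction of the model's top_k recommended variants that rank within the
    top `top_fraction` of all experimentally screened variants."""
    # Only variants that were both scored and measured can be evaluated.
    shared = sorted(set(model_scores) & set(measured_fitness))
    if not shared:
        raise ValueError("no variants in common between scores and measurements")

    # High-fitness set: the best-measured variants, by rank.
    n_high = max(1, math.ceil(top_fraction * len(shared)))
    by_fitness = sorted(shared, key=lambda v: measured_fitness[v], reverse=True)
    high_fitness = set(by_fitness[:n_high])

    # Recommended set: the variants the model scores highest.
    by_score = sorted(shared, key=lambda v: model_scores[v], reverse=True)
    recommended = by_score[:top_k]

    hits = sum(1 for v in recommended if v in high_fitness)
    return hits / len(recommended)


if __name__ == "__main__":
    # Hypothetical toy data: eight single substitutions with made-up values.
    scores = {"A1G": 0.9, "A1V": 0.7, "L2F": 0.6, "L2P": -1.2,
              "K3R": 0.4, "K3E": -0.3, "D4N": 0.1, "D4G": -0.8}
    fitness = {"A1G": 1.4, "A1V": 0.9, "L2F": 1.1, "L2P": 0.1,
               "K3R": 1.2, "K3E": 0.5, "D4N": 0.8, "D4G": 0.3}
    print(fraction_high_fitness(scores, fitness, top_k=3, top_fraction=0.25))
```

Evaluating the same function at several values of `top_fraction` would correspond to the multiple percentile thresholds shown as dashed lines in the figure.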

**Figure 2. Inverse folding of antibody-antigen complexes resolves mutational landscapes by implicitly learning features of binding and protein epistasis**
**(A)** Spearman correlations for inverse folding and for the sequence-based modeling approaches ESM-1v and abYsis, reported for three antibodies screened against the corresponding influenza A HA subtypes H1, H3, and H9. Bars are colored by the type of model used: IF, Inverse Folding (green); LM, Language Model (orange); and MSA, Multiple Sequence Alignment (purple). Inverse folding was evaluated in three different settings: i) providing the entire antibody variable region and antigen complex (Ab-Ag), ii) providing only the antibody variable region (Ab only), and iii) providing only the single antibody variable region of the chain responsible for binding or being mutated (Ab VH only or Ab VH/VL only). Inverse folding implicitly learns features of binding and protein epistasis. For example, when scoring combinatorial mutations to CR9114 against H1 and H3, we find that the model has much higher performance (Spearman ρ = 0.65 for H1, 0.5 for H3) than a masked language model, ESM-1v (Spearman ρ = 0.08 for H1, 0.09 for H3), and a site-independent, alignment-based model, abYsis (Spearman ρ = 0.08 for H1, 0.1 for H3). This performance improvement is also consistent across the other combinatorial landscapes tested. **(B)** Scatter plots showing inverse folding predictions against experimentally determined dissociation constants of CR6261 against HA-H1 (left) and HA-H9 (right). The germline and mature sequences are highlighted on all plots as indicated in the legend. For visualization, all scatter plots omit points on the lower limit of quantitation. Further analysis of the effect of the assay limit on predictive performance is shown in Supplementary Figure 2. **(C)** Conceptual schematic representation of protein language model performance improvements with improved priors. Providing sequence and structural information of both the antibody and antigen enables inverse folding to most efficiently identify complex-destabilizing mutations and enrich for high fitness antibody variants.
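
The Spearman correlations reported in panel A compare the rank order of model scores with the rank order of measured binding. As a minimal sketch of that statistic, the Python below computes Spearman's ρ as the Pearson correlation of average ranks (so ties are handled). The function names and the toy values are hypothetical, and the choice to use −log₁₀(K_D) as the binding measure, so that larger values mean tighter binding, is an assumption rather than something stated in the caption.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values: model log-likelihoods and measured K_D (molar).
    log_likelihoods = [-1.9, -2.4, -1.2, -3.0, -2.1]
    kd_molar = [3e-9, 8e-8, 1e-9, 5e-7, 2e-8]
    binding = [-math.log10(kd) for kd in kd_molar]  # higher = tighter binding
    print(spearman_rho(log_likelihoods, binding))
```

Because the statistic depends only on ranks, any monotonic transformation of K_D gives the same magnitude; only the sign convention matters.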

**Figure 3. Inverse folding-guided evolution of antibodies improves neutralization potency and resilience**
**(A)** Each point represents the fold-change in IC₅₀ of pseudovirus neutralization for antibody variants with single amino acid mutations. Antibodies are tested against the viral strain represented in the input structure (Ly-1404: Wuhan; SA58: BA.1 Omicron). A dashed line is shown at a fold-change of 1, corresponding to no change. 35% of Ly-1404 variants and 30% of SA58 variants improved antibody potency (defined as 1.1-fold or higher improvement in IC₅₀ compared to wild-type). Among this subset of beneficial mutations, we identify single amino acid mutations that provide a 1.6-fold improvement in Ly-1404 IC₅₀ and a 2.6-fold improvement in SA58 IC₅₀. **(B)** Conceptual representation of viral evolution. Selection for immune evasion drives antibody escape, which fundamentally represents a dynamic change in the underlying fitness landscape for the antibody. This antigenic drift displaces a potent antibody from a peak on the previous fitness landscape (left) to a new starting point at lower activity (right). **(C)** Strip plots visualizing antibody evolution across two rounds. Each point shows the corresponding fold-change in IC₅₀ of pseudovirus neutralization for a designed variant and is colored according to the number of mutations it has (1-8). Consistent with preserving backbone fold, all 55 designed variants across both antibody evolutionary campaigns could be expressed. All round 1 variants are composed of only single amino acid changes, while beneficial mutations are combined in round 2. All round 2 variants have improved neutralization activity compared to their respective wild-type antibody (dotted line). **(D)** Pseudovirus neutralization curves are shown for the most potent evolved variant of each antibody, with its mutations annotated to the left. The top Ly-1404 variant, bearing seven amino acid substitutions in VH, achieves a 26-fold improvement in neutralization against BQ.1.1 (top). The top SA58 variant, bearing single amino acid mutations in both VH and VL, achieves an 11-fold improvement in neutralization against BQ.1.1 (bottom). **(E)** Residues at which mutations improve neutralization against either the structure-encoded strain, BQ.1.1, or both viral strains are highlighted with spheres for antibodies Ly-1404 (PDB 7MMO) and SA58 (PDB 7Y0W). Notably, beneficial mutations are identified both within the binding interface as well as distal to the antigen. Neutralization-enhancing mutations are labeled in Supplementary Figure 6.
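
The fold-change and hit-rate figures quoted in panel A can be reproduced from raw IC₅₀ values with a short calculation. The Python sketch below assumes that fold improvement is defined as the wild-type IC₅₀ divided by the variant IC₅₀ (so values above 1 indicate a more potent variant) and uses the 1.1-fold threshold stated in the caption. The function names, variant labels, and IC₅₀ values are hypothetical.

```python
def fold_improvement(ic50_wild_type, ic50_variant):
    """Wild-type IC50 divided by variant IC50; > 1 means the variant is more potent.

    The direction of the ratio is an assumption made for this sketch.
    """
    if ic50_wild_type <= 0 or ic50_variant <= 0:
        raise ValueError("IC50 values must be positive")
    return ic50_wild_type / ic50_variant


def fraction_improved(ic50_wild_type, variant_ic50s, threshold=1.1):
    """Return (hit rate, per-variant fold improvements).

    A variant counts as improved when its fold improvement is at least
    `threshold` (1.1-fold, following the figure caption).
    """
    if not variant_ic50s:
        raise ValueError("at least one variant is required")
    folds = {name: fold_improvement(ic50_wild_type, ic50)
             for name, ic50 in variant_ic50s.items()}
    n_improved = sum(1 for f in folds.values() if f >= threshold)
    return n_improved / len(folds), folds


if __name__ == "__main__":
    # Hypothetical IC50 values in ng/mL for a wild-type antibody and variants.
    wild_type_ic50 = 10.0
    variants = {"variant_1": 5.0, "variant_2": 10.0,
                "variant_3": 20.0, "variant_4": 4.0}
    hit_rate, folds = fraction_improved(wild_type_ic50, variants)
    print(hit_rate, folds)
```

The same hit-rate calculation, applied to all variants tested in a campaign, corresponds to a ‘fraction improved’ style of comparison between directed evolution methods.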

**Figure 4. Antibodies evolved for high potency also exhibit improved affinity**
**(A)** Ly-1404 antibody variants show a Spearman correlation of 0.47 between apparent affinity fold-change and potency fold-change. Improved affinity is observed to be necessary but not sufficient for improved neutralization activity. Four variants exhibit improved affinity but do not enhance neutralization. All variants with improved neutralization also display improved affinity. The top inverse folding Ly-1404 design with a 27-fold improvement in neutralization has a 9.5-fold improvement in affinity to BQ.1.1 RBD, as measured using BLI. **(C)** SA58 antibodies evolved for improved potency against BQ.1.1 also exhibit improved affinity against VOC XBB.1.5, up to 37-fold. **(B, D)** Representative traces of BLI binding assays for Ly-1404 and SA58 variants with improved affinity.

**Figure 5. Comparison to other machine learning-guided directed evolution methods**
‘Fraction improved’ refers to the hit rate of variants tested which are improved relative to a wild-type protein used as a starting point for directed evolution or a reference protein used as a design template. Higher hit rates indicate more efficient experimental exploration. Inverse folding achieves the highest hit rate with the lowest number of assay-labeled training data points to date.
