Science. 2024 Jul 5;385(6704):46-53.

doi: 10.1126/science.adk8946. Epub 2024 Jul 4.

Unsupervised evolution of protein and antibody complexes with a structure-informed language model

Varun R Shanker^{1,2,3}, Theodora U J Bruun^{2,3,4}, Brian L Hie^{3,4}, Peter S Kim^{3,4,5}

Affiliations

¹ Stanford Biophysics Program, Stanford University School of Medicine, Stanford, CA 94305, USA.
² Stanford Medical Scientist Training Program, Stanford University School of Medicine, Stanford, CA 94305, USA.
³ Sarafan ChEM-H, Stanford University, Stanford, CA 94305, USA.
⁴ Department of Biochemistry, Stanford University School of Medicine, Stanford, CA 94305, USA.
⁵ Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.

PMID: 38963838
PMCID: PMC11616794
DOI: 10.1126/science.adk8946

Abstract

Large language models trained on sequence information alone can learn high-level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here, we show that a general protein language model augmented with protein structure backbone coordinates can guide evolution for diverse proteins without the need to model individual functional tasks. We also demonstrate that ESM-IF1, which was only trained on single-chain structures, can be extended to engineer protein complexes. Using this approach, we screened about 30 variants of two therapeutic clinical antibodies used to treat severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. We achieved up to 25-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants of concern BQ.1.1 and XBB.1.5, respectively. These findings highlight the advantage of integrating structural information to identify efficient protein evolution trajectories without requiring any task-specific training data.

Conflict of interest statement

Competing interests

V.R.S., B.L.H., and P.S.K. are named as inventors on a patent application applied for by Stanford University and the Chan Zuckerberg Biohub entitled “Antibody Compositions and Optimization Methods”. B.L.H. acknowledges outside interest in Prox Biosciences as a scientific co-founder.

Figures

**Figure 1. Guiding evolution of diverse proteins with a structure-guided language model**
**(A)** The sequence design problem refers to the prediction of a protein amino acid sequence that will adopt the fold of a given three-dimensional backbone structure; this is conceptually analogous to the inverse of the problem solved by structure prediction tools like AlphaFold (12). **(B)** A hybrid autoregressive model (11) integrates amino acid values and backbone structural information to evaluate the joint likelihood over all positions in a sequence. Amino acids from the protein sequence are tokenized (red), combined with geometric features extracted from a structural encoder (green), and modeled with an encoder-decoder transformer (purple). **(C)** Our structure-guided framework for protein design indirectly explores the underlying fitness landscape, without modeling a specific definition of fitness or requiring any task-specific training data, by constraining the search space to regions where the backbone fold is preserved. **(D)** High fitness sensitivity analysis reveals that multimodal input improves language model performance compared to sequence-only input across 10 proteins from diverse protein families (left). ‘High Fitness Prediction Precision’ is the fraction of the top ten single amino acid substitution predictions that are experimentally determined to confer high protein fitness, defined as having an activity level above the specified percentile threshold among all experimentally screened variants. A representative plot (right) demonstrates this metric for assessing enrichment of high-fitness MAPK1 mutations. Given the vastness of the search space, finding any function-enhancing variant is valuable for most practical settings, and thus only successfully predicted mutations are highlighted (blue) on the empirical cumulative density function (ECDF) of the experimental data (black). The three different thresholds, as defined by percentiles, are also shown as dashed lines. Structure-informed language model predictions are more enriched, on average, for high fitness variants across various tested thresholds for high fitness classification.
Abbreviations: bla, Beta-lactamase TEM; CALM1, Calmodulin-1; haeIIIM, Type II methyltransferase M.HaeIII; HRAS, GTPase HRas; MAPK1, Mitogen-activated protein kinase 1; TPMT, Thiopurine S-methyltransferase; TPK1, Thiamin pyrophosphokinase 1; UBI4, Polyubiquitin; UBE2I, SUMO-conjugating enzyme UBC9.
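
The ‘High Fitness Prediction Precision’ metric defined in panel (D) can be illustrated with a minimal sketch. The function names, the dictionary-based data layout, and the linear-interpolation percentile below are assumptions made for illustration only; they are not taken from the authors' code.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list.

    The interpolation rule is an assumption; the caption does not specify one.
    """
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def high_fitness_precision(model_scores, measured_activity, threshold_pct, top_k=10):
    """Fraction of the top_k model-ranked variants that are 'high fitness'.

    model_scores:      dict mapping variant name -> model score (higher is better)
    measured_activity: dict mapping variant name -> measured activity,
                       covering every experimentally screened variant
    threshold_pct:     percentile (0-100) of measured activity that a variant
                       must exceed to count as high fitness
    """
    if not model_scores:
        raise ValueError("model_scores must not be empty")
    cutoff = percentile(list(measured_activity.values()), threshold_pct)
    # Rank variants by model score and keep the top_k predictions.
    ranked = sorted(model_scores, key=model_scores.get, reverse=True)[:top_k]
    hits = [v for v in ranked if measured_activity[v] > cutoff]
    return len(hits) / len(ranked)
```

With the default `top_k=10` this matches the "top ten" wording of the caption; evaluating the same predictions at several values of `threshold_pct` corresponds to the multiple percentile thresholds shown as dashed lines.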

**Figure 2. Prediction of antibody-antigen complexes resolves mutational landscapes by implicitly learning features of binding and protein epistasis**
**(A)** Spearman correlation using the structure-informed language model as well as sequence-based modeling approaches ESM-1v (41), AbLang (47), and abYsis (48) reported for three antibodies screened with corresponding antigens. Bars are colored by the type of model used: SILM, Structure-informed Language Model (green); LM, Language Model (orange); and MSA, Multiple Sequence Alignment (purple). The structure-informed language model was evaluated in three different settings: (i) providing the entire antibody variable region and antigen complex (Ab-Ag), (ii) providing only the antibody variable region (Ab only), and (iii) providing only the single antibody variable region of the chain responsible for binding or being mutated (Ab VH only or Ab VH/VL only). Antibody sequences scored by the structure-informed language model with antigen information were computed using input complexes of CR9114 with H5 HA (PDB 4FQI (44)), CR6261 with H1 HA (PDB 3GBN (45)), and g6.31 with VEGF-A (PDB 2FJG (46)). **(B)** Scatter plots showing predictions against experimentally determined dissociation constants of CR6261 against HA-H1 (left) and HA-H9 (right). The germline and mature sequences are highlighted on all plots as indicated in the legend. For visualization, all scatter plots omit points on the lower limit of quantitation. **(C)** Conceptual illustration of protein language model performance with improved priors. Providing sequence and structural information of both the antibody and antigen enables the structure-informed language model to most efficiently enrich for high fitness antibody variants (top right, blue square) by identifying and guiding focused sequence exploration (green square) away from regimes of mutations destabilizing to the complex.
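
For readers unfamiliar with the Spearman correlation reported in panel (A), the following is a minimal pure-Python sketch: it ranks both lists (averaging ranks across ties) and then computes the Pearson correlation of the ranks. The function names are illustrative assumptions, and this is not the authors' evaluation code.

```python
def average_ranks(values):
    """Return 1-based ranks of values, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5
```

Because a lower dissociation constant means tighter binding, model scores would typically be compared against a quantity such as the negative log of the dissociation constant so that a positive correlation indicates agreement; that sign convention is an assumption here, not something stated in the caption.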

**Figure 3. Evolution of antibodies with a structure-informed language model improves neutralization potency and resilience**
**(A)** Each point represents the fold-change in IC₅₀ of pseudovirus neutralization for antibody variants with single amino acid mutations. Antibodies are tested against the viral strain represented in the input structure (LY-CoV1404, Wuhan; SA58, BA.1 Omicron). A dashed line is shown at a fold-change of 1, corresponding to no change. Improved antibody potency is defined as 1.1-fold or higher improvement in IC₅₀ compared to wild-type. **(B)** Conceptual representation of viral evolution. Selection for immune evasion drives antibody escape, which fundamentally represents a dynamic change in the underlying fitness landscape for the antibody. This antigenic drift displaces a potent antibody from a peak on the previous fitness landscape (left) to a new starting point at lower activity (right). **(C)** Strip plots visualizing antibody evolution across two rounds. Each point shows the corresponding fold-change in IC₅₀ of pseudovirus neutralization for a designed variant and is colored according to the number of mutations it has (1–8). Consistent with preserving backbone fold, all 55 designed variants across both antibody evolutionary campaigns could be expressed. All round 1 variants contain only single amino acid changes, while beneficial mutations are combined in round 2. All round 2 variants have improved neutralization activity compared to their respective wild-type antibody (dotted line). **(D)** Pseudovirus neutralization curves are shown for the most potent evolved antibody variant, consisting of mutations annotated to the left. The top LY-CoV1404 variant, bearing seven amino acid substitutions in VH, achieves a 25-fold improvement in neutralization against BQ.1.1 (top). The top SA58 variant, bearing single amino acid mutations in both VH and VL, achieves a 14-fold improvement in neutralization against BQ.1.1 (bottom).
**(E)** Residues at which mutations improve neutralization against either the structure-encoded strain, BQ.1.1, or both viral strains are highlighted with spheres for antibodies LY-CoV1404 (PDB 7MMO (50)) and SA58 (PDB 7Y0W (54)). Notably, beneficial mutations are identified both within the binding interface as well as distal to the antigen. Neutralization-enhancing mutations are labeled in Figure S10.
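
The potency criterion in panel (A) can be sketched as follows. The caption states only the 1.1-fold threshold; expressing fold improvement as wild-type IC₅₀ divided by variant IC₅₀ (so that values above 1 mean a more potent variant) is an assumption, as are the function names.

```python
def fold_improvement(ic50_wild_type, ic50_variant):
    """Fold improvement in potency, assuming lower IC50 means higher potency.

    Computed as wild-type IC50 / variant IC50; this ratio direction is an
    assumption, since the caption does not spell it out.
    """
    if ic50_wild_type <= 0 or ic50_variant <= 0:
        raise ValueError("IC50 values must be positive")
    return ic50_wild_type / ic50_variant


def is_improved(ic50_wild_type, ic50_variant, min_fold=1.1):
    """True if the variant meets the caption's 1.1-fold improvement threshold."""
    return fold_improvement(ic50_wild_type, ic50_variant) >= min_fold
```

Both IC₅₀ values must be in the same units and measured against the same viral strain for the ratio to be meaningful.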

**Figure 4. Antibodies evolved for high potency also exhibit improved affinity**
**(A)** LY-CoV1404 antibody variants show a Spearman correlation of 0.45 between apparent affinity fold-change and potency fold-change. Improved affinity is observed to be necessary but not sufficient for improved neutralization activity. Four variants exhibit improved affinity but do not enhance neutralization. All variants with improved neutralization also display improved affinity. The top LY-CoV1404 design with a 25-fold improvement in neutralization has a 9.5-fold improvement in affinity to BQ.1.1 RBD, as measured using BLI. **(C)** SA58 antibodies evolved for improved potency against BQ.1.1 also exhibit improved affinity against VOC XBB.1.5, up to 37-fold. **(B, D)** Representative traces of BLI binding assays for LY-CoV1404 and SA58 variants with improved affinity.
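
The "necessary but not sufficient" relationship described in panel (A) amounts to a set-containment check, sketched below under the assumption that variants are identified by name and have already been classified as improved or not in each assay. The function name and the variant labels in the usage note are hypothetical.

```python
def necessary_not_sufficient(improved_affinity, improved_neutralization):
    """Check whether improved affinity is necessary but not sufficient.

    Necessary:      every variant with improved neutralization also has
                    improved affinity (subset relation).
    Not sufficient: at least one variant has improved affinity without
                    improved neutralization.
    """
    affinity = set(improved_affinity)
    neutralization = set(improved_neutralization)
    necessary = neutralization <= affinity
    not_sufficient = bool(affinity - neutralization)
    return necessary and not_sufficient
```

For example, with hypothetical labels, `necessary_not_sufficient({"a", "b", "c"}, {"a", "b"})` returns `True` because variant `"c"` binds more tightly without neutralizing better.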

Update of

Inverse folding of protein complexes with a structure-informed language model enables unsupervised antibody evolution.
Shanker VR, Bruun TUJ, Hie BL, Kim PS. bioRxiv [Preprint]. 2023 Dec 21:2023.12.19.572475. doi: 10.1101/2023.12.19.572475. Update in: Science. 2024 Jul 5;385(6704):46-53. doi: 10.1126/science.adk8946. PMID: 38187780. Free PMC article. Preprint.

Comment in

AI reverse-engineers antibodies.
Kingwell K. Nat Rev Drug Discov. 2024 Sep;23(9):659. doi: 10.1038/d41573-024-00124-1. PMID: 39043932. No abstract available.

Cited by

Protein A-like Peptide Design Based on Diffusion and ESM2 Models.
Zhao L, He Q, Song H, Zhou T, Luo A, Wen Z, Wang T, Lin X. Molecules. 2024 Oct 21;29(20):4965. doi: 10.3390/molecules29204965. PMID: 39459333. Free PMC article.
De Novo Design of Large Polypeptides Using a Lightweight Diffusion Model Integrating LSTM and Attention Mechanism Under Per-Residue Secondary Structure Constraints.
Liao S, Xu G, Jin L, Ma J. Molecules. 2025 Feb 28;30(5):1116. doi: 10.3390/molecules30051116. PMID: 40076339. Free PMC article.
AlphaBind, a domain-specific model to predict and optimize antibody-antigen binding affinity.
Agarwal AA, Harrang J, Noble D, McGowan KL, Lange AW, Engelhart E, Lahman MC, Adamo J, Yu X, Serang O, Minch KJ, Wellman KY, Younger DA, Lopez RM, Emerson RO. MAbs. 2025 Dec;17(1):2534626. doi: 10.1080/19420862.2025.2534626. Epub 2025 Jul 22. PMID: 40693434. Free PMC article.
Leveraging large language models to predict antibody biological activity against influenza A hemagglutinin.
Barkan E, Siddiqui I, Cheng KJ, Golts A, Shoshan Y, Weber JK, Campos Mota Y, Ozery-Flato M, Sautto GA. Comput Struct Biotechnol J. 2025 Mar 24;27:1286-1295. doi: 10.1016/j.csbj.2025.03.038. eCollection 2025. PMID: 40230408. Free PMC article.
The Nobel Prize in Chemistry: past, present, and future of AI in biology.
Abriata LA. Commun Biol. 2024 Oct 29;7(1):1409. doi: 10.1038/s42003-024-07113-5. PMID: 39472680. Free PMC article.

References

1. Chothia C, Lesk AM, The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826 (1986).
2. Bloom JD, Labthavikul ST, Otey CR, Arnold FH, Protein stability promotes evolvability. Proc. Natl. Acad. Sci. 103, 5869–5874 (2006).
3. Axe DD, Foster NW, Fersht AR, A Search for Single Substitutions That Eliminate Enzymatic Function in a Bacterial Ribonuclease. Biochemistry 37, 7157–7166 (1998).
4. Romero PA, Arnold FH, Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol. 10, 866–876 (2009).
5. Shafikhani S, Siegel RA, Ferrari E, Schellenberger V, Generation of large libraries of random mutants in Bacillus subtilis by PCR-based plasmid multimerization. BioTechniques 23, 304–310 (1997).
