Science. 2024 Jul 5;385(6704):46-53.
doi: 10.1126/science.adk8946. Epub 2024 Jul 4.

Unsupervised evolution of protein and antibody complexes with a structure-informed language model

Varun R Shanker et al. Science. 2024.

Abstract

Large language models trained on sequence information alone can learn high-level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here, we show that a general protein language model augmented with protein structure backbone coordinates can guide evolution for diverse proteins without the need to model individual functional tasks. We also demonstrate that ESM-IF1, which was only trained on single-chain structures, can be extended to engineer protein complexes. Using this approach, we screened about 30 variants of two therapeutic clinical antibodies used to treat severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. We achieved up to 25-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants of concern BQ.1.1 and XBB.1.5, respectively. These findings highlight the advantage of integrating structural information to identify efficient protein evolution trajectories without requiring any task-specific training data.

Conflict of interest statement

Competing interests

V.R.S., B.L.H., and P.S.K. are named as inventors on a patent application filed by Stanford University and the Chan Zuckerberg Biohub entitled “Antibody Compositions and Optimization Methods.” B.L.H. acknowledges outside interest in Prox Biosciences as a scientific co-founder.

Figures

Figure 1. Guiding evolution of diverse proteins with a structure-guided language model
(A) The sequence design problem is the prediction of a protein amino acid sequence that will adopt the fold of a given three-dimensional backbone structure; it is conceptually the inverse of the problem solved by structure prediction tools like AlphaFold (12).
(B) A hybrid autoregressive model (11) integrates amino acid values and backbone structural information to evaluate the joint likelihood over all positions in a sequence. Amino acids from the protein sequence are tokenized (red), combined with geometric features extracted from a structural encoder (green), and modeled with an encoder-decoder transformer (purple).
(C) Our structure-guided framework for protein design indirectly explores the underlying fitness landscape, without modeling a specific definition of fitness or requiring any task-specific training data, by constraining the search space to regions where the backbone fold is preserved.
(D) High fitness sensitivity analysis reveals that multimodal input improves language model performance compared to sequence-only input across 10 proteins from diverse protein families (left). ‘High Fitness Prediction Precision’ is the fraction of the top ten single amino acid substitution predictions that are experimentally determined to confer high protein fitness, defined as having an activity level above the specified percentile threshold among all experimentally screened variants. A representative plot (right) demonstrates this metric for assessing enrichment of high-fitness MAPK1 mutations. Given the vastness of the search space, finding any function-enhancing variant is valuable in most practical settings; thus, only successfully predicted mutations are highlighted (blue) on the empirical cumulative distribution function (ECDF) of the experimental data (black). The three thresholds, defined by percentiles, are shown as dashed lines. Structure-informed language model predictions are more enriched, on average, for high fitness variants across the tested thresholds for high fitness classification.
Abbreviations: bla, Beta-lactamase TEM; CALM1, Calmodulin-1; haeIIIM, Type II methyltransferase M.HaeIII; HRAS, GTPase HRas; MAPK1, Mitogen-activated protein kinase; TMPT, Thiopurine S-methyltransferase; TPK1, Thiamin pyrophosphokinase 1; UBI4, Polyubiquitin; UBE2I, SUMO-conjugating enzyme UBC9.
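The ‘High Fitness Prediction Precision’ metric defined in panel (D) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the dictionary inputs keyed by variant label, and the linear-interpolation percentile are all hypothetical choices made for this example.

```python
def percentile_threshold(values, pct):
    """Value at the given percentile (0-100), using linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def high_fitness_precision(model_scores, measured_fitness, pct, top_k=10):
    """Fraction of the top_k model-ranked variants whose measured fitness
    lies above the pct-th percentile of all experimentally screened variants."""
    threshold = percentile_threshold(list(measured_fitness.values()), pct)
    # Only variants that were both scored and measured can be evaluated.
    candidates = [v for v in model_scores if v in measured_fitness]
    top = sorted(candidates, key=lambda v: model_scores[v], reverse=True)[:top_k]
    if not top:
        return 0.0
    hits = sum(1 for v in top if measured_fitness[v] > threshold)
    return hits / len(top)
```

Calling `high_fitness_precision(scores, fitness, pct=90)` with two dictionaries keyed by substitution label returns a value between 0 and 1; repeating the call at several percentiles mirrors the multiple thresholds shown as dashed lines.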
Figure 2. Prediction of antibody-antigen complexes resolves mutational landscapes by implicitly learning features of binding and protein epistasis
(A) Spearman correlations for the structure-informed language model and for the sequence-based modeling approaches ESM-1v (41), AbLang (47), and abYsis (48), reported for three antibodies screened against their corresponding antigens. Bars are colored by the type of model used: SILM, Structure-informed Language Model (green); LM, Language Model (orange); and MSA, Multiple Sequence Alignment (purple). The structure-informed language model was evaluated in three settings: (i) providing the entire antibody variable region and antigen complex (Ab-Ag), (ii) providing only the antibody variable region (Ab only), and (iii) providing only the single antibody variable region of the chain responsible for binding or being mutated (Ab VH only or Ab VH/VL only). Antibody sequences scored by the structure-informed language model with antigen information were computed using input complexes of CR9114 with H5 HA (PDB 4FQI (44)), CR6261 with H1 HA (PDB 3GBN (45)), and g6.31 with VEGF-A (PDB 2FJG (46)).
(B) Scatter plots showing predictions against experimentally determined dissociation constants of CR6261 against HA-H1 (left) and HA-H9 (right). The germline and mature sequences are highlighted on all plots as indicated in the legend. For visualization, all scatter plots omit points at the lower limit of quantitation.
(C) Conceptual illustration of protein language model performance with improved priors. Providing sequence and structural information for both the antibody and antigen enables the structure-informed language model to most efficiently enrich for high fitness antibody variants (top right, blue square) by identifying and guiding focused sequence exploration (green square) away from regimes of mutations that destabilize the complex.
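The Spearman correlation reported in panel (A) is the Pearson correlation of the ranks of two paired lists. The sketch below is a dependency-free illustration of that statistic; the function names are hypothetical and the inputs (model scores and a measured binding quantity per variant) are assumed to be pre-aligned lists. How dissociation constants are transformed before ranking (for example, negating log K_D so that larger means tighter binding) is an assumption the caller has to make.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(model_scores, measurements):
    """Spearman rank correlation between paired model scores and measurements."""
    if len(model_scores) != len(measurements) or len(model_scores) < 2:
        raise ValueError("need two paired lists with at least two entries")
    return pearson(average_ranks(model_scores), average_ranks(measurements))
```

Because only ranks enter the calculation, the result is unchanged by any monotonic rescaling of the model scores, which is why likelihood-style scores can be compared directly with binding measurements.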
Figure 3. Evolution of antibodies with a structure-informed language model improves neutralization potency and resilience
(A) Each point represents the fold-change in IC50 of pseudovirus neutralization for antibody variants with single amino acid mutations. Antibodies are tested against the viral strain represented in the input structure (LY-CoV1404-Wuhan, SA58-BA.1 Omicron). A dashed line at a fold-change of 1 corresponds to no change. Improved antibody potency is defined as a 1.1-fold or higher improvement in IC50 compared to wild-type.
(B) Conceptual representation of viral evolution. Selection for immune evasion drives antibody escape, which fundamentally represents a dynamic change in the underlying fitness landscape for the antibody. This antigenic drift displaces a potent antibody from a peak on the previous fitness landscape (left) to a new starting point at lower activity (right).
(C) Strip plots visualizing antibody evolution across two rounds. Each point shows the fold-change in IC50 of pseudovirus neutralization for a designed variant and is colored according to the number of mutations it carries (1–8). Consistent with preserving the backbone fold, all 55 designed variants across both antibody evolutionary campaigns could be expressed. All round 1 variants consist of single amino acid changes, whereas beneficial mutations are combined in round 2. All round 2 variants have improved neutralization activity compared to their respective wild-type antibody (dotted line).
(D) Pseudovirus neutralization curves are shown for the most potent evolved antibody variant, consisting of the mutations annotated to the left. The top LY-CoV1404 variant, bearing seven amino acid substitutions in VH, achieves a 25-fold improvement in neutralization against BQ.1.1 (top). The top SA58 variant, bearing single amino acid mutations in both VH and VL, achieves a 14-fold improvement in neutralization against BQ.1.1 (bottom).
(E) Residues at which mutations improve neutralization against either the structure-encoded strain, BQ.1.1, or both viral strains are highlighted with spheres for antibodies LY-CoV1404 (PDB 7MMO (50)) and SA58 (PDB 7Y0W (54)). Notably, beneficial mutations are identified both within the binding interface and distal to the antigen. Neutralization-enhancing mutations are labeled in Figure S10.
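The potency criterion in panel (A) can be written out as simple arithmetic. The sketch below assumes that fold improvement is the wild-type IC50 divided by the variant IC50 (a lower IC50 means a more potent antibody); that convention, the function names, and the example values are illustrative assumptions, while the 1.1-fold cutoff comes from the caption.

```python
IMPROVEMENT_CUTOFF = 1.1  # "1.1-fold or higher improvement", per the caption


def fold_improvement(ic50_wild_type, ic50_variant):
    """Fold improvement in potency, assumed here to be wild-type IC50 / variant IC50."""
    if ic50_wild_type <= 0 or ic50_variant <= 0:
        raise ValueError("IC50 values must be positive")
    return ic50_wild_type / ic50_variant


def improved_variants(ic50_wild_type, ic50_by_variant, cutoff=IMPROVEMENT_CUTOFF):
    """Map each variant that meets the cutoff to its fold improvement."""
    folds = {
        name: fold_improvement(ic50_wild_type, ic50)
        for name, ic50 in ic50_by_variant.items()
    }
    return {name: fold for name, fold in folds.items() if fold >= cutoff}


# Hypothetical IC50 values in arbitrary units, for illustration only.
example = improved_variants(100.0, {"a": 50.0, "b": 100.0, "c": 4.0, "d": 200.0})
```

Under this convention a variant whose IC50 falls from 100 to 4 units is a 25-fold improvement, and a variant with an unchanged IC50 has a fold-change of 1, matching the dashed reference line.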
Figure 4. Antibodies evolved for high potency also exhibit improved affinity
(A) LY-CoV1404 antibody variants show a Spearman correlation of 0.45 between apparent affinity fold-change and potency fold-change. Improved affinity is observed to be necessary but not sufficient for improved neutralization activity. Four variants exhibit improved affinity but do not enhance neutralization. All variants with improved neutralization also display improved affinity. The top LY-CoV1404 design with a 25-fold improvement in neutralization has a 9.5-fold improvement in affinity to BQ.1.1 RBD, as measured using BLI.
(C) SA58 antibodies evolved for improved potency against BQ.1.1 also exhibit improved affinity against VOC XBB.1.5, up to 37-fold.
(B, D) Representative traces of BLI binding assays for LY-CoV1404 and SA58 variants with improved affinity.
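The "necessary but not sufficient" statement in panel (A) amounts to a subset relationship between two groups of variants, which the following sketch makes explicit. The function name, the dictionary inputs, and the affinity cutoff of 1.0 are assumptions made for illustration; only the 1.1-fold potency cutoff is taken from the Figure 3 caption.

```python
def necessity_and_sufficiency(affinity_fold, potency_fold,
                              affinity_cutoff=1.0, potency_cutoff=1.1):
    """Compare the sets of variants with improved affinity and improved potency.

    affinity_cutoff is an assumed threshold; potency_cutoff follows Figure 3.
    """
    shared = affinity_fold.keys() & potency_fold.keys()
    better_affinity = {v for v in shared if affinity_fold[v] > affinity_cutoff}
    better_potency = {v for v in shared if potency_fold[v] >= potency_cutoff}
    return {
        # Necessary: every potency-improved variant also has improved affinity.
        "necessary": better_potency <= better_affinity,
        # Sufficient: every affinity-improved variant also has improved potency.
        "sufficient": better_affinity <= better_potency,
        # Variants that bind better without neutralizing better.
        "affinity_only": sorted(better_affinity - better_potency),
    }
```

A result with `necessary` true, `sufficient` false, and a non-empty `affinity_only` list corresponds to the pattern the caption describes.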

Comment in

  • AI reverse-engineers antibodies.
    Kingwell K. Nat Rev Drug Discov. 2024 Sep;23(9):659. doi: 10.1038/d41573-024-00124-1. PMID: 39043932. No abstract available.
