Cell Syst. 2023 Nov 15;14(11):979-989.e4.
doi: 10.1016/j.cels.2023.10.001. Epub 2023 Oct 30.

IgLM: Infilling language modeling for antibody sequence design

Richard W Shuai et al. Cell Syst.

Abstract

Discovery and optimization of monoclonal antibodies for therapeutic applications relies on large sequence libraries but is hindered by developability issues such as low solubility, high aggregation, and high immunogenicity. Generative language models, trained on millions of protein sequences, are a powerful tool for the on-demand generation of realistic, diverse sequences. We present the Immunoglobulin Language Model (IgLM), a deep generative language model for creating synthetic antibody libraries. Compared with prior methods that leverage unidirectional context for sequence generation, IgLM formulates antibody design based on text-infilling in natural language, allowing it to re-design variable-length spans within antibody sequences using bidirectional context. We trained IgLM on 558 million (M) antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species of origin. We demonstrate that IgLM can generate full-length antibody sequences from a variety of species and its infilling formulation allows it to generate infilled complementarity-determining region (CDR) loop libraries with improved in silico developability profiles. A record of this paper's transparent peer review process is included in the supplemental information.

Keywords: antibodies; deep learning; language modeling.
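
The infilling formulation described in the abstract can be illustrated with a minimal sketch. It assumes a span is cut out of the sequence, replaced by a mask token, and appended after a separator so that a left-to-right model sees the residues on both sides of the span before generating it. The token names (`[MASK]`, `[SEP]`, `[ANS]`), the tag strings, and both function names are hypothetical and are not taken from the paper or its released code.

```python
# Hypothetical special tokens; the model's actual vocabulary may differ.
MASK, SEP, END = "[MASK]", "[SEP]", "[ANS]"


def make_infilling_example(seq, start, end, chain_tag, species_tag):
    """Reorder a sequence so that the span seq[start:end] comes last.

    The result is: conditioning tags, prefix, mask, suffix, separator, span, end.
    An autoregressive model trained on this layout generates the span while
    conditioning on context from both sides of it.
    """
    if not (0 <= start < end <= len(seq)):
        raise ValueError("span must satisfy 0 <= start < end <= len(seq)")
    prefix, span, suffix = seq[:start], seq[start:end], seq[end:]
    return [chain_tag, species_tag, *prefix, MASK, *suffix, SEP, *span, END]


def restore_sequence(tokens):
    """Invert the reordering by placing the span back at the mask position."""
    body = tokens[2:]                      # drop the two conditioning tags
    sep = body.index(SEP)
    context, span = body[:sep], body[sep + 1:-1]
    mask = context.index(MASK)
    return "".join(context[:mask] + span + context[mask + 1:])


# Illustrative input only: a short made-up fragment, not a real antibody.
tokens = make_infilling_example("EVQLVESGG", 3, 6, "[HEAVY]", "[HUMAN]")
print(tokens)
print(restore_sequence(tokens))
```

Because the span is moved to the end rather than generated in place, its length does not have to be fixed in advance: generation simply continues until the end token is produced.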

Conflict of interest statement

Declaration of interests: R.W.S., J.A.R., and J.J.G. are inventors of the IgLM technology developed in this study. The Johns Hopkins University has filed international patent application PCT/US2022/052178, "Generative Language Models and Related Aspects for Peptide and Protein Sequence Design," which relates to the IgLM technology. R.W.S., J.A.R., and J.J.G. may be entitled to a portion of revenue received from commercial licensing of the IgLM technology and any intellectual property therein. J.J.G. is an unpaid member of the Executive Board of the Rosetta Commons. Under an institutional participation agreement between the University of Washington, acting on behalf of the Rosetta Commons, and the Johns Hopkins University (JHU), JHU may be entitled to a portion of revenue received on licensing of Rosetta software used in this paper. J.J.G. has a financial interest in Cyrus Biotechnology. Cyrus Biotechnology distributes the Rosetta software, which may include methods used in this paper. These arrangements have been reviewed and approved by the Johns Hopkins University in accordance with its conflict-of-interest policies.

Figures

Figure 1
Overview of the IgLM model for antibody sequence generation. (A) IgLM is trained by autoregressive language modeling of reordered antibody sequence segments, conditioned on chain and species identifier tags. (B) Distribution of sequences in the clustered OAS dataset for various species and chain types. (C) Effect of increased sampling temperature for full-length generation. Structures at each temperature are predicted by AlphaFold-Multimer and colored by prediction confidence (pLDDT), with blue being the most confident and orange being the least [n = 170]. (D) CDR loop infilling perplexity for IgLM and ProGen2 models on a held-out test dataset of 30M sequences. IgLM models are evaluated with bidirectional infilling context ([bi]) and preceding context only ([pre]). Confidence intervals calculated from bootstrapping (100 samples) had a width of less than 0.01 and are therefore not shown.
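
As a reminder of what the perplexity in panel (D) measures, the sketch below uses the standard definition: the exponential of the mean negative log-probability the model assigns to each token. Restricting the input to the tokens of the infilled span, and the function name itself, are assumptions made for illustration; the exact normalization used in the paper is not stated in this caption.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Lower values mean the model found the tokens less surprising.
    A model that is uniformly unsure among k options scores exactly k.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up log-probabilities for a four-residue span (illustrative only).
print(perplexity([math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]))
```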
Figure 2
Controllable antibody sequence generation. (A) Diagram of procedure for generating full-length antibody sequences given a desired species and chain type with IgLM. (B) Length of generated heavy and light chains with and without the initial three residues provided (prompting). (C-E) Analysis of full-length generated sequences under different conditioning settings [n = 220,000]. (C) Adherence of generated sequences to species conditioning tags. Each plot shows the species classifications of antibody sequences generated with a particular species conditioning tag (indicated above plots). Solid and dashed lines correspond to sequences generated with heavy- and light-chain conditioning, respectively. (D) Adherence of generated sequences to chain conditioning tags. Top plot shows the percentage of heavy-chain-conditioned sequences classified as heavy chains, for each species conditioning tag. Lower plots show the percentage of light-chain-conditioned sequences, further divided by whether initial residues were characteristic of lambda or kappa chains, classified as lambda or kappa chains. (E) Effect of sampling temperature on germline identity for generated heavy- and light-chain sequences. As sampling temperature increases, generated sequences diverge from the closest germline V- and J-gene sequences.
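
The sampling temperature in panel (E), and the nucleus probability that appears alongside it in the Figure 3 and Figure 4 captions, are generic decoding controls for language models. The sketch below shows one common way they are applied to a single next-token draw; it is not the authors' implementation, and the function name, argument names, and example logits are all hypothetical.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index using temperature scaling and nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")

    # Temperature: dividing logits by T > 1 flattens the distribution,
    # T < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus (top-p): keep the most probable tokens until their cumulative
    # probability reaches top_p, and discard the rest.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Illustrative logits over a three-token vocabulary.
print(sample_token([2.0, 1.0, 0.1], temperature=1.2, top_p=0.9))
```

Raising the temperature or the nucleus probability makes low-probability tokens more likely to be drawn, which is consistent with the caption's observation that higher temperatures move sequences away from germline.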
Figure 3
Generation of infilled therapeutic antibody libraries. (A) Diagram of procedure for generating diverse antibody libraries by infilling the CDR H3 loops of therapeutic antibodies. (B) Distribution of infilled CDR H3 loop lengths for 49 therapeutic antibodies. Parent CDR H3 lengths are indicated in red. (C) Relationship between sampling temperature (T) and nucleus probability (P) and length of infilled CDR H3 loops [n = 432,763]. (D) Infilled CDR H3 loops for the therapeutic antibody trastuzumab adopt diverse lengths and conformations. Structures for infilled variants are predicted with IgFold [n = 432,763]. (E) Distribution of infilled CDR H3 loop lengths for therapeutic antibodies grouped by nearest germline gene groups [n = 432,763]. (F-G) Effect of sampling temperature (T) and nucleus probability (P) on diversity of infilled CDR H3 loops for lengths between 10 and 18 residues [n = 432,763]. Pairwise edit distance measures the minimum number of edits between each infilled loop and another in the same set of generated sequences (i.e., within the set of sequences produced with the same T and P parameters). For both parameters, less restrictive sampling produces greater infilled loop diversity.
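
The pairwise edit distance used in panels (F-G) can be sketched as follows, assuming the usual Levenshtein distance (insertions, deletions, and substitutions each cost one edit) averaged over all pairs in a set. The averaging step and both function names are assumptions for illustration; the caption does not specify how distances are aggregated, and the example loops are made up.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))           # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),    # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def mean_pairwise_edit_distance(loops):
    """Average edit distance over every unordered pair of loops in one set."""
    pairs = list(combinations(loops, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


# Made-up loop sequences, standing in for one (T, P) sampling setting.
print(mean_pairwise_edit_distance(["ARDY", "ARDW", "ARGDY"]))
```

Comparing loops only within the same sampling setting, as the caption describes, keeps the diversity measure from being inflated by differences between settings.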
Figure 4
Therapeutic properties of infilled antibody libraries. Asterisks indicate statistical significance (p < 0.001) from a one-sample t-test (A, B, D) or a two-sample t-test (E). (A) Change in predicted aggregation propensity of infilled sequences relative to their parent antibodies. Infilled sequences display reduced aggregation propensity (negative is improved), particularly for shorter loops [n = 432,763]. (B) Change in predicted solubility of infilled sequences relative to their parent antibodies. Infilled sequences display increased solubility (positive is improved) [n = 432,763]. (C) Relationship between predicted changes in aggregation propensity and solubility for infilled sequence libraries [n = 432,763]. (D) Change in humanness of infilled sequences relative to their parent antibodies. Humanness is calculated as the OASis identity of the heavy-chain sequence, with larger values being more human-like [n = 432,763]. (E) Relationship between sampling temperature (T) and nucleus probability (P) and change in human-likeness (OASis identity) of infilled heavy chains relative to their parent sequences [n = 432,763]. (F-H) Comparison of developability for infilled libraries generated using alternative language models, for loops with lengths between 6 and 17 residues [n = 1,709,696]. (F) Change in predicted aggregation propensity for infilling methods. (G) Change in predicted solubility for infilling methods. (H) Change in humanness for infilling methods. (I) Receiver operating characteristic (ROC) curves for human sequence classification methods [n = 487]. The area under the curve (AUC) is shown for each method.

Comment in

  • Becoming fluent in proteins.
    Leem J, Galson JD. Cell Syst. 2023 Nov 15;14(11):923-924. doi: 10.1016/j.cels.2023.10.008. PMID: 37972558
