Immunity. 2024 Oct 8;57(10):2453-2465.e7.
doi: 10.1016/j.immuni.2024.07.022. Epub 2024 Aug 19.

An explainable language model for antibody specificity prediction using curated influenza hemagglutinin antibodies

Yiquan Wang et al. Immunity.

Abstract

Despite decades of antibody research, it remains challenging to predict the specificity of an antibody solely based on its sequence. Two major obstacles are the lack of appropriate models and the inaccessibility of datasets for model training. In this study, we curated >5,000 influenza hemagglutinin (HA) antibodies by mining research publications and patents, which revealed many distinct sequence features between antibodies to HA head and stem domains. We then leveraged this dataset to develop a lightweight memory B cell language model (mBLM) for sequence-based antibody specificity prediction. Model explainability analysis showed that mBLM could identify key sequence features of HA stem antibodies. Additionally, by applying mBLM to HA antibodies with unknown epitopes, we discovered and experimentally validated many HA stem antibodies. Overall, this study not only advances our molecular understanding of the antibody response to the influenza virus but also provides a valuable resource for applying deep learning to antibody research.

Keywords: antibody; data mining; deep learning; hemagglutinin; influenza virus; language model; somatic hypermutations.

Conflict of interest statement

Declaration of interests: N.C.W. consults for HeliXon.

Figures

Figure 1. Germline gene usages in influenza HA antibodies.
(A) The IGHV gene usage, (B) IGK(L)V gene usage, (C) IGHD gene usage, (D) IGHJ gene usage, and (E) IGK(L)V gene usage in antibodies to HA head domain (orange) and HA stem domain (blue). For comparison, germline gene usages of all antibodies from GenBank are also shown (green). To avoid being confounded by B-cell clonal expansion, a single clonotype from the same donor is considered as one antibody (see STAR Methods). Error bars represent the standard deviation computed from binomial distribution.
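The Figure 1 caption states that the error bars are standard deviations computed from the binomial distribution. A minimal sketch of one common way to do this is given below. It assumes each gene-usage frequency is treated as a binomial proportion, so the standard deviation of the observed fraction is sqrt(p(1-p)/n). The function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binomial_sd_of_fraction(count: int, total: int) -> float:
    """Standard deviation of the observed fraction count/total,
    assuming the count follows a binomial distribution (an assumption,
    not necessarily the exact procedure used by the authors)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= count <= total:
        raise ValueError("count must lie between 0 and total")
    p = count / total
    return math.sqrt(p * (1.0 - p) / total)


# Hypothetical example: 50 of 100 antibodies use a given germline gene.
sd = binomial_sd_of_fraction(50, 100)
print(f"fraction = 0.50, error bar = +/- {sd:.3f}")
```

Under this reading, the error bar shrinks as the number of antibodies grows and vanishes when a gene is used by none or all of them.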
Figure 2. Hydrophobicity of CDR H3 sequences.
(A-B) The hydrophobicity scores of (A) CDR H3 and (B) CDR H3 tip, as well as (C) the CDR H3 length are compared between antibodies to HA head and HA stem domains. The p-values were computed by two-tailed Student’s t-tests. For the boxplot, the middle horizontal line represents the median. The lower and upper hinges represent the first and third quartiles, respectively. The upper whisker extends to the highest data point within 1.5x inter-quartile range (IQR) of the third quartile, whereas the lower whisker extends to the lowest data point within 1.5x IQR of the first quartile. Each data point represents one antibody. The horizontal dotted line indicates the mean among antibodies from GenBank.
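The whisker rule described in the Figure 2 caption can be sketched as follows. This is an illustration only: quartile conventions differ between plotting libraries, and the "inclusive" method of Python's `statistics.quantiles` is an assumption here, as are the function name and the sample values.

```python
import statistics


def boxplot_whiskers(values):
    """Return (lower_whisker, upper_whisker): the most extreme data points
    that lie within 1.5x IQR of the first and third quartiles."""
    data = sorted(values)
    # Quartile convention is an assumption; other tools may interpolate differently.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    lower_whisker = min(v for v in data if v >= lower_fence)
    upper_whisker = max(v for v in data if v <= upper_fence)
    return lower_whisker, upper_whisker


# Hypothetical scores with one outlier (100) that falls beyond the upper fence.
print(boxplot_whiskers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Points outside the whiskers, such as the value 100 in this example, are the ones a boxplot draws as individual outliers.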
Figure 3. Antibody specificity prediction by memory B cell language model (mBLM).
(A) The model architecture of mBLM is shown. Arrows indicate the information flow in the network from the language model to antibody specificity prediction, with a final output of specificity class probability. Resi Rep: residue-level representation (i.e., the final-layer embeddings from the pre-trained mBLM). (B) The performance of different antibody specificity prediction models was evaluated by F1 score, computed as the arithmetic mean over all classes of the harmonic mean of precision and recall. Error bars represent the standard deviation across 15-fold cross-validation. KNN: a baseline model using the k-nearest neighbors algorithm. ESM2-Ab: the pretrained protein language model ESM2, fine-tuned in the same way as mBLM for antibody specificity prediction. (C) The performance of mBLM on the test set was evaluated by a normalized confusion matrix.
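The F1 score in the Figure 3 caption is described as an arithmetic mean of harmonic means of precision and recall. One reading of this is a macro-averaged F1, sketched below in plain Python. The function name and the toy labels are hypothetical, and the paper's exact averaging may differ.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class harmonic mean of precision and recall,
    then an unweighted arithmetic mean across classes."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall == 0:
            per_class.append(0.0)
        else:
            per_class.append(2 * precision * recall / (precision + recall))
    return sum(per_class) / len(per_class)


# Toy example with two specificity classes (labels are illustrative only).
truth = ["stem", "stem", "head", "head"]
preds = ["stem", "head", "head", "head"]
print(round(macro_f1(truth, preds), 4))
```

Macro averaging weights every class equally, so a rare specificity class affects the score as much as a common one.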
Figure 4. Explanation of mBLM using saliency score.
(A) The saliency score for each residue in individual HA stem antibodies is shown as a heatmap. Each row represents a single HA stem antibody. The x-axis represents the amino acid residues of the heavy chain. Regions corresponding to CDR H1, H2, H3, and the DE loop are indicated. For visualization purposes, only 50 HA stem antibodies are shown. Six clusters of HA stem antibodies were identified using hierarchical clustering with Ward’s method. (B) IGHD gene usage among antibodies in cluster 3 is shown. (C) The saliency score of each CDR H3 residue in IGHD3–9 antibodies within cluster 3 was analyzed. The frequency of each amino acid for residues with a saliency score >0.5 is shown as a sequence logo. Arrows at the bottom indicate the residues of interest. (D) Saliency scores are projected onto the structures of four antibodies in cluster 3, namely 39.29 (PDB 4KVN), 31.a.83 (PDB 5KAQ), PN-SIA28 (PDB 8GV6), and FI6v3 (PDB 3ZTJ). The color scheme is the same as that in panel A. (E) The relationship between saliency score and distance to the antigen (i.e., HA stem) is shown as a scatter plot. Spearman’s rank correlation coefficient (ρ) is indicated. A total of 18 structures of HA stem antibodies in complex with HA were analyzed (PDB 3FKU, 3GBN, 3SDY, 3ZTJ, 4FQI, 4KVN, 4NM8, 4R8W, 5JW3, 5KAN, 5KAQ, 5K9K, 5K9O, 5K9Q, 5WKO, 6E3H, 6NZ7, and 8GV6).
Figure 5. Sequence determinants for the HA-stem binding activity of C1–3.7F02.
(A) The saliency score of each residue of C1–3.7F02 is shown as a bar chart. Residues that represent somatic hypermutations are colored in red. Residues of interest are labeled. (B) The binding affinity of C1–3.7F02 WT (black), N58S mutant (red), and W100aA mutant (blue) IgGs against H3 mini-HA was measured by ELISA. Their EC50 values are indicated. 3A10 is an influenza neuraminidase antibody and serves as a negative control here. (C) Binding kinetics of different Fabs against recombinant H3 mini-HA were measured by biolayer interferometry (BLI). The y-axis represents the response. Blue lines represent the response curves, and red lines represent a 1:1 binding model. Binding kinetics were measured for four concentrations of Fab at 3-fold dilution ranging from 300 nM to 33 nM. The dissociation constant (KD) and the goodness of model fitting (R²) are indicated. (D) Sequence alignment of the IGHV1–46 germline sequence with the heavy chain sequences of C1–3.7F02 and HMCON10 was performed using MAFFT. Residues that represent somatic hypermutations in the V gene are colored in red.
Figure 6. Sequence determinants for the HA-stem binding activity of 013–10 3F02.
(A) The saliency score of each residue of 013–10 3F02 is shown as a bar chart. Residues that represent somatic hypermutations are colored in red. Residue 56 is labeled. (B) The binding affinity of 013–10 3F02 WT (black) and K56N mutant (red) IgGs against H1 mini-HA was measured by ELISA. Their EC50 values are indicated. 3A10 is an influenza neuraminidase antibody and serves as a negative control here. (C) Sequence alignment of the IGHV3–30 germline sequence with the heavy chain sequences of 013–10 3F02, 3I14, FI3082, 310–18C10, and 81.39 was performed using MAFFT. Residues that represent somatic hypermutations in the V gene are colored in red.
Figure 7. Discovery of HA stem antibody by mBLM.
(A-B) mBLM was applied to predict the specificity of (A) 60 antibodies to the central stem epitope (left panel) and 38 to the anchor stem epitope (right panel) that have been reported, as well as (B) 4,453 HA antibodies with unknown epitopes (HA unk) in the dataset that we assembled. The fraction of antibodies that were predicted to bind to the HA stem domain (Predicted as HA stem), the HA head domain (Predicted as HA head), or other antigens (Not predicted as HA) is shown. (C) Using ELISA, the binding of 30 HA unk antibodies that were predicted as HA stem antibodies was tested against H1 mini-HA and H3 mini-HA, both of which were HA stem-only constructs. The confidence score of each of these antibodies as an HA stem antibody, as well as their sequence divergence from the most similar antibodies in the training set (min dist to training set), are shown as heatmaps. Four known HA stem antibodies (051–09 5A02, 051–09 5E03, 310–18C3, and FI6v3) were included as positive controls. D2 H1–1/H3–1, which is a known HA head antibody, was included as a negative control. In this binding experiment, antibodies were not purified from the supernatant, and thus their concentrations were unknown.
