[Preprint]. 2023 Sep 14:2023.09.11.557288.
doi: 10.1101/2023.09.11.557288.

An explainable language model for antibody specificity prediction using curated influenza hemagglutinin antibodies

Yiquan Wang et al. bioRxiv.

Abstract

Despite decades of antibody research, it remains challenging to predict the specificity of an antibody solely based on its sequence. Two major obstacles are the lack of appropriate models and inaccessibility of datasets for model training. In this study, we curated a dataset of >5,000 influenza hemagglutinin (HA) antibodies by mining research publications and patents, which revealed many distinct sequence features between antibodies to HA head and stem domains. We then leveraged this dataset to develop a lightweight memory B cell language model (mBLM) for sequence-based antibody specificity prediction. Model explainability analysis showed that mBLM captured key sequence motifs of HA stem antibodies. Additionally, by applying mBLM to HA antibodies with unknown epitopes, we discovered and experimentally validated many HA stem antibodies. Overall, this study not only advances our molecular understanding of antibody response to influenza virus, but also provides an invaluable resource for applying deep learning to antibody research.

Conflict of interest statement

Declaration of interests: N.C.W. consults for HeliXon. The authors declare no other competing interests.

Figures

Figure 1. Germline gene usages in influenza HA antibodies.
(A) The IGHV gene usage, (B) IGK(L)V gene usage, and (C) IGHD gene usage in antibodies to HA head domain (orange) and HA stem domain (blue). For comparison, germline gene usages of all antibodies from Genbank are also shown (green). To avoid being confounded by B-cell clonal expansion, a single clonotype from the same donor is considered as one antibody (see Methods).
Figure 2. Hydrophobicity of CDR H3 sequences.
(A-C) The hydrophobicity scores of (A) CDR H3 and (B) CDR H3 tip, as well as (C) the CDR H3 length, are compared between antibodies to HA head and HA stem domains. The p-values were computed by two-tailed Student’s t-tests. For the boxplot, the middle horizontal line represents the median. The lower and upper hinges represent the first and third quartiles, respectively. The upper whisker extends to the highest data point within 1.5x inter-quartile range (IQR) of the third quartile, whereas the lower whisker extends to the lowest data point within 1.5x IQR of the first quartile. Each data point represents one antibody. The horizontal dotted line indicates the mean among antibodies from Genbank.
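The boxplot convention described in this caption can be illustrated with a minimal Python sketch. This is not the authors' plotting code: the function name `boxplot_stats`, the toy values, and the quartile interpolation rule (`method="inclusive"`) are assumptions made for illustration, and plotting libraries differ in how they compute quartiles.

```python
from statistics import median, quantiles


def boxplot_stats(values):
    """Return the median, hinges, and whisker ends of a Tukey-style boxplot.

    Assumption: quartiles use linear interpolation over the closed data
    range (the "inclusive" method); other tools may use other rules.
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers stop at the most extreme data points that still lie
    # within 1.5 x IQR of the hinges; points beyond them are outliers.
    upper_whisker = max(v for v in data if v <= q3 + 1.5 * iqr)
    lower_whisker = min(v for v in data if v >= q1 - 1.5 * iqr)
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
    }


# Toy values (not from the paper); 30 falls outside the upper fence.
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(toy_scores)
```

In this toy example the value 30 lies above the upper fence, so the upper whisker ends at the next highest point rather than at the maximum.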
Figure 3. Antibody specificity prediction by memory B cell language model (mBLM).
(A) Model architecture of mBLM is shown. Arrows indicate the information flow in the network from the language model to antibody specificity prediction, with a final output of specificity class probability. Resi Rep: residue-level representation (i.e., the final-layer embeddings from pre-trained mBLM). (B) Model performance of mBLM on the test set was evaluated by a normalized confusion matrix. (C) The performance of different antibody specificity prediction models was evaluated by F1 score, which represents the weighted harmonic mean of the precision and recall. CDR encoders: our previous model using a transformer encoder to encode CDR sequences [15]. ESM2: a general protein language model [18].
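The two evaluation measures named in this caption, a normalized confusion matrix and the F1 score, can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the class labels, the toy predictions, and the choice to normalize each row by the true-class count are assumptions, and the caption does not state how per-class F1 scores were averaged.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def normalize_rows(matrix):
    """Divide each row by its total so each true class sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def f1_per_class(matrix):
    """F1 = 2 * precision * recall / (precision + recall), per class."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # predicted i, truly other
        fn = sum(matrix[i]) - tp                       # truly i, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


# Hypothetical labels and predictions, for illustration only.
labels = ["head", "stem", "other"]
y_true = ["head", "head", "stem", "stem", "other", "other"]
y_pred = ["head", "stem", "stem", "stem", "other", "head"]

cm = confusion_matrix(y_true, y_pred, labels)
cm_normalized = normalize_rows(cm)
f1_scores = f1_per_class(cm)
```

Row normalization makes the diagonal of the matrix equal to per-class recall, which keeps classes of different sizes comparable.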
Figure 4. Explanation of mBLM using saliency score.
(A) Saliency score for each residue in individual HA stem antibodies is shown as a heatmap. Each row represents a single HA stem antibody. The x-axis represents the amino acid residues of the heavy chain. Regions corresponding to CDR H1, H2, and H3 are indicated. For visualization purposes, only 50 HA stem antibodies are shown. Six clusters of HA stem antibodies were identified using hierarchical clustering with Ward’s method. (B) IGHD gene usage among antibodies in cluster 3 is shown. (C) The saliency score of each CDR H3 residue in IGHD3–9 antibodies within cluster 3 was analyzed. The frequency of each amino acid for residues with a saliency score >0.5 is shown as a sequence logo. Arrows at the bottom indicate the residues of interest. (D) Saliency scores are projected onto the structures of four antibodies in cluster 3 (PDB 4KVN [49], PDB 5KAQ [42], PDB 8GV6 [54], and PDB 3ZTJ [47]). The color scheme is the same as that in panel A. (E) The relationship between saliency score and distance to the antigen (i.e., HA stem) is shown as a scatter plot. Spearman’s rank correlation coefficient (ρ) is indicated. A total of 18 structures of HA stem antibodies in complex with HA were analyzed (PDB 3FKU, 3GBN, 3SDY, 3ZTJ, 4FQI, 4KVN, 4NM8, 4R8W, 5JW3, 5KAN, 5KAQ, 5K9K, 5K9O, 5K9Q, 5WKO, 6E3H, 6NZ7, and 8GV6) [, , , –54].
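Spearman's rank correlation coefficient, used in panel E, is the Pearson correlation computed on ranks. The sketch below shows one way to compute it in plain Python. The function names and the toy saliency and distance values are hypothetical and are not taken from the paper; tied values receive their average rank, which is a common convention.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): saliency falls as distance to the antigen grows.
saliency = [0.9, 0.7, 0.4, 0.2, 0.1]
distance = [3.5, 4.0, 8.0, 12.0, 15.0]
rho = spearman_rho(saliency, distance)
```

Because only ranks enter the calculation, ρ captures any monotonic relationship between saliency and distance, not just a linear one.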
Figure 5. Discovery of HA stem antibody by mBLM.
(A-B) mBLM was applied to predict the specificity of (A) 60 antibodies to the central stem epitope (left panel) and 38 to the anchor stem epitope (right panel) that were reported recently [57], as well as (B) 4,452 HA antibodies with unknown epitopes (HA unk) in the dataset that we assembled. The fraction of antibodies that were predicted to bind to HA stem domain (Predicted as HA stem), HA head domain (Predicted as HA head), or to other antigens (Not predicted as HA) is shown. (C) Using ELISA, the binding of 18 HA unk antibodies that were predicted as HA stem antibodies was tested against mini-HA, which is an H1 stem-based construct [58]. Four known HA stem antibodies (051–09 5A02, 051–09 5E03, 310–18C3, and FI6v3) [47, 63, 64] were included as positive controls. D2 H1–1/H3–1, which is a known HA head antibody [65], was included as a negative control. In this binding experiment, antibodies were not purified from the supernatant and thus their concentrations were unknown. (D) Representative 2D classes from cryo-EM analysis of 310–18A5 Fab in complex with H1N1 A/Solomon Islands/3/2006 (SI06) HA are shown. Cyan arrows point to the 310–18A5 Fabs. (E) Cryo-EM 3D reconstruction of 310–18A5 Fab in complex with SI06 HA. Structural models of SI06 HA (PDB 6XSK) [66] and CR9114 (PDB 4FQH) [48] were docked into the 3D reconstruction.

References

    1. Graham BS, Gilman MSA, McLellan JS. Structure-based vaccine antigen design. Annu Rev Med. 2019;70:91–104. doi: 10.1146/annurev-med-121217-094234.
    2. Lu RM, Hwang YC, Liu IJ, Lee CC, Tsai HZ, Li HJ, et al. Development of therapeutic antibodies for the treatment of diseases. J Biomed Sci. 2020;27(1):1. doi: 10.1186/s12929-019-0592-z.
    3. Winters A, McFadden K, Bergen J, Landas J, Berry KA, Gonzalez A, et al. Rapid single B cell antibody discovery using nanopens and structured light. mAbs. 2019;11(6):1025–35. doi: 10.1080/19420862.2019.1624126.
    4. Curtis NC, Lee J. Beyond bulk single-chain sequencing: Getting at the whole receptor. Curr Opin Syst Biol. 2020;24:93–9. doi: 10.1016/j.coisb.2020.10.008.
    5. Briney B, Inderbitzin A, Joyce C, Burton DR. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature. 2019;566(7744):393–7. doi: 10.1038/s41586-019-0879-y.