PLoS Comput Biol. 2024 Nov 11;20(11):e1012511.
doi: 10.1371/journal.pcbi.1012511. eCollection 2024 Nov.

A modular protein language modelling approach to immunogenicity prediction

Hugh O'Brien et al. PLoS Comput Biol. 2024.

Abstract

Neoantigen immunogenicity prediction is a highly challenging problem in the development of personalised medicines. Low reactivity rates among called neoantigens create a difficult prediction scenario with limited training datasets. Here we describe ImmugenX, a modular protein language modelling approach to immunogenicity prediction for CD8+ reactive epitopes. ImmugenX comprises a pMHC encoding module trained on three pMHC prediction tasks, an optional TCR encoding module and a set of context-specific immunogenicity prediction head modules. Compared with state-of-the-art models for each task, ImmugenX's encoding module performs comparably or better on the pMHC binding affinity, eluted ligand prediction and stability tasks. ImmugenX outperforms all compared models on pMHC immunogenicity prediction (area under the receiver operating characteristic curve: 0.619; average precision: 0.514), with a 7% increase in average precision compared to the next best model. Integrating TCR context information further improves immunogenicity prediction performance. ImmugenX performance is further analysed for interpretability, which locates areas of weakness found across existing immunogenicity models and highlights possible biases in public datasets.
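Both headline metrics depend only on how a model ranks positive and negative examples. The following is a minimal sketch of how they can be computed, not the authors' evaluation code: the function names are hypothetical, the labels and scores are made-up toy values, and tied scores get no special treatment in the average precision function.

```python
def roc_auc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Tied positive/negative pairs count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (illustrative only): 1 = immunogenic, 0 = non-immunogenic.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

print(f"AUROC = {roc_auc(labels, scores):.3f}")
print(f"AP    = {average_precision(labels, scores):.3f}")
```

Average precision is sensitive to the fraction of positives in the test set, so values are only comparable between models evaluated on the same dataset.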

Conflict of interest statement

I have read the journal’s policy and the authors of this manuscript have the following competing interests: S.A.Q. is co-founder and chief scientific officer of, and owns shares in, Achilles Therapeutics. C.S. acknowledges grants from AstraZeneca, Boehringer-Ingelheim, Bristol Myers Squibb, Pfizer, Roche-Ventana, Invitae (previously Archer Dx Inc., a collaboration in minimal residual disease sequencing technologies), Ono Pharmaceutical and Personalis. He is chief investigator for the AZ MeRmaiD 1 and 2 clinical trials and is the steering committee chair. He is also co-chief investigator of the NHS Galleri trial funded by GRAIL and a paid member of GRAIL’s scientific advisory board (SAB). He receives consultant fees from Achilles Therapeutics (also an SAB member); Bicycle Therapeutics (also an SAB member); Genentech; Medicxi; the China Innovation Centre of Roche (CICoR), formerly Roche Innovation Centre Shanghai; Metabomed (until July 2022); Relay Therapeutics; and the Sarah Cannon Research Institute. C.S. has received honoraria from Amgen, AstraZeneca, Bristol Myers Squibb, GlaxoSmithKline, Illumina, MSD, Novartis, Pfizer and Roche-Ventana; previously held stock options in Apogen Biotechnologies and GRAIL; currently has stock options in Epic Bioscience and Bicycle Therapeutics; and has stock options and is co-founder of Achilles Therapeutics. S.R.H. is the cofounder of PokeAcell and is co-inventor of licensed patents related to T cell detection. H.O’B., M.S., L.M. and F.O’F. are employees of Achilles Therapeutics. H.O’B. and M.S. are both named inventors on patents for ImmugenX.

Figures

Fig 1. ImmugenX pMHC pre-trained tasks module.
A: The model architecture of the pMHC module of ImmugenX, trained to perform 3 tasks in series. Weights of the transformer blocks are saved for use in all downstream models for immunogenicity and TCR specificity. B: Performance of the pMHC module on the binding affinity task, predicting unseen pMHC affinities, compared against MHCflurry 2.0. Left: ROC curve for predicting strong affinities (≤ 500 nM), Right: Pearson’s correlation coefficient between predictions and measured binding affinities. C: Performance on the eluted ligand holdout set compared against BigMHC and netMHCpan 4.1. Left: ROC curve, Right: Scatter plots for ImmugenX against BigMHC-EL and netMHCpan 4.1 per HLA in the test set. D: Pearson’s correlation coefficient for HLAs in the stability dataset during cross-validation, comparing the ImmugenX pMHC module trained from scratch with the module fine-tuned on the BA-EL tasks.
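The two binding affinity evaluations in panel B can be illustrated with a short sketch. The 500 nM threshold comes from the caption; everything else is an assumption made for illustration, including the log transform with a 50,000 nM ceiling (a rescaling commonly used for affinity data, not confirmed here as the authors' choice), the function names, and the toy values.

```python
import math

STRONG_BINDER_NM = 500.0  # threshold stated in the caption


def is_strong_binder(affinity_nm):
    """Binary label used for the ROC analysis: strong if affinity <= 500 nM."""
    return affinity_nm <= STRONG_BINDER_NM


def to_unit_scale(affinity_nm, max_nm=50000.0):
    """Map nM affinities to [0, 1] with 1 - log(aff) / log(max_nm).

    ASSUMPTION: a commonly used rescaling, not taken from the paper.
    """
    clipped = min(max(affinity_nm, 1.0), max_nm)
    return 1.0 - math.log(clipped) / math.log(max_nm)


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy measurements (nM) and toy model outputs, illustrative only.
measured_nm = [12.0, 480.0, 500.0, 2300.0, 41000.0]
predicted = [0.80, 0.45, 0.40, 0.30, 0.05]

labels = [is_strong_binder(a) for a in measured_nm]
r = pearson([to_unit_scale(a) for a in measured_nm], predicted)
print(labels, round(r, 3))
```

Correlating on a log scale keeps a handful of very weak binders from dominating the coefficient.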
Fig 2. pMHC immunogenicity prediction with ImmugenX.
A: Architecture of ImmugenX-pMHC. The pretrained pMHC module provides a combined encoding of the pMHC for input to an immunogenicity classification transformer module, which is trained on immunogenicity datasets. B: ROC curve on the cancer immunogenicity holdout test set against other modelling approaches. The EL and stability pMHC modules for ImmugenX are also shown. ImmugenX is shown trained with both of these as a base module. C: Precision-recall curves for the holdout test dataset. D: Composition and filtering steps for the holdout cancer test set. CEDAR data are supplemented with non-overlapping data points from other public data sources and with test sets from compared models’ publications that are not already in larger public databases.
Fig 3. The method for integrating TCRs into ImmugenX predictions evaluated by performance on TCR specificity tasks.
A: Masking method to train a TCR chain encoder, performed iteratively first for beta chains and subsequently fine-tuned to process alpha chains. B: Architecture for integrating TCRs into ImmugenX predictions by concatenating module outputs into a single sentence to be processed by the prediction head. C and D: Performance of ImmugenX on the NetTCR 2.1 dataset task. Both models are trained on a dataset with 5-fold cross-validation using folds provided by the authors and tested on the 6th fold (results shown). E: Cross-validation performance of both STAPLER and ImmugenX on the training dataset provided by the authors. F and G: Holdout test performance of ImmugenX on the STAPLER holdout set after training using their dataset. An additional version of ImmugenX without HLA inputs is shown to demonstrate the negative impact of HLA inputs in this data split.
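The masking step in panel A can be sketched as the input preparation for a masked-token objective: some residues are hidden and the encoder is trained to recover them. This is an illustrative sketch only. The 15% mask rate, the `<mask>` token string, the function name and the example CDR3 string are all assumptions, not details taken from the paper.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of residues and return (masked tokens, targets).

    targets maps each masked position to the residue the encoder must recover.
    ASSUMPTION: mask_rate=0.15 is a common default, not the authors' setting.
    """
    rng = rng or random.Random()
    tokens = list(seq)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


# Example beta-chain CDR3 string, used purely as a placeholder input.
cdr3 = "CASSLGQAYEQYF"
tokens, targets = mask_sequence(cdr3, rng=random.Random(0))
print(tokens)
print(targets)
```

Because the targets come from the sequence itself, this kind of pre-training needs no binding or immunogenicity labels.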
Fig 4. pMHC immunogenicity prediction with additional TCR information.
ImmugenX makes pMHC immunogenicity predictions in a given TCR context. A: Architecture for integrating CDR3 sequences into ImmugenX predictions, utilising a pretrained self-supervised CDR3 encoder. B: Training dataset sources for fine-tuning the immunogenicity prediction head of ImmugenX. C: Composition of the fine-tuning dataset, utilising three types of negative data: positive pMHCs paired with negative background TCRs; correctly matched MHC-TCR pairs with negative background wild-type peptides from the human proteome; and non-immunogenic pMHCs from TCR assays paired with TCRs from the positive training set. D: Composition of the neoantigen holdout test set from CEDAR, restricted to positives with known TCR triplets. Negative data points were created by pairing negative pMHCs with TCRs from the positive set. E and F: Performance curves for ImmugenX with and without the TCR information input, along with netMHCpan 4.1 and BigMHC.
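The first type of negative data in panel C, positive pMHCs paired with background TCRs, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the one-negative-per-positive ratio and the rule of excluding known positive pairs are hypothetical choices, and the sequences are placeholder strings rather than data from the paper.

```python
import random


def sample_negative_pairs(positive_pairs, background_tcrs, per_positive=1, seed=0):
    """Pair each positive pMHC with background TCRs it is not known to bind.

    Returns (pmhc, tcr, 0) triples, where 0 marks a negative example.
    """
    rng = random.Random(seed)
    known = set(positive_pairs)
    negatives = []
    for pmhc, _ in positive_pairs:
        # Skip any background TCR that is a known positive for this pMHC.
        candidates = [t for t in background_tcrs if (pmhc, t) not in known]
        for tcr in rng.sample(candidates, min(per_positive, len(candidates))):
            negatives.append((pmhc, tcr, 0))
    return negatives


# Placeholder (peptide, HLA) and CDR3 strings, illustrative only.
positive_pairs = [
    (("GILGFVFTL", "HLA-A*02:01"), "CASSIRSSYEQYF"),
    (("NLVPMVATV", "HLA-A*02:01"), "CASSPVTGGIYGYTF"),
]
background_tcrs = ["CASSLAPGATNEKLFF", "CASSQDRGTEAFF", "CASSIRSSYEQYF"]

negatives = sample_negative_pairs(positive_pairs, background_tcrs)
for row in negatives:
    print(row)
```

Negatives built this way are presumed rather than measured non-binders, so some label noise is expected.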
Fig 5. pMHC encoding representations of physicochemical properties.
t-SNE visualisations of both the CLS token and all peptide tokens, coloured by the physicochemical properties of the input peptide. R² values are shown for predicting the true value of each metric with a support vector machine model using the encodings as feature inputs. The bottom plots show the respective t-SNE plots coloured by the 8 most frequent HLA alleles, demonstrating separation between encodings for some HLAs in both the CLS token and peptide token encodings.
Fig 6. Input residue usage by ImmugenX from estimated SHAP values.
A: SHAP values are estimated by masking the test inputs based on the background distribution of inputs calculated from the training set. Over permutations of masked input values, the importance of each input residue to ImmugenX can be estimated. B: Peptide residue importance for the pMHC immunogenicity model across the immunogenicity test set. C: Peptide residue importance split by peptide length for the 4 most prevalent lengths. D/E: Peptide residue importance for the stability pMHC base module across all data and the most common peptide lengths. F: CDR3 residue usage for the TCR-Immunogenicity prediction model. G: Relative mean importance for all residues in the TCR-Immunogenicity model across peptide, HLA and TCR fractions. H: SHAP values for HLA-A*02:01 9mers for the stability pMHC module, demonstrating specific importance applied to the established anchor residues at positions 2 and 9.
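The permutation idea in panel A can be sketched with a small Monte Carlo Shapley estimator. It starts from a fully masked (background) input, reveals the true residues one at a time in random order, and credits each residue with the change in model output it causes. This is a simplified illustration, not the SHAP library or the authors' pipeline: it uses a single background sequence instead of a background distribution, and `toy_model`, its weights and the peptide are hypothetical.

```python
import random


def shapley_by_permutation(model, x, background, n_perm=200, seed=0):
    """Estimate per-position Shapley values by averaging marginal contributions."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        current = list(background)  # start fully masked
        prev = model(current)
        for i in order:
            current[i] = x[i]  # reveal the true residue at position i
            new = model(current)
            phi[i] += new - prev
            prev = new
    return [p / n_perm for p in phi]


def toy_model(residues):
    """Hypothetical scorer rewarding anchor residues at positions 2 and 9."""
    score = 0.0
    if residues[1] == "L":
        score += 0.5
    if residues[8] == "V":
        score += 0.3
    if residues[1] == "L" and residues[8] == "V":
        score += 0.2  # interaction between the two anchors
    return score


peptide = list("ALAAAAAAV")     # toy 9mer
background = list("AAAAAAAAA")  # single masked reference, an assumption
phi = shapley_by_permutation(toy_model, peptide, background)
print([round(p, 3) for p in phi])
```

The estimates sum to the difference between the model output on the real input and on the background input, which is a useful sanity check for any attribution of this kind.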
Fig 7. Model Performance Analysis on pMHC Test Set.
A: Scatter plot of the ImmugenX stability pMHC sub-module and NetMHCstabpan, demonstrating high agreement on the test set along with a large population of positive data points with low predicted stability by both models. B: ImmugenX scores plotted against its stability-trained pMHC sub-module. The improved performance was achieved by the rescue of low-stability pMHCs after training on the immunogenicity task. C: Scatter plot of ImmugenX and BigMHC scores. The red dashed line shows the mean scores for both models on the whole dataset. Break-out tables show metrics for positive and negative samples separately in each quadrant, indicating shifts in mean values for correctly and incorrectly predicted pMHCs, along with differences between the models.
