Cancer Immunol Res. 2020 Mar;8(3):396-408.
doi: 10.1158/2326-6066.CIR-19-0464. Epub 2019 Dec 23.

High-Throughput Prediction of MHC Class I and II Neoantigens with MHCnuggets

Xiaoshan M Shao et al. Cancer Immunol Res. 2020 Mar.

Abstract

Computational prediction of binding between neoantigen peptides and major histocompatibility complex (MHC) proteins can be used to predict patient response to cancer immunotherapy. Current neoantigen predictors focus on in silico estimation of MHC binding affinity and are limited by low predictive value for actual peptide presentation, inadequate support for rare MHC alleles, and poor scalability to high-throughput data sets. To address these limitations, we developed MHCnuggets, a deep neural network method that predicts peptide-MHC binding. MHCnuggets can predict binding for common or rare alleles of MHC class I or II with a single neural network architecture. Using a long short-term memory network (LSTM), MHCnuggets accepts peptides of variable length and is faster than other methods. When compared with methods that integrate binding affinity and MHC-bound peptide (HLAp) data from mass spectrometry, MHCnuggets yields a 4-fold increase in positive predictive value on independent HLAp data. We applied MHCnuggets to 26 cancer types in The Cancer Genome Atlas, processing 26.3 million allele-peptide comparisons in under 2.3 hours, yielding 101,326 unique predicted immunogenic missense mutations (IMM). Predicted IMM hotspots occurred in 38 genes, including 24 driver genes. Predicted IMM load was significantly associated with increased immune cell infiltration (P < 2 × 10⁻¹⁶), including CD8+ T cells. Only 0.16% of predicted IMMs were observed in more than 2 patients, with 61.7% of these derived from driver mutations. Thus, we describe a method for neoantigen prediction and its performance characteristics and demonstrate its utility in data sets representing multiple human cancers.

Conflict of interest statement

Disclosure of potential conflicts of interest

The terms of these arrangements are managed by Johns Hopkins University in accordance with its conflict of interest policies.

Figures

Figure 1. A) MHCnuggets’ architecture.
A network is trained for each MHC allele. Each network has an LSTM layer with 64 hidden units, a fully connected (FC) layer with 64 hidden units, and a final output layer of a single sigmoid unit. B) Input scheme for peptides with variable lengths. The MHCnuggets architecture can handle peptides of any length, but in practice a maximum length should be selected. Peptides are extended with padding until they reach the maximum length before being input into the neural network. The example shows padding for class II peptides with the maximum length set to 30 amino acids. C) Transfer learning protocol for parameter sharing among alleles. A base allele-specific network is trained for each MHC class, using the allele with the largest number of training examples. Transfer learning is then applied to train networks for the remaining alleles, with initial network weights set to the final base network weights. A fine-tuning step identifies alleles that can be leveraged for a second round of transfer learning to produce a final network (Methods).
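The padding scheme in panel B can be illustrated with a minimal Python sketch. The caption does not say which padding symbol is used, on which side the padding is added, or how residues are encoded, so the `Z` pad character, right-side padding, one-hot encoding, and the function names below are all assumptions made for illustration rather than the authors' implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "Z"  # hypothetical padding symbol (assumption, not from the paper)


def pad_peptide(peptide, max_len=30):
    """Right-pad a peptide with PAD until it is max_len residues long."""
    if len(peptide) > max_len:
        raise ValueError("peptide is longer than the chosen maximum length")
    return peptide + PAD * (max_len - len(peptide))


def one_hot(padded):
    """Encode each position as a 20-dim vector; padding positions stay all-zero."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in padded:
        row = [0] * len(AMINO_ACIDS)
        if residue != PAD:
            row[index[residue]] = 1
        rows.append(row)
    return rows


# Illustrative 13-residue peptide padded to the class II maximum of 30.
padded = pad_peptide("PKYVKQNTLKLAT", max_len=30)
encoded = one_hot(padded)
print(padded)
print(len(encoded), "positions x", len(encoded[0]), "features")
```

Every peptide then has the same shape (here 30 positions by 20 features), which is what lets a single fixed-size input layer accept peptides of different lengths.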
Figure 2. MHCnuggets’ features.
A) Venn diagram representation of the MHC-peptide binding prediction functions of MHCnuggets and similar tools. B) Training and MHC allele model selection scheme for MHCnuggets.
Figure 3. MHC class I benchmark comparisons.
A) PPVn for MHC class I allele-specific prediction on binding affinity test sets from Bonsack et al. (7 alleles) and Kim et al. (53 alleles) (5,8). B) PPVn for MHC class I allele-specific prediction on the HLAp BST data set (Bassani-Sternberg et al. and Trolle et al. (7,22)), stratified by allele (6 alleles). C) PPVn for MHC class I allele-specific prediction on the HLAp BST data set (from B), stratified by peptide sequence length. D) True and false positives for each method on the top 50 ranked peptides from the HLAp BST data set. PPVn = positive predictive value on the top n ranked peptides, where n is the number of true binders. TP = true positives. FP = false positives.
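The PPVn definition in this caption translates directly into code. The sketch below assumes that a higher score means stronger predicted binding and that labels are 1 for true binders and 0 otherwise; the function name and the toy numbers are illustrative and do not come from the paper.

```python
def ppv_n(scores, labels):
    """Positive predictive value on the top n ranked peptides,
    where n is the number of true binders in the test set."""
    n = sum(labels)
    if n == 0:
        raise ValueError("PPVn is undefined when there are no true binders")
    # Rank peptides from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = ranked[:n]
    return sum(label for _, label in top_n) / n


# Toy example: 3 true binders, 2 of which land in the top 3 by score.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 1, 0]
print(ppv_n(toy_scores, toy_labels))
```

Because n is fixed to the number of true binders, a perfect ranking scores 1.0 regardless of how imbalanced the test set is.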
Figure 4. MHC class II benchmark comparisons.
A) PPVn for MHC class II allele-specific prediction on the binding affinity test set from Jensen et al. (27 alleles, stratified by allele). B) auROC, K-Tau, and Pearson r scores for MHC class II alleles from five-fold cross-validation. NetMHCII2.3 performance is from their self-reported auROC. auROC = area under the receiver operating characteristic curve. K-Tau = Kendall’s tau correlation. PPVn = positive predictive value on the top n ranked peptides, where n is the number of true binders.
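For readers unfamiliar with auROC, a minimal sketch follows. It uses the standard pairwise (Mann-Whitney) formulation: the fraction of binder/non-binder pairs in which the binder receives the higher score, with ties counted as half. The caption does not say how the authors computed the metric, so the function name and toy values here are assumptions; Kendall's tau and Pearson r are omitted.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("auROC needs at least one binder and one non-binder")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # binder ranked above non-binder
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 4 of the 6 binder/non-binder pairs are ordered correctly.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 1, 0]
print(auroc(toy_scores, toy_labels))
```

The double loop is quadratic in the number of peptides, which is fine for illustration; a rank-based computation would be preferable on large test sets.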
Figure 5. MHC class I and II benchmark comparisons to estimate rare allele performance.
A) Schematic representation of leave-one-molecule-out (LOMO) testing. B) PPVn for MHC class I rare allele prediction on the IEDB pseudo-rare alleles binding affinity test set (20 alleles, stratified by allele). C) PPVn for MHC class II rare allele prediction on the binding affinity test set from Jensen et al. (27 alleles, stratified by allele) (39). D) auROC for MHC class II rare allele prediction on the LOMO binding affinity test set from Jensen et al. (27 alleles, stratified by allele) (39). NetMHCIIpan3.2 results are from their self-reported auROC. auROC = area under the receiver operating characteristic curve. PPVn = positive predictive value on the top n ranked peptides, where n is the number of true binders.
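As a rough illustration of the LOMO idea, the sketch below holds out all measurements for one MHC molecule as the test set and uses the remaining molecules for training, repeating this for every molecule. The caption gives only a schematic, so this is the generic form of the scheme and not necessarily the exact protocol used in the paper; the record layout, allele labels, peptides, and values are hypothetical placeholders.

```python
from collections import defaultdict


def lomo_splits(records):
    """Yield (held_out_allele, train, test) with one allele left out per split.

    Each record is a tuple whose first element is the allele name.
    """
    by_allele = defaultdict(list)
    for record in records:
        by_allele[record[0]].append(record)
    for held_out in sorted(by_allele):
        test = by_allele[held_out]
        train = [
            record
            for allele in sorted(by_allele)
            if allele != held_out
            for record in by_allele[allele]
        ]
        yield held_out, train, test


# Placeholder records: (allele, peptide, measurement). Not real data.
records = [
    ("allele_A", "AAAAAAAAA", 0.9),
    ("allele_A", "CCCCCCCCC", 0.1),
    ("allele_B", "DDDDDDDDD", 0.7),
    ("allele_C", "EEEEEEEEE", 0.3),
    ("allele_C", "FFFFFFFFF", 0.5),
]

for held_out, train, test in lomo_splits(records):
    print(held_out, "train:", len(train), "test:", len(test))
```

Because the held-out molecule contributes no training data, performance on it approximates what to expect for a rare allele with few or no measurements.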
Figure 6. Timing and scalability.
Runtime benchmark of tested methods using versions available on October 1, 2019 over a range of inputs (up to 1 million peptides). A) MHC class I prediction. B) MHC class II prediction.
Figure 7. MHC class I IMMs in TCGA patients.
A) Number of predicted immunogenic missense mutations (IMMs) identified in 6,613 TCGA patients. Dotted line = mean IMMs per patient (15.6). Note that 123 patients had >100 predicted IMMs but are not included for visual clarity. B) Number of predicted IMMs by cancer type. C) IMMs shared by three or more patients and the cancer types in which they occurred. Each row represents a cancer type and each column illustrates the overlap of IMMs seen in a single cancer type or multiple cancer types. For example, the first column shows the number of IMMs shared among patients with colorectal adenocarcinoma (COAD) and uterine corpus endometrial carcinoma (UCEC). Bars to the left show the total number of unique IMMs in each cancer type. *Bar heights reflect the count of unique shared IMMs, not the total number of patients in which the IMM was observed. Cancer type abbreviations are in Methods. Image generated with UpSetR. D) Fibroblast growth factor receptor 3 (FGFR3) IMM hot region identified by HotMAPs in bladder cancer (BLCA). IMMs shown and number of BLCA patients with the IMM: p.E216K (1), p.D222N (1), p.G235D (1), p.R248C (3), and p.S249C (24). Except for p.G235D, these IMMs are proximal to the interface of FGFR3 protein and the light and heavy chains of an antibody fragment designed for therapeutic application in bladder cancer (PDB ID: 3GRW) (61).
ACC, adrenocortical carcinoma; BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; CHOL, cholangiocarcinoma; COAD, colon adenocarcinoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; KICH, kidney chromophobe; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LGG, brain lower grade glioma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; PAAD, pancreatic adenocarcinoma; PCPG, pheochromocytoma and paraganglioma; PRAD, prostate adenocarcinoma; READ, rectum adenocarcinoma; SARC, sarcoma.

References

    1. Anagnostou V, Smith KN, Forde PM, Niknafs N, Bhattacharya R, White J, et al. Evolution of Neoantigen Landscape during Immune Checkpoint Blockade in Non–Small Cell Lung Cancer. Cancer Discovery 2017
    2. Yarchoan M, Johnson BA, Lutz ER, Laheru DA, Jaffee EM. Targeting neoantigens to augment antitumour immunity. Nature reviews Cancer 2017;17:209–22
    3. Lundegaard C, Lund O, Buus S, Nielsen M. Major histocompatibility complex class I binding predictions as a tool in epitope discovery. Immunology 2010;130:309–18
    4. Andreatta M, Nielsen M. Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 2016;32:511–7
    5. Kim Y, Sidney J, Buus S, Sette A, Nielsen M, Peters B. Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions. BMC Bioinformatics 2014;15:241
