Bioinform Adv. 2024 Nov 29;4(1):vbae190. doi: 10.1093/bioadv/vbae190. eCollection 2024.

epiTCR-KDA: knowledge distillation model on dihedral angles for TCR-peptide prediction

My-Diem Nguyen Pham et al. Bioinform Adv. 2024.

Abstract

Motivation: Predicting the binding between T-cell receptors (TCRs) and antigens is crucial for advances in immunotherapy. However, most current TCR-peptide interaction predictors struggle to perform well on unseen data. This limitation may stem from the conventional use of TCR and/or peptide sequences as input, which may not adequately capture their structural characteristics. Incorporating the structural information of TCRs and peptides into the prediction model is therefore necessary to improve its generalizability.

Results: We developed epiTCR-KDA (KDA stands for Knowledge Distillation model on Dihedral Angles), a new predictor of TCR-peptide binding that uses the dihedral angles between the residues of the peptide and the TCR as a structural descriptor. This structural information was integrated into a knowledge distillation model to enhance its generalizability. epiTCR-KDA demonstrated competitive prediction performance, with an area under the curve (AUC) of 1.00 on seen data and 0.91 on unseen data. On public datasets, epiTCR-KDA consistently outperformed other predictors, maintaining a median AUC of 0.93. Further analysis revealed that the cosine similarity between the dihedral angle vectors of the unseen testing data and those of the training data is crucial for its stable performance. In conclusion, our epiTCR-KDA model represents a significant step toward a highly effective pipeline for antigen-based immunotherapy.
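The cosine-similarity measure underlying this analysis can be sketched in a few lines. The sketch below is illustrative only: the angle values are hypothetical, and the assumption that each sequence's phi/psi dihedral angles are flattened into a single vector is ours, not a detail confirmed by the abstract.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two flat dihedral-angle vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical (phi, psi) angles in degrees, flattened per sequence
training_angles = [-60.0, -45.0, -70.0, 150.0]
testing_angles = [-58.0, -47.0, -65.0, 145.0]
similarity = cosine_similarity(training_angles, testing_angles)  # close to 1.0
```

A testing vector with similarity near 1.0 to some training vector would, per the authors' analysis, be the regime where predictions stay stable.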

Availability and implementation: epiTCR-KDA is available on GitHub (https://github.com/ddiem-ri-4D/epiTCR-KDA).


Conflict of interest statement

The authors have no conflicts of interest to declare, except that M.-D.P. holds shares in NexCalibur Therapeutics, the company that has provided funding for the research presented in this publication.

Figures

Figure 1.
Overview of epiTCR-KDA. (A) Diagram illustrating data collection for training and evaluation of epiTCR-KDA. Five public databases [IEDB (Vita et al. 2019), VDJdb (Shugay et al. 2018), TBAdb (Zhang et al. 2020), McPAS-TCR (Tickotsky et al. 2017), and 10X (A New Way of Exploring Immunity—Linking Highly Multiplexed Antigen Recognition to Immune Repertoire and Phenotype | Technology Networks, 2020)] were collected for TCR-peptide pairs, with publicly collected TCRs labeled as (1) and publicly collected peptides labeled as (2) (Supplementary Table S1). Three databases [TSNAdb (Wu et al. 2023a), Neodb (Wu et al. 2023b), and NEPdb (Xia et al. 2021)] were gathered for self-peptides (wildtype peptides), labeled as (3). These peptides were randomly combined with TIL TCRs, labeled as (4), from public TCR-peptide pairs to form non-binding pairs [i.e. (3) combined with (4)]. Additionally, non-binding pairs were also generated from TIL CDR3β sequences with public wildtype peptides [i.e. (1) combined with (3)]. The data were divided into training data (Supplementary Fig. S2) and testing data covering various data sources and seen and unseen peptides (Supplementary Table S2). (B) Data preprocessing steps, starting from the conversion of CDR3β/peptide amino acid sequences to 3D structures using OmegaFold, followed by the calculation of the phi and psi angles, and processing of this information as input for the model (Supplementary Fig. S1). (C) Structure of the KD model. The CDR3β and peptide representations (phi and psi angles) were concatenated, padded, and served as input for the KD model. The KD model involved a student model learning from the information provided by the teacher model (soft loss) and ground-truth labels (hard loss). The model was trained to predict the binding or non-binding of CDR3β-peptide pairs.
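The combined objective described in panel C (a soft loss against the teacher plus a hard loss against ground-truth labels) can be sketched as a weighted sum. This is a generic knowledge-distillation formulation for binary outputs, not the authors' exact implementation; the weight `alpha` and the use of binary cross-entropy for both terms are our assumptions.

```python
import math

def bce(p, y):
    """Binary cross-entropy of prediction p in (0, 1) against target y."""
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def distillation_loss(student_p, teacher_p, label, alpha=0.5):
    """Weighted sum of the hard loss (ground truth) and soft loss (teacher).

    alpha weights the hard-label term; (1 - alpha) weights the teacher term.
    """
    hard = bce(student_p, label)      # student vs. ground-truth label
    soft = bce(student_p, teacher_p)  # student vs. teacher's soft target
    return alpha * hard + (1 - alpha) * soft
```

With `alpha=1.0` the student ignores the teacher entirely; with `alpha=0.0` it trains purely on the teacher's soft targets.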
Figure 2.
The performance of epiTCR-KDA, epiTCR, NetTCR, BERTrand, TEIM-Seq, TEINet, and ImRex across different benchmark settings: (A) original models tested on 10 datasets containing peptides unseen during the training of all those models, (B) retrained models on 10 overall testing sets including both seen and unseen data, (C) retrained models on data derived from seen peptides, (D) retrained models on data derived from unseen peptides, and (E) retrained models on data derived from 7 dominant unseen peptides (Supplementary Table S7). Performance was measured by AUC. Each bar indicates the mean performance over ten testing sets, and the error bar indicates the standard deviation. The original models of epiTCR and NetTCR were also benchmarked on interactions of unseen peptides; however, epiTCR produced only positive predictions, while NetTCR gave only negative predictions for all interactions. Consequently, AUC was not calculated for epiTCR and NetTCR in this testing scenario (Supplementary Table S5).
Figure 3.
The influence of CDR3β and peptide structural information in the training data on predictions by epiTCR-KDA. (A) Nine CDR3β sequences and (B) nine peptides were chosen to represent nine clusters within the testing sets, and the predicted labels of their represented clusters were compared with the labels in training bins at different levels of dihedral angle-based cosine similarity using RMSE. The lower the RMSE, the more similar the predicted labels are to the training labels.
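The RMSE comparison used in this figure is a standard computation; as a minimal sketch, it could look like the following, where the label vectors are hypothetical stand-ins for predicted and training binding labels within one cosine-similarity bin.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error between two equal-length label vectors."""
    if len(predicted) != len(reference):
        raise ValueError("label vectors must have the same length")
    squared_error = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared_error / len(predicted))

# Hypothetical binding labels (1 = binding, 0 = non-binding) in one bin
predicted_labels = [1, 0, 1, 1, 0]
training_labels = [1, 0, 0, 1, 0]
# One mismatch out of five labels -> sqrt(1/5) ~ 0.447
error = rmse(predicted_labels, training_labels)
```

An RMSE of 0 would mean the predicted labels in a bin reproduce the training labels exactly.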
Figure 4.
The performance of different models across diverse testing scenarios: (A) epiTCR-KDA and original models on nine combined datasets, (B) epiTCR-KDA and original models on the COVID-19 dataset, (C) epiTCR-KDA and original models on four datasets with an increasing number of non-binding pairs, (D) epiTCR-KDA and retrained models on nine combined datasets, (E) epiTCR-KDA and retrained models on the COVID-19 dataset, and (F) epiTCR-KDA and retrained models on four datasets with an increasing number of non-binding pairs.

References

    1. A New Way of Exploring Immunity—Linking Highly Multiplexed Antigen Recognition to Immune Repertoire and Phenotype | Technology Networks. 2020. Retrieved March 22, 2024. https://www.technologynetworks.com/immunology/application-notes/a-new-wa...
    2. Baek M, DiMaio F, Anishchenko I et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373:871–6. doi: 10.1126/science.abj8754
    3. Bio.PDB.internal_coords module. Biopython 1.84.dev0 documentation. Retrieved March 22, 2024. https://biopython.org/docs/dev/api/Bio.PDB.internal_coords.html
    4. Chawla A, Yin H, Molchanov P et al. Data-free knowledge distillation for object detection. In: Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021), p. 3288–97. doi: 10.1109/WACV48630.2021.00333
    5. Croce G, Lani R, Tardivon D et al. Phage display profiling of CDR3β loops enables machine learning predictions of NY-ESO-1 specific TCRs. bioRxiv, 2024. doi: 10.1101/2024.06.27.600973, preprint: not peer reviewed.
