PLoS One. 2023 Mar 7;18(3):e0265313.
doi: 10.1371/journal.pone.0265313. eCollection 2023.

Dynamic kernel matching for non-conforming data: A case study of T cell receptor datasets

Jared Ostmeyer et al. PLoS One.

Abstract

Most statistical classifiers are designed to find patterns in data where numbers fit into rows and columns, like in a spreadsheet, but many kinds of data do not conform to this structure. To uncover patterns in such non-conforming data, we describe an approach for modifying established statistical classifiers, which we call dynamic kernel matching (DKM). As examples of non-conforming data, we consider (i) a dataset of T-cell receptor (TCR) sequences labelled by disease antigen and (ii) a dataset of sequenced TCR repertoires labelled by patient cytomegalovirus (CMV) serostatus, anticipating that both datasets contain signatures for diagnosing disease. We successfully fit statistical classifiers augmented with DKM to both datasets and report the performance on holdout data using standard metrics and metrics allowing for indeterminate diagnoses. Finally, we identify the patterns used by our statistical classifiers to generate predictions and show that these patterns agree with observations from experimental studies.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. TCRs are examples of non-conforming data.
(a) Most statistical classifiers assume a fixed number of features (e.g. five) and that each feature represents the same kind of information across samples (e.g. shape). (b) These assumptions do not hold for non-conforming data. (c) A dataset of TCRs labelled by interaction with disease antigens. The dataset contains the amino acid symbols from a region of the TCR (CDR3) represented as sequences, which are examples of non-conforming data. (d) A dataset of TCR repertoires labelled by CMV serostatus. The dataset contains sequenced TCR repertoires represented as sets of sequences, which is a different kind of non-conforming data from the previous dataset. (e) Samples are split into training, validation, and test cohorts (for panel c, identical sequences are first collapsed to ensure the same TCR does not appear in multiple cohorts). The training cohort is used to select the weights and biases of each model, the validation cohort is used for model selection, and the test cohort is used for reporting results.
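
The cohort split described in panel (e) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `split_cohorts`, the 60/20/20 proportions, the random seed, and the rule of keeping the first label seen for a duplicated sequence are all assumptions made for the example.

```python
import random

def split_cohorts(samples, frac_train=0.6, frac_val=0.2, seed=0):
    """Collapse identical sequences, then split into three cohorts.

    samples: list of (cdr3_sequence, label) pairs.
    The fractions and seed are illustrative assumptions.
    """
    # Collapse identical sequences so one TCR cannot land in two cohorts.
    # Assumption: when a sequence repeats, keep the first label seen.
    unique = {}
    for seq, label in samples:
        unique.setdefault(seq, label)

    # Sort before shuffling so the split is reproducible for a given seed.
    items = sorted(unique.items())
    random.Random(seed).shuffle(items)

    n_train = int(frac_train * len(items))
    n_val = int(frac_val * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Collapsing before splitting matters here: splitting first and removing duplicates afterwards could leave the same CDR3 sequence in both the training and test cohorts, which would inflate the reported test performance.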
Fig 2. Results on the antigen classification problem.
(a) The fit plotted for each weight update across the training (solid blue) and validation (solid red) cohorts steadily improves with each gradient optimization step. The fit to the test cohort after unblinding the samples (triangle) is significantly better than the baseline performance achievable by random chance (dashed black). (b) A confusion matrix of samples from the test cohort reveals the fraction of predictions that agree with the labels for each category. (c) A 3D X-ray crystallographic structure of a TCR bound to GILGFVFTL:A0201. An alanine scan of the TCR CDR3 sequences reveals that the largest |Δlogit| always corresponds to a contact position (asterisks).
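
The alanine scan in panel (c) can be illustrated with a short sketch. It assumes the fitted model is available as a function mapping a CDR3 string to a logit; `toy_logit` below is a hypothetical stand-in with made-up residue weights, not the model from the paper, and the example sequence is arbitrary.

```python
def alanine_scan(cdr3, logit_fn):
    """Return |delta logit| for replacing each residue with alanine."""
    base = logit_fn(cdr3)
    deltas = []
    for i, residue in enumerate(cdr3):
        if residue == "A":
            # Replacing alanine with alanine leaves the sequence unchanged.
            deltas.append(0.0)
            continue
        mutant = cdr3[:i] + "A" + cdr3[i + 1:]
        deltas.append(abs(logit_fn(mutant) - base))
    return deltas

def toy_logit(seq):
    """Hypothetical scoring function used only for this illustration."""
    weights = {"R": 2.0, "S": 0.5}  # made-up residue weights
    return sum(weights.get(aa, 0.0) for aa in seq)

deltas = alanine_scan("CASSIRSSYEQYF", toy_logit)
most_sensitive = deltas.index(max(deltas))  # position with the largest change
```

The position with the largest |Δlogit| is the one the model depends on most for that sequence, which is what the caption compares against the contact positions in the crystal structure.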
Fig 3. Results on the repertoire classification problem.
(a) The fit plotted for each weight update across the training (solid blue) and validation cohorts (solid red) steadily improves with each gradient optimization step. The fit to the test cohort after unblinding the samples (triangle) is better than the baseline performance achievable by random chance (dashed black). (b) A ROC curve reveals the sensitivity and specificity for various diagnostic thresholds of the model. (c) CDR3 sequences from receptors specific to various pMHCs are scored using the kernel of the model fitted in panel a. The receptors specific to a CMV peptide have the highest score, suggesting that the model has learned to identify receptors specific to this CMV peptide.
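
For readers unfamiliar with panel (b), the points on a ROC curve can be computed as in the sketch below. It assumes binary labels (1 for CMV-positive, 0 for CMV-negative) and one model score per sample; the function name `roc_points` and the example values are invented for illustration.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A sample is called positive when its score is >= the threshold.
    Assumes at least one positive and one negative label.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        points.append((threshold, true_pos / n_pos, true_neg / n_neg))
    return points

points = roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
```

Lowering the threshold raises sensitivity and lowers specificity, so each diagnostic threshold corresponds to one point on the curve.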
Fig 4. Results with confidence cutoffs.
The bar charts show the fraction of samples that are correctly classified (blue), incorrectly classified (red), and indeterminate (grey). (a) Using a 95% confidence cutoff determined on the validation cohort, the classification accuracy on the test cohort of the antigen classification problem is 97.06%, capturing 44.5% of samples. The samples are not captured evenly across the six categories. (b) Using a 95% confidence cutoff determined on the validation cohort, the classification accuracy on the test cohort of the repertoire classification problem is 96.0%, capturing 18.0% of samples. The samples are captured almost evenly across the two categories.
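
One way to read the confidence cutoff is sketched below. The caption does not spell out how the cutoff is chosen, so this example assumes a particular rule: pick the smallest confidence value on the validation cohort at which the captured samples reach 95% accuracy, then apply that value unchanged to the test cohort. The function names and that selection rule are assumptions, not the authors' procedure.

```python
def pick_cutoff(confidences, correct, target=0.95):
    """Smallest cutoff whose captured validation samples reach the target accuracy.

    confidences: model confidence per sample.
    correct: 1 if the prediction matched the label, else 0.
    Returns None when no cutoff reaches the target.
    """
    for cutoff in sorted(set(confidences)):
        kept = [c for conf, c in zip(confidences, correct) if conf >= cutoff]
        if kept and sum(kept) / len(kept) >= target:
            return cutoff
    return None

def evaluate(confidences, correct, cutoff):
    """Return (accuracy on captured samples, fraction of samples captured)."""
    kept = [c for conf, c in zip(confidences, correct) if conf >= cutoff]
    captured = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, captured
```

Samples below the cutoff are reported as indeterminate instead of being forced into a category, which is why the accuracy figures in the caption are paired with the fraction of samples captured.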
