Genome Med. 2026 Feb 12;18(1):20.
doi: 10.1186/s13073-025-01589-4.

Inference of SARS-CoV-2 exposure biomarkers using large-scale T-cell repertoire profiling

Elizaveta K Vlasova et al. Genome Med. 2026.

Abstract

Background: The COVID-19 pandemic offers a powerful opportunity to develop methods for monitoring the spread of infectious diseases based on their signatures in population immunity. Adaptive immune receptor repertoire sequencing (AIRR-seq) has become the method of choice for identifying T cell receptor (TCR) biomarkers encoding pathogen specificity and immunological memory. AIRR-seq can detect imprints of past and ongoing infections and facilitate the study of individual responses to SARS-CoV-2, as shown in many recent studies.

Methods: Here, we applied a machine learning approach to two large AIRR-seq datasets comprising more than 1,200 high-quality repertoires from healthy and COVID-19-convalescent donors to infer TCR repertoire features induced by SARS-CoV-2 exposure.

Results: A new batch-effect correction method allowed us to pool data from different batches and to combine the analysis of data obtained using different protocols. Proper standardization of AIRR-seq batches, access to human leukocyte antigen (HLA) typing, and the use of both α- and β-chain sequences of TCRs resulted in a high-quality biomarker database and a robust and highly accurate classifier for COVID-19 exposure.

Conclusions: The resulting classifier is applicable to individual TCR repertoires obtained using various protocols, paving the way to AIRR-seq-based immune status assessment in large cohorts of donors.

Keywords: COVID-19; Immune biomarkers; Immune repertoires; Phenotype prediction; T cell receptor; TCR specificity.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: The study was conducted in accordance with the Declaration of Helsinki and approved by the Local Ethics Committee of the Federal State Budgetary Institution “Centre for Strategic Planning and Management of Biomedical Health Risks”, FMBA of Russia (Protocol No. 2, May 28, 2020). Written informed consent was obtained from all participants. Consent for publication: Not applicable. All data used in the research and made available online were depersonalized. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Experimental and analytical procedure. A Peripheral blood samples were obtained for donors from Cohort I with known COVID-19 status and HLA context. TCRα and β loci were sequenced using a conventional protocol involving multiplex PCR (see Methods). Samples were preprocessed and mapped to identify V/D/J alleles and extract CDR3 regions. A batch-effect correction procedure was applied to normalize V(D)J rearrangement frequencies. B TCRα and β biomarkers were selected for both T cell chains separately using the Fisher exact test. These biomarkers were aggregated by clustering TCR sequences into ‘metaclonotypes’. Annotation of these clusters was performed using a database of TCR sequences with known antigen specificity, association with HLA metadata, and the results of α-β chain pairing analysis. C Classifiers including various sets of features were constructed based on metaclonotype biomarkers described above. These classifiers were trained and evaluated using a leave-one-batch-out cross-validation technique. D We used another previously published cohort (Cohort II) to assess the robustness of our classifier. Batch-effect correction and sample preprocessing techniques were applied to both cohorts, and the classifier was trained entirely on one of the cohorts and validated on the other
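
The biomarker selection step in panel B can be illustrated with a minimal sketch of a Fisher exact test on a 2x2 donor table. This is not the authors' implementation: the one-sided alternative, the function name `fisher_exact_greater`, and the example counts are assumptions made for illustration only, and no multiple-testing correction is shown.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test p-value for a 2x2 contingency table.

    a: convalescent donors carrying the clonotype
    b: convalescent donors lacking it
    c: healthy donors carrying the clonotype
    d: healthy donors lacking it

    Returns P(X >= a) under the hypergeometric null, i.e. the probability
    of seeing at least `a` carriers among convalescent donors by chance.
    """
    n_carriers = a + c
    n_conv = a + b
    n_total = a + b + c + d
    denom = comb(n_total, n_conv)
    upper = min(n_carriers, n_conv)
    tail = sum(
        comb(n_carriers, k) * comb(n_total - n_carriers, n_conv - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Hypothetical counts: clonotype found in 3 of 3 convalescent donors
# and in 0 of 3 healthy donors.
p_value = fisher_exact_greater(3, 0, 0, 3)
print(p_value)
```

A small p-value indicates that the clonotype is enriched among convalescent donors; in practice the test would be repeated for every candidate clonotype and for each chain separately.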
Fig. 2
Batch-effect and genotype-related differences in V gene usage. A A bar plot of total number of subjects and the number of COVID-19-convalescent and healthy cases for nine donor sample batches from Cohort I. Numbers reflect samples that passed the sequencing depth cutoff for both TCRα and β chains. B Heatmap of gene usage, showing the imprint of TRBV28/6–2/4–3 haplotypes in the dataset. Black lines show gene usage Z-scores for each of the three genes. A min–max normalization approach is used to plot the distribution of usages within each gene for all the samples, so the values for each gene lie in the [0, 1] interval. C t-SNE plots of TRBV usage according to the frequency of TRBV28, TRBV6-2 and TRBV4-3 gene usage. D, E Visualization of batch effects using t-SNE. Plots show similarity of TRBV (D) and TRAV (E) gene usage profiles between samples. Samples are colored by batch. F, G Same as D, E, but for TRBV (F) and TRAV (G) gene usage profiles after batch-effect correction. H t-SNE plots of TRBV usage for the batch effect corrected data according to the frequency of TRBV28, TRBV6-2 and TRBV4-3 gene usage
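
The min–max normalization mentioned for panel B can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `min_max_normalize` and the usage values are hypothetical, and mapping a constant gene to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale one gene's usage values across samples to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a gene with identical usage in all samples maps to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical usage fractions for three samples (not data from the study).
usage = {
    "TRBV28": [0.02, 0.05, 0.08],
    "TRBV6-2": [0.00, 0.01, 0.03],
    "TRBV4-3": [0.04, 0.04, 0.04],
}
scaled = {gene: min_max_normalize(vals) for gene, vals in usage.items()}
for gene, vals in scaled.items():
    print(gene, vals)
```

Normalizing each gene separately puts genes with very different baseline usage on a common scale, so a heatmap shows presence or absence patterns instead of being dominated by the most abundant gene.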
Fig. 3
Assessment of COVID-19 TCR α and β biomarkers. A, B Scatter plot showing the number of COVID-19-associated TCRα (A) and TCRβ (B) clonotypes (Y axis) plotted against the total number of clonotypes in each sample (X axis) from the validation batch. All numbers are given in terms of unique rearrangements; matching to associated clonotypes allows up to one amino acid substitution in CDR3. C, D Distribution of COVID-19-associated TCRα (C) and TCRβ (D) clonotype fraction across healthy and convalescent samples. E, F CDR3 sequence similarity graph of COVID-19-associated TCRα (E) and TCRβ (F) clonotypes, where edges (not shown) connect sequences with up to one amino acid substitution. Each connected cluster is highlighted with its own color. Predicted antigen specificity according to VDJdb is shown with arrows and labels. Green labels correspond to SARS-CoV-2 epitopes, red labels to the other viruses. The detailed list of associations is available in Supplementary Data 2 (TCRα) and 3 (TCRβ). G Co-occurrence of CDR3 sequences in α and β chain clonotype clusters. The color corresponds to the correlation coefficient between the usage of α and β cluster clonotypes in the specified clusters. Bold squares show α-β pairings that demonstrate significant association with the same antigen according to VDJdb. H, I CDR3 sequence logos of top four largest clusters in E and F (TCR α and β respectively)
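
The similarity graph in panels E and F connects CDR3 sequences that differ by at most one amino acid substitution, and each connected component forms a cluster. The sketch below shows one way to compute such clusters with a union-find structure. The function names and the example sequences are hypothetical, and the all-pairs comparison is a simplification that would need indexing to scale to large repertoires.

```python
from itertools import combinations


def within_one_substitution(s1, s2):
    """True if two CDR3 strings have equal length and differ at <= 1 position."""
    if len(s1) != len(s2):
        return False
    return sum(x != y for x, y in zip(s1, s2)) <= 1


def cluster_cdr3(seqs):
    """Group CDR3 sequences into connected components of the similarity graph."""
    seqs = sorted(set(seqs))
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the component root, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for s1, s2 in combinations(seqs, 2):
        if within_one_substitution(s1, s2):
            parent[find(s1)] = find(s2)

    clusters = {}
    for s in seqs:
        clusters.setdefault(find(s), []).append(s)
    # Largest clusters first; ties broken alphabetically.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


# Hypothetical CDR3 sequences, chosen only to illustrate chaining.
example = ["CASSLGQYF", "CASSLGAYF", "CASSLAAYF", "CASSPDRGYEQYF"]
for cluster in cluster_cdr3(example):
    print(cluster)
```

Note that the first and third example sequences differ at two positions but still fall into the same cluster, because both are one substitution away from the second sequence.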
Fig. 4
Building a classifier from TCRβ and α biomarkers. Classifiers were trained on all batches except #6, which was left out for validation purposes. A Comparison of ML model F1 scores produced for each feature set. B Scatter plot of the probabilities of labeling samples as COVID-19-positive for TCRα- and β-based classifiers. Histograms on the periphery show the distribution of probabilities for each variable. C Scatter plot of the probabilities of labeling samples as COVID-19-positive for classifiers based on both TCRα + β biomarkers and TCRα + β metaclonotype cluster features. The periphery plots are similar to B. D Target metrics (F1-score, precision, recall) for all evaluated models. E Receiver operating characteristic (ROC) curve for SVM-based classifiers for all sets of features. F Precision-recall curve for SVM-based classifiers using different biomarker sets. G Waterfall plot of the probability of each sample being labeled as COVID-19 positive (> 0) or healthy (< 0). Samples from healthy donors are blue, COVID-19 samples are orange. H Feature importance plot for XGBoost or RandomForest classifier models based on TCRα- and β-based meta-biomarkers and HLA features
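
The target metrics in panel D follow the standard definitions of precision, recall and F1-score. As a minimal sketch, assuming binary labels where 1 marks a COVID-19-convalescent sample and 0 a healthy one, they can be computed from a held-out batch like this. The function name and the label vectors are hypothetical and do not reproduce any result from the study.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for five samples from a held-out batch.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so it only rewards a classifier that keeps both false positives and false negatives low on the held-out batch.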
Fig. 5
Comparative analysis of classifiers between cohorts. A, B Visualization of batch effects pre- (A) and post- (B) correction procedure for TRBV genes. Colors show sample batch. C Homology graph of CDR3 sequences of COVID-19-associated TCRβ clonotypes for Cohort II. The most probable epitope is marked for each cluster associated with SARS-CoV-2. The detailed list of associations is available in Supplementary Data 4. D Homology graph of CDR3 sequences of COVID-19-associated clonotypes for both Cohort I and Cohort II TCRβ biomarkers. Clusters containing clonotypes from both classifiers are colored orange. The detailed list of associations is available in Supplementary Data 5. E ROC-curves for Cohort I- or Cohort II-based classifiers. Lines marked with “meta” sign correspond to classifiers built based on metaclonotypes. AUC, area under curve. F Comparison of metrics for the models described in E
Fig. 6
Analysis of DRB1*16, DQB1*05 and COVID-19-associated TCRβ biomarkers. A Consensus motif for the 13 TCRβ biomarkers associated with the DRB1*16, DQB1*05 HLA alleles and COVID-19 history. B The number of TCRβ motif biomarker reads versus the number of unique TCR sequences in healthy and COVID-19 cohorts. C Heatmap representing the linkage of HLAs of interest for all the COVID-19 patients with at least two motif reads present in a sample
Fig. 7
Comparison of classifiers built using biomarkers derived from random samples and A*02+ samples. A Homology graph of CDR3 sequences of COVID-19-associated TCRβ clonotypes derived using only A*02+ samples. For each cluster associated with SARS-CoV-2, the most probable epitope is marked. Green labels show SARS-CoV-2 epitopes, red labels belong to other viruses. Detailed information on cluster associations is available in Supplementary Data 6. B Same as A, but for TCRα biomarkers. Only matches for clusters of size > 9 are shown. For more information see Supplementary Data 6. C Comparison of model performance for different datasets. Random alpha and beta datasets correspond to classifiers built using biomarkers derived from 545 and 521 random TCRα and β samples, respectively, in Cohort I. A02 datasets correspond to classifiers built for A*02+ biomarkers and samples. D Precision-recall curve for SVM classifiers built for the datasets in C. Classifiers using single HLA allele samples and single HLA allele SARS-CoV-2-associated clonotypes provided better classification
