Review
Immunol Rev. 2018 Jul;284(1):167-179.
doi: 10.1111/imr.12665.

Predicting the spectrum of TCR repertoire sharing with a data-driven model of recombination

Yuval Elhanati et al. Immunol Rev. 2018 Jul.

Abstract

Despite the extreme diversity of T-cell repertoires, many identical T-cell receptor (TCR) sequences are found in a large number of individual mice and humans. These widely shared sequences, often referred to as "public," have been suggested to be over-represented due to their potential immune functionality or their ease of generation by V(D)J recombination. Here, we show that even for large cohorts, the observed degree of sharing of TCR sequences between individuals is well predicted by a model accounting for the known quantitative statistical biases in the generation process, together with a simple model of thymic selection. Whether a sequence is shared by many individuals is predicted to depend on the number of queried individuals and the sampling depth, as well as on the sequence itself, in agreement with the data. We introduce the degree of publicness conditional on the queried cohort size and the size of the sampled repertoires. Based on these observations, we propose a public/private sequence classifier, "PUBLIC" (Public Universal Binary Likelihood Inference Classifier), based on the generation probability, which performs very well even for small cohort sizes.

Keywords: TCR repertoires; TCR sharing; inference; probability of generation; public sequences.

Figures

Figure 1
Cartoon representation of the pipeline for computing the distribution of shared sequences between samples. (A) Sharing between samples is analyzed by marking CDR3s that recur across the K samples. (B) The overlapping sequences are counted and binned to obtain the number of CDR3s that are shared m times. (C) Distribution of the number of sequences that are shared m times among the K individuals.
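The counting step in panels A to C can be illustrated with a minimal Python sketch. It tallies, for a list of samples, how many distinct CDR3s occur in exactly m of them. The function name `sharing_distribution` and the toy CDR3 strings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def sharing_distribution(samples):
    """Return {m: number of distinct CDR3s present in exactly m samples}."""
    sharing_number = Counter()
    for sample in samples:
        # Deduplicate within a sample so each individual counts at most once.
        for cdr3 in set(sample):
            sharing_number[cdr3] += 1
    # Bin the CDR3s by their sharing number m, sorted by m for readability.
    return dict(sorted(Counter(sharing_number.values()).items()))


# Toy data: K = 3 hypothetical samples of CDR3 amino acid strings.
samples = [
    ["CASSLGQYF", "CASSPTGYF", "CASSLGQYF"],
    ["CASSLGQYF", "CASSQETQYF"],
    ["CASSLGQYF", "CASSPTGYF"],
]
print(sharing_distribution(samples))  # {1: 1, 2: 1, 3: 1}
```

Here one CDR3 is private (m = 1), one is shared by two samples, and one by all three.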
Figure 2
Distribution of sharing numbers. (A) Distribution of the number of sequences that are shared between m individuals (m = sharing number) for 14 mice. Data points (blue crosses) are compared to analytical model predictions (see Section 7.3.1) with selection (red curves) and without selection (green curve), and to simulations (see Section 7.2) based on the generation model with selection (red crosses) and without selection (green crosses). The model without selection underestimates sharing; adding selection improves the prediction. The model predictions derived from analytical calculations and stochastic simulations agree well. The selection factor q, defined as the probability that a CDR3 passes thymic selection, is inferred by least-squares regression from the relation between the number of unique CDR3 amino acid sequences and the number of unique nucleotide sequence reads (inset, see Section 7). (B) Distribution of sharing numbers in a cohort of 658 humans. The model prediction with selection (simulation: black crosses, analytics: red line) agrees well with the data (blue crosses). The selection factor is obtained as for mice (inset).
Figure 3
The sharing number depends on the sampling depth and cohort size. Downsampling the number of sequences in all individuals affects sharing and decreases the observed probability of being public. (A) The number of sequences for each sharing number decreases when the repertoires of all individuals are downsampled by a factor of 0.5 (blue points) compared to the original sample (red points), as predicted by the model (red and blue lines). The normalized distribution of sharing numbers (inset) shows that downsampling affects larger sharing numbers more. (B) Model prediction of the fraction of sequences that are entirely private (i.e., appearing in just one individual), as a function of the downsampling fraction and cohort size. Larger samples and cohorts result in fewer private sequences.
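The empirical side of this analysis can be sketched as follows. This is a rough illustration, not the authors' procedure: it downsamples unique CDR3s uniformly at random (the paper may downsample reads instead), and the helper names `downsample` and `private_fraction` and the toy sequences are hypothetical.

```python
import random
from collections import Counter


def downsample(samples, fraction, seed=0):
    """Keep a random subset of the unique CDR3s of each sample (assumed scheme)."""
    rng = random.Random(seed)
    result = []
    for sample in samples:
        unique = sorted(set(sample))  # sorted for reproducibility with a fixed seed
        keep = int(fraction * len(unique))
        result.append(rng.sample(unique, keep))
    return result


def private_fraction(samples):
    """Fraction of distinct CDR3s that appear in exactly one sample."""
    counts = Counter(cdr3 for sample in samples for cdr3 in set(sample))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Toy cohort of three hypothetical repertoires.
samples = [
    ["CASSA", "CASSB", "CASSC", "CASSD"],
    ["CASSA", "CASSB", "CASSE", "CASSF"],
    ["CASSA", "CASSG", "CASSH", "CASSI"],
]
print(private_fraction(samples))                   # full depth: 7 of 9 are private
print(private_fraction(downsample(samples, 0.5)))  # depends on the random draw
```

Repeating the second call over many seeds and cohort sizes would give an empirical counterpart to the model curve in panel B.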
Figure 4
(A) Number of unique CDR3 amino acid sequences in the pooled repertoire of n individuals, as a function of n. This number does not depend strongly on the order in which individuals are added to the group (black error bars, obtained by measuring variations across 30 random orderings). The theoretical prediction (red line, see Section 7.3.4) agrees very well with the data. The model prediction was obtained using the mean sample size of the pooled repertoire across 30 random orderings. Each new individual adds ∼200 000 new CDR3 sequences. (B) Theoretical extrapolation to very large cohorts (red line). This model prediction is based on an average sample size. The same prediction can be made for the full repertoires contained in the human body (with 10^11 unique recombination events), which yields much larger numbers of unique CDR3s (black line). (C) Model prediction for the fraction of sequences in each individual that are truly "public," i.e., have a generation probability larger than 1/N, where N is the number of unique TCRs in each individual (repertoire size). The red and blue stripes mark the possible range of repertoire sizes in mice and humans, according to current knowledge.
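The criterion in panel C, a generation probability larger than 1/N, can be written down directly. The sketch below assumes the generation probabilities are already available as a list of floats. The function name `fraction_public` and the example values are made up for illustration.

```python
def fraction_public(pgens, repertoire_size):
    """Fraction of sequences whose generation probability exceeds 1/N."""
    if not pgens:
        raise ValueError("need at least one generation probability")
    threshold = 1.0 / repertoire_size
    n_public = sum(1 for p in pgens if p > threshold)
    return n_public / len(pgens)


# Hypothetical generation probabilities for four sequences.
pgens = [1e-6, 1e-9, 1e-12, 1e-7]
print(fraction_public(pgens, 1e8))   # 0.5
print(fraction_public(pgens, 1e10))  # 0.75
```

A larger repertoire size N lowers the threshold 1/N, so more sequences qualify as public under this definition.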
Figure 5
Distributions of the logarithm of the generation probability for different minimal sharing numbers, for (A) mice and (B) humans. For larger sharing numbers, the distribution shifts toward higher probabilities and becomes narrower. This shift enables the characterization of the sharing number, or the degree of publicness, using the generation probability. The model captures the right trend of the sharing numbers, despite predicting much narrower distributions.
Figure 6
Cartoon representation of the pipeline for the PUBLIC classifier. (A) To each CDR3 sequence in the dataset we associate its generation probability (P_gen), which PUBLIC uses to predict the empirical sharing number. (B) The P_gen distributions of shared sequences depend on the sharing number m. We pick a classifier threshold value of P_gen, θ, that separates public from private sequences for this value of the sharing number m. The areas of the histograms that fall on the wrong side of the threshold are defined as the false positive and false negative rates. (C) For a given choice of the minimal sharing number m, we plot the true and false positive rates as a function of the classifier threshold θ to obtain a receiver operating characteristic.
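The thresholding step in panels B and C can be sketched as below. This illustrates a single-threshold classifier on P_gen and is not the PUBLIC implementation. The function name `rates_at_threshold`, the convention that P_gen ≥ θ predicts "public", and the toy numbers are assumptions.

```python
def rates_at_threshold(pgens, sharing_numbers, m, theta):
    """Return (true positive rate, false positive rate) at threshold theta.

    A sequence is labeled public if its sharing number is at least m, and
    predicted public if its generation probability is at least theta.
    """
    tp = fp = fn = tn = 0
    for p, s in zip(pgens, sharing_numbers):
        is_public = s >= m
        predicted_public = p >= theta
        if is_public and predicted_public:
            tp += 1
        elif is_public:
            fn += 1
        elif predicted_public:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical generation probabilities and observed sharing numbers.
pgens = [1e-6, 1e-7, 1e-9, 1e-10, 1e-8]
sharing_numbers = [5, 3, 1, 1, 2]
tpr, fpr = rates_at_threshold(pgens, sharing_numbers, m=3, theta=1e-8)
print(tpr, fpr)  # TPR = 1.0, FPR = 1/3
```

Sweeping `theta` over the range of observed P_gen values and plotting the resulting (FPR, TPR) pairs traces out the ROC curve for the chosen m.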
Figure 7
Performance of the PUBLIC classifier. Receiver operating characteristic (ROC) curves for (A) mice and (B) humans for different minimal sharing numbers m. Inset: the area under the ROC curve (AUROC) describes the probability of classifying a given sequence as public or private. Higher AUROC values correspond to a better classifier. The AUROC score increases with the minimal sharing number m (inset), meaning that a more restrictive definition of publicness gives better classifiers.
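One standard way to compute an AUROC without tracing the full curve is the pairwise (Mann-Whitney) form: the fraction of public/private pairs in which the public sequence has the higher score, with ties counted as one half. The sketch below uses that form with P_gen as the score. It is not taken from the paper, and the function name `auroc` and the toy values are hypothetical.

```python
def auroc(pgens, sharing_numbers, m):
    """AUROC of P_gen as a score for 'sharing number >= m' (pairwise form)."""
    public = [p for p, s in zip(pgens, sharing_numbers) if s >= m]
    private = [p for p, s in zip(pgens, sharing_numbers) if s < m]
    if not public or not private:
        raise ValueError("need at least one public and one private sequence")
    wins = 0.0
    for a in public:
        for b in private:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half
    return wins / (len(public) * len(private))


# Hypothetical generation probabilities and observed sharing numbers.
pgens = [1e-6, 1e-9, 1e-7, 1e-8]
sharing_numbers = [4, 3, 1, 1]
print(auroc(pgens, sharing_numbers, m=3))  # 0.5
print(auroc(pgens, sharing_numbers, m=4))  # 1.0
```

In this toy example the stricter definition of publicness (m = 4) separates perfectly, while m = 3 does no better than chance. The double loop is quadratic in the number of sequences, so a rank-based implementation would be preferable for real repertoires.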
Figure 8
Distribution of sharing numbers in a cohort of 30 bladder cancer patients. The distribution is compared to a sub-cohort of 30 healthy individuals downsampled to have the same sample sizes as the cancer samples. The distributions are the same in healthy individuals and bladder cancer patients, indicating that there are no common significantly over-represented TCRs in the blood repertoire of cancer patients.

