BMC Bioinformatics. 2021 Mar 25;22(1):159.
doi: 10.1186/s12859-021-04087-7.

Clustering based approach for population level identification of condition-associated T-cell receptor β-chain CDR3 sequences

Dawit A Yohannes et al. BMC Bioinformatics. 2021.

Abstract

Background: Deep immune receptor sequencing (RepSeq) provides unprecedented opportunities for identifying and studying condition-associated T-cell clonotypes, represented by T-cell receptor (TCR) CDR3 sequences. However, due to the immense diversity of the immune repertoire, identification of condition-relevant TCR CDR3s from total repertoires has mostly been limited either to "public" CDR3 sequences or to comparisons of CDR3 frequencies observed in a single individual. A methodology for identifying condition-associated TCR CDR3s by direct population-level comparison of RepSeq samples is currently lacking.

Results: We present a method for direct population-level comparison of RepSeq samples using immune repertoire sub-units (or sub-repertoires) that are shared across individuals. The method first performs unsupervised clustering of CDR3s within each sample. It then finds matching clusters across samples, called immune sub-repertoires, and performs statistical differential abundance testing at the level of the identified sub-repertoires. Finally, it ranks CDR3s in differentially abundant sub-repertoires for relevance to the condition. We applied the method to total TCR CDR3β RepSeq datasets of celiac disease patients, as well as to public datasets of yellow fever vaccination. The method successfully identified celiac disease-associated CDR3β sequences, as evidenced by considerable agreement of TRBV-gene and positional amino acid usage patterns in the detected CDR3β sequences with previously known CDR3βs specific to gluten in celiac disease. It also recovered significantly higher numbers of previously known CDR3β sequences relevant to each condition than would be expected by chance.
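As an illustration of the first two steps, the sketch below turns nucleotide CDR3s into 4-mer frequency vectors (the figure captions mention nt 4-mer feature vectors) and averages the vectors of one cluster into a centroid that could then be compared across samples. It is a minimal sketch, not the authors' implementation: the function names `kmer_vector` and `centroid` are hypothetical, the cluster membership is assumed to be given, and the normalization to relative frequencies is an assumption.

```python
from collections import Counter
from itertools import product

K = 4
# All 256 possible nucleotide 4-mers, in a fixed order (assumed alphabet: A, C, G, T).
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_vector(cdr3_nt):
    """Relative frequency of each nt 4-mer in one CDR3 (hypothetical helper).

    Windows containing characters outside A/C/G/T still count towards the
    total but have no slot in the vector. Sequences shorter than K give
    an all-zero vector.
    """
    counts = Counter(cdr3_nt[i:i + K] for i in range(len(cdr3_nt) - K + 1))
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in KMERS]


def centroid(vectors):
    """Element-wise mean of the feature vectors belonging to one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Example: two CDR3s assumed to belong to the same within-sample cluster.
cluster = ["TGTGCCAGCAGCTTAGCG", "TGTGCCAGCAGTTTAGCG"]
cluster_centroid = centroid([kmer_vector(s) for s in cluster])
```

Centroids computed this way for every cluster in every sample are the kind of objects that a second round of hierarchical or k-means clustering could group into sub-repertoires.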

Conclusion: We conclude that immune sub-repertoires of similar immuno-genomic features shared across unrelated individuals can serve as viable units of immune repertoire comparison and as a proxy for identifying condition-associated CDR3s.

Keywords: Antigen-specific TCR identification; Celiac disease associated TCR clonotypes; Computational antigen-specificity identification; Immune repertoire analysis; Immuno-informatics; TCR clustering; TCR differential abundance analysis; TCR repertoire analysis.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic of the clustering-based differential abundance detection methodology for CDR3 repertoire HTS datasets
Fig. 2
CDR3 sub-repertoire matching in samples of two unrelated individuals. a Hierarchical clustering of CDR3 cluster centroids from samples CD005 (black) and CD006 (green) from our CD PBMC dataset identified 32 sub-repertoires, of which 30 (94%) had cluster representatives from both samples. Branch colors indicate sub-repertoires. Only 2 of the 32 (6%) sub-repertoires (shown as black dots) are homogeneous, containing cluster centroids from only one sample. b V-, J-, VJ- and VDJ-gene usage frequency was compared between clusters coming from the two samples; the percentage of sub-repertoires with significantly different gene usage (p value below 0.05, chi-square test of independence) is shown. c The number of different possible 4-mers that start at each position is estimated using Shannon’s entropy for 42-nt-long CDR3s; the highest entropy is observed at positions where CDR3s have the N1 and N2 regions. Similar results were obtained in all samples. 4-mers that are not completely within the N1 or N2 region but either end or start in these regions are counted towards them. d The top 20 4-mers with the highest variance in frequency across the 5000 subsampled CDR3s within a single sample (CD005) are shown. e The frequency of where (in V, N1, D, N2, J) the top 20 most variable 4-mers are found in the CDR3s is shown. f The classification importance of k-mers and genes in distinguishing 4-mer based clusters within a single sample (CD005) is shown. g The frequency of where (in V, N1, D, N2, J) the top 20 most discriminative 4-mers (ordered left to right) are found in the CD005 repertoire is shown
Fig. 3
Sub-repertoire detection across many samples. CDR3 clustering and sub-repertoire detection, using both hierarchical clustering (hc) and k-means (km) clustering of the CDR3 cluster centroids, were performed for all 8 CD PBMC samples. a Proportions of sub-repertoires containing CDR3 cluster centroids from only n samples; dots are the estimates from each of 10 analyses of subsampled repertoires with sequencing depths of 1 to 10 thousand unique nucleotide CDR3s per sample. b The cumulative proportion of sub-repertoires containing representative clusters from n samples or more is shown; the cumulative proportion at each n is computed as the mean proportion of n represented samples from the 10 resample analyses
Fig. 4
Differentially abundant CDR3β sequences identified by the method. The top 20 significantly differentially enriched CDR3β sequences during gluten exposure are shown for a CD PBMC and b CD Gut datasets when using nt 4-mer feature vectors. The results obtained using aa 3-mer feature vectors are shown in c for CD PBMC and d for CD Gut datasets. Abundance is shown on a log10 scale, from low abundance (white) to higher abundance (red). GFD-treated samples are marked by light blue bars and gluten-exposed samples by orange bars at the top
Fig. 5
Characteristics of the differentially abundant CDR3β sequences in CD PBMC and CD Gut. The differentially enriched CDR3β sequences had biased usage of TRBV genes that are known from previous studies to be over-represented in gluten-reactive CDR3β sequences, such as TRBV07-02 and TRBV09-01 from CD PBMC (a), and TRBV06-01 from CD Gut (b) (observed frequencies are shown in red; mean frequencies from randomly generated sets of CDR3s are shown in blue). Significantly over-used amino acids at each position are shown for the enriched CDR3β sequences that use TRBV genes detected to be over-used from CD PBMC (c) and CD Gut (d); amino acids are colored according to their properties. The information content of significantly over-used amino acids at each position is shown in bits on the y-axis. TRBV and per-position amino acid over-usage is assessed by comparing the observed frequencies in the set of differentially enriched CDR3s to those obtained by chance in 100 randomly sampled CDR3 sets of the same size, TRBV gene and CDR3 length, with p < 0.05 considered significant (gene names indicate TRBVgene::CDR3 length::number of CDR3s in the enriched list with the Vgene and CDR3 length). The results from using nt 4-mer feature vectors are shown
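
The per-position 4-mer diversity described in panel c of the Fig. 2 caption can be illustrated with a short calculation. The following is a minimal sketch, assuming equal-length nucleotide CDR3s and entropy measured in bits (log base 2); the function name `positional_kmer_entropy` and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def positional_kmer_entropy(cdr3s, k=4):
    """Shannon entropy (bits) of the k-mers starting at each position.

    Assumes all sequences have the same length, as in the caption's
    42-nt CDR3 example. Returns one entropy value per start position.
    """
    length = len(cdr3s[0])
    if any(len(s) != length for s in cdr3s):
        raise ValueError("all CDR3 sequences must have the same length")
    entropies = []
    for pos in range(length - k + 1):
        # Count how often each k-mer starts at this position across sequences.
        counts = Counter(s[pos:pos + k] for s in cdr3s)
        n = sum(counts.values())
        entropies.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    return entropies


# Toy example: identical prefix, variable fifth nucleotide.
toy = ["AAAAAC", "AAAAGC", "AAAATC", "AAAACC"]
profile = positional_kmer_entropy(toy)
```

In this toy input the first start position sees a single 4-mer and has zero entropy, while later positions overlap the variable nucleotide and reach the two-bit maximum for four equally frequent 4-mers, which mirrors the caption's observation that entropy peaks where junctional diversity is concentrated.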