Review

Front Immunol. 2021 Jul 22:12:680687.

doi: 10.3389/fimmu.2021.680687. eCollection 2021.

Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing

Miri Ostrovsky-Berman¹ ², Boaz Frankel¹ ², Pazit Polak¹ ², Gur Yaari¹ ²

Affiliations

¹ Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel.
² Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel.

PMID: 34367141
PMCID: PMC8340020
DOI: 10.3389/fimmu.2021.680687

Miri Ostrovsky-Berman et al. Front Immunol. 2021.

Abstract

The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). These huge immune repertoires in each individual present investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid textual sequences in a vector space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec first on amino acid 3-grams, then on longer BCR sequences, and finally on entire repertoires. Our work demonstrates that Immune2vec is a reliable low-dimensional representation that preserves relevant information in immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec together with machine learning approaches to patient data shows how distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.

Keywords: BCR repertoire; NLP; biological sequence embedding; computational immunology; word2vec.
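The abstract describes treating amino acid 3-grams as the "words" of an NLP embedding model. The snippet below is a minimal sketch of that tokenization step only. It assumes a ProtVec-style scheme in which each sequence yields three "sentences" of non-overlapping 3-grams, one per reading-frame offset; the abstract does not state the exact scheme. The function name `three_grams` and the example CDR3 string are hypothetical.

```python
from collections import Counter


def three_grams(seq, n=3):
    """Split a sequence into n 'sentences' of non-overlapping n-grams,
    one sentence per reading-frame offset (assumed scheme, see lead-in)."""
    sentences = []
    for offset in range(n):
        # Step through the sequence n characters at a time from this offset,
        # keeping only complete n-grams.
        sentence = [seq[i:i + n] for i in range(offset, len(seq) - n + 1, n)]
        sentences.append(sentence)
    return sentences


cdr3 = "CARDYYGSSYW"  # hypothetical CDR3 amino acid sequence
sentences = three_grams(cdr3)

# Vocabulary of 3-gram "words" seen across all reading frames.
vocab = Counter(token for sentence in sentences for token in sentence)

for sentence in sentences:
    print(" ".join(sentence))
```

Token lists of this form are what a word2vec implementation (for example, gensim's `Word2Vec`, named here only as one possible choice) would take as its training corpus.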

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
The research structure and workflow. **(A)** The analogy between natural language and the immunological language, on which we base our research. **(B)** The steps of Immune2vec model generation, described in detail in the *Methods* section. **(C)** Word-level implementation of Immune2vec on amino acid 3-grams. **(D)** Sequence-level classification on CDR3 embedded vectors, classifying them according to the IGHV family of the adjacent IGHV sequence. **(E)** Repertoire-level classification based on a nearest neighbors approach presented here. Created using the Weblogo tool (15).

**Figure 2**
Workflows applied at the different levels. **(A)** Training Immune2Vec. **(B)** Applying Immune2Vec for sequence-level classification. **(C)** Applying Immune2Vec to repertoire-level representation. **(D)** CDR3 sequence logo for 17 amino acids. Created using the Weblogo tool (15).

**Figure 3**
3-gram embedding analysis using several tools. **(A)** 3-gram embeddings divided into clusters using k-means clustering. **(B)** The same embedding, where each point is colored according to its basic property value. **(C)** A box plot describing the distribution of basic property distances among all the points vs. the distance distribution within each cluster. Comparing distances between all data to the distances within clusters using the Mann-Whitney test yielded a p value < 10⁻²⁰. **(D)** Moran’s index spatial auto-correlation analysis of properties in the embedding space.
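Panel (D) of Figure 3 refers to Moran's index, which measures whether points that are close in a space also have similar property values. The snippet below is a minimal sketch of the standard statistic, I = (N / W) · Σᵢⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where W is the sum of all weights. It assumes binary k-nearest-neighbor weights in the embedding space; the weighting scheme is not stated in this record. The function name `morans_i` and the toy points are hypothetical.

```python
import math


def morans_i(points, values, k=1):
    """Moran's I with binary k-nearest-neighbor weights (assumed weighting)."""
    n = len(points)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    cross = 0.0       # sum over i, j of w_ij * dev_i * dev_j
    weight_sum = 0.0  # W, the sum of all weights
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in ranked[:k]:
            cross += dev[i] * dev[j]
            weight_sum += 1.0

    variance_sum = sum(d * d for d in dev)
    return (n / weight_sum) * (cross / variance_sum)


# Toy 2-D "embedding": two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

clustered = morans_i(points, [1.0, 1.0, 5.0, 5.0])    # neighbors share values
alternating = morans_i(points, [1.0, 5.0, 1.0, 5.0])  # neighbors differ

print(clustered, alternating)
```

With these weights, values near +1 indicate that neighboring points carry similar property values, and values near −1 indicate that neighbors carry dissimilar values.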

**Figure 4**
**(A)** A description of the trimmed CDR3 sequences from the Ig heavy chain germline locus, used for the research. **(B)** F1-score of IGHV family classification based on CDR3 sequences using decision tree and kNN methods.

**Figure 5**
Accuracy of the SC-CI BCR and TCR repertoires classification. For validation purposes, the model was trained and applied on randomly labeled data.

**Figure 6**
**(A)** Model prediction total accuracy using different data sets as corpora for creating the embedding model. **(B)** Number of sequences in each corpus. DS6 was generated by randomly sampling sequences from DS5.

