Front Immunol. 2024 May 28:15:1407470.

doi: 10.3389/fimmu.2024.1407470. eCollection 2024.

Interpretable deep learning reveals the role of an E-box motif in suppressing somatic hypermutation of AGCT motifs within human immunoglobulin variable regions

Abhik Tambe¹, Thomas MacCarthy², Rushad Pavri³ ⁴

Affiliations

¹ Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY, United States.
² Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States.
³ Research Institute of Molecular Pathology (IMP), Vienna, Austria.
⁴ Peter Gorer Department of Immunobiology, School of Immunology & Microbial Sciences, King's College London, London, United Kingdom.

PMID: 38863710
PMCID: PMC11165027
DOI: 10.3389/fimmu.2024.1407470

Abhik Tambe et al. Front Immunol. 2024.

Abstract

Introduction: Somatic hypermutation (SHM) of immunoglobulin variable (V) regions by activation-induced deaminase (AID) is essential for robust, long-term humoral immunity against pathogen and vaccine antigens. AID mutates cytosines preferentially within WRCH motifs (where W = A or T, R = A or G, and H = A, C or T). However, it has been consistently observed that the mutability of WRCH motifs varies substantially, with large variations in mutation frequency even between multiple occurrences of the same motif within a single V region. This has led to the notion that the immediate sequence context of WRCH motifs contributes to mutability. Recent studies have highlighted the potential role of local DNA sequence features in promoting mutagenesis of AGCT, a commonly mutated WRCH motif. Intriguingly, AGCT motifs closer to the 5' ends of V regions, within the framework 1 (FW1) sub-region, mutate less frequently, suggesting an SHM-suppressing sequence context.
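The WRCH definition given above maps directly onto a character-class pattern. The following is a minimal sketch, not taken from the paper, that lists WRCH occurrences on the forward strand of a sequence; the function name `find_wrch` and the toy sequence are assumptions made for illustration.

```python
import re

# WRCH as defined in the abstract: W = A or T, R = A or G, H = A, C or T.
# The lookahead lets overlapping occurrences be reported as well.
WRCH = re.compile(r"(?=([AT][AG]C[ACT]))")

def find_wrch(seq):
    """Return (0-based start, motif) for each forward-strand WRCH occurrence."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in WRCH.finditer(seq)]

# Hypothetical toy sequence, not a real V region.
toy = "CAGCTGGTAGCTAAACA"
hits = find_wrch(toy)
agct_starts = [pos for pos, motif in hits if motif == "AGCT"]
print(hits)
print(agct_starts)
```

Only the forward strand is scanned here; AGCT is its own reverse complement, so for that motif a single-strand scan finds every site.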

Methods: Here, we systematically examined the basis of AGCT positional biases in human SHM datasets with DeepSHM, a machine-learning model designed to predict SHM patterns. This was combined with integrated gradients, an interpretability method, to interrogate the basis of DeepSHM predictions.
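To make the interpretability step concrete, here is a minimal sketch of integrated gradients on a toy logistic model. It is not DeepSHM and not the authors' code: the model, its weights, the all-zero baseline, and the number of integration steps are all assumptions chosen for illustration. The attribution for each input feature is the input-minus-baseline difference multiplied by the gradient averaged along the straight path from baseline to input.

```python
import math

def model(x, weights, bias):
    """Toy stand-in for a trained predictor: logistic regression."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x, weights, bias):
    """Analytic gradient of the toy model with respect to its inputs."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Midpoint Riemann approximation of the path integral of gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grads = gradient(point, weights, bias)
        total = [t + g for t, g in zip(total, grads)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]

# Hypothetical inputs: three features, one of them absent (0.0).
x = [1.0, 0.0, 1.0]
baseline = [0.0, 0.0, 0.0]
weights, bias = [2.0, -1.0, -3.0], 0.5
attributions = integrated_gradients(x, baseline, weights, bias)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline. A negative attribution marks a feature that pushes the prediction down relative to the baseline.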

Results: DeepSHM predicted the observed positional differences in mutation frequencies at AGCT motifs with high accuracy. For the conserved, lowly mutating AGCT motifs in FW1, integrated gradients predicted a large negative contribution of 5'C and 3'G flanking residues, suggesting that a CAGCTG context in this location was suppressive for SHM. CAGCTG is the recognition motif for E-box transcription factors, including E2A, which has been implicated in SHM. Indeed, we found a strong, inverse relationship between E-box motif fidelity and mutation frequency. Moreover, E2A was found to associate with the V region locale in two human B cell lines. Finally, analysis of human SHM datasets revealed that naturally occurring mutations in the 3'G flanking residues, which effectively ablate the E-box motif, were associated with a significantly increased rate of AGCT mutation.
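The flank context described in the Results can be checked per AGCT site with a few lines of code. This is an illustrative sketch only; the function name `agct_flank_contexts` and the toy sequence are assumptions and do not come from the paper. A site is flagged when its 5' neighbor is C and its 3' neighbor is G, which gives the CAGCTG E-box context.

```python
def agct_flank_contexts(seq):
    """Return (start, 5' base, 3' base, is_CAGCTG) for each AGCT with both flanks."""
    seq = seq.upper()
    contexts = []
    start = seq.find("AGCT")
    while start != -1:
        # Skip sites at the sequence edges, where a flank is missing.
        if start >= 1 and start + 4 < len(seq):
            five, three = seq[start - 1], seq[start + 4]
            contexts.append((start, five, three, five == "C" and three == "G"))
        start = seq.find("AGCT", start + 1)
    return contexts

# Hypothetical toy sequence, not a real V region.
for site in agct_flank_contexts("CAGCTGGTAGCTAAACA"):
    print(site)
```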

Discussion: Our results suggest an antagonistic relationship between mutation frequency and the binding of E-box factors like E2A at specific AGCT motif contexts and, therefore, highlight a new, suppressive mechanism regulating local SHM patterns in human V regions.

Keywords: E-box transcription factors; E2A; activation induced deaminase (AID); deep learning; immunoglobulin heavy chain; integrated gradients; somatic hypermutation (SHM).

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
DeepSHM model performance on AGCT 15-mers. **(A, B)** Correlation scatter plots between observed and DeepSHM-predicted mutation frequencies for 15-mers centered on the G of AGCT **(A)** or the C of AGCT **(B)**. Each dot represents a 15-mer, the black line is the x=y diagonal and the red line indicates the best fit with intercept and coefficient computed using a linear regression. The r value is the Pearson correlation coefficient, and the P value is computed using a Wald test. **(C, D)** Violin plots showing the distributions of observed (white) and DeepSHM-predicted (blue) mutation frequencies for AGCT 15-mers within CDR and FW regions centered on the G of AGCT **(C)** and the C of AGCT **(D)**. The white dots represent the median, the black boxes show the interquartile range, and the whiskers extend to points that fall within 1.5 times the interquartile range.
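For readers who want to reproduce this kind of observed-versus-predicted comparison, the sketch below computes the Pearson correlation coefficient and the least-squares line in plain Python. It is an assumption-laden illustration, not the authors' analysis code, and the data are made up. The P value is left out; SciPy's `scipy.stats.linregress` is one tool that reports a slope P value alongside r.

```python
import math

def pearson_and_fit(x, y):
    """Return (Pearson r, slope, intercept) of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# Hypothetical observed and predicted mutation frequencies.
observed = [0.01, 0.02, 0.04, 0.08]
predicted = [0.012, 0.018, 0.045, 0.070]
print(pearson_and_fit(observed, predicted))
```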

**Figure 2**
Integrated gradients scores for each nucleotide in AGCT 15-mers across V subregions shown as boxplots. The left column consists of sequences with a central G **(A–E)** and the right column consists of sequences with a central C **(F–J)**. Rows correspond to the indicated V subregion and the sequence logo below each boxplot corresponds to the nucleotide frequency at each position. The boxes represent the interquartile range of the distribution of integrated gradient scores for each nucleotide, with the black line through the box showing the median score and the whiskers representing 1.5 times the interquartile range. Outlier points are shown as dots. Note that nucleotides in the central AGCT hotspot (boxed) tend to have the largest scores in the 15-mer and that the 5’ and 3’ flanking nucleotides for the FW1 AGCT **(F)** have the lowest scores in the 15-mer.

**Figure 3**
Integrated gradients scores for the 5’ and 3’ flanking nucleotides of AGCT motifs across all human V regions shown as a scatter plot. Each dot/cross corresponds to a 15-mer. 15-mers in which the central AGCTs are flanked by 5’-C and 3’-G (CAGCTG motifs) are indicated with a cross (x).

**Figure 4**
Swarm plot depicting mutation frequencies for the central G and C residues within AGCT 15-mers categorized based on the identity of the 5’ and 3’ nucleotides flanking AGCT. The color coding highlights the location of the 15-mer in CDRs or FWs. Each AGCT is represented by two dots, one for the central C and one for the central G. AGCT motifs flanked by 5’-C and 3’-G, corresponding to the CAGCTG motif (first category on the left), have a significantly lower mutation frequency (P<10^–30) than any other pair as computed by a Mann-Whitney U test.
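The Mann-Whitney U statistic behind such a comparison can be written from its pairwise definition. The sketch below is illustrative only, with made-up frequencies; it computes U for the first sample (ties count one half) and leaves the P value to a statistics library such as SciPy's `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(a, b):
    """U for sample a: count of pairs with a_i > b_j, plus 0.5 per tie."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical mutation frequencies for two flank categories.
cagctg = [0.01, 0.02, 0.03]
other = [0.10, 0.20, 0.02]
u_cagctg = mann_whitney_u(cagctg, other)
u_other = mann_whitney_u(other, cagctg)
print(u_cagctg, u_other)
```

The two U values always sum to the product of the sample sizes, so a small U for one group means its values mostly rank below the other group's.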

**Figure 5**
Scatter plots depicting the correlation between observed mutation frequencies and E2A MOODS scores for AGCT 15-mers. **(A, B)** Analysis of 15-mers centered at the central G **(A)** or central C **(B)**. Each point represents a 15-mer and is colored by IMGT subregion, with CAGCTG 15-mers indicated with a cross (x). The red lines indicate the best fit with intercept and coefficient computed using a linear regression. The r value is the Pearson correlation coefficient, and the P value is computed using a Wald test. The three tiers (Tier 1–3) that the MOODS scores fall into are labeled.
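Motif scanners of this kind typically score a candidate site as a sum of per-position log-odds values from a position weight matrix. The sketch below illustrates that idea only. The count matrix is hypothetical, not the E2A matrix used in the study, and the function does not reproduce the MOODS software or its options.

```python
import math

# Hypothetical position count matrix for a 6-bp E-box-like site.
# Each dictionary is one position; every base has a nonzero count.
COUNTS = [
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 2, "C": 4, "G": 12, "T": 2},
    {"A": 2, "C": 12, "G": 4, "T": 2},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 1, "C": 1, "G": 17, "T": 1},
]

def log_odds_score(site, counts=COUNTS, background=0.25):
    """Sum of log2(observed frequency / background) over the positions of a site."""
    if len(site) != len(counts):
        raise ValueError("site length must match the matrix length")
    score = 0.0
    for base, column in zip(site.upper(), counts):
        frequency = column[base] / sum(column.values())
        score += math.log2(frequency / background)
    return score

for site in ("CAGCTG", "CAGCTA", "CATATG"):
    print(site, round(log_odds_score(site), 2))
```

Under this toy matrix the consensus CAGCTG scores highest, and a site that loses the 3' G scores lower, which is the sense in which a score can track motif fidelity.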

**Figure 6**
Heatmap depicting the number of **(A)** CAGCTG and **(B)** CANNTG E-box motifs in human germline IGHV genes (y axis) classified into subregions (x axis) based on the IMGT nomenclature. Each cell corresponds to a distinct IGHV sub-region and is colored by the number of E-box motifs (between 0 and 3) in that sub-region as shown in the key on the right. Each row corresponds to a unique IGHV allele. The dashed horizontal lines represent boundaries between the seven IGHV families (IGHV1–7).

**Figure 7**
**(A, B)** Scatter plots depicting correlations between IgG control and E2A ChIP-seq shown as reads per kilobase per million mapped reads (RPKM) values in 500 bp genomic bins for Ramos **(A)** and GM12878 **(B)** cells. Bins containing the rearranged IGHV, Eμ enhancer and TRBV20–1 are highlighted in blue, orange and green, respectively. The black line represents the y=x diagonal.
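RPKM normalizes a raw read count by both the bin length and the sequencing depth. A minimal sketch of the standard formula follows; the function name and the example numbers are assumptions and are not values from the study.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and library size must be positive")
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

# Hypothetical example: 50 reads in a 500 bp bin, 20 million mapped reads.
print(rpkm(50, 500, 20_000_000))
```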

**Figure 8**
Flowchart depicting synonymous mutations of the Gs at site 3 (G₃) of the CAGCTG E-box motif in FW1. Calculated mutation frequencies of G₃ sites before and after mutation of G₆ are indicated beside the arrows. Calculations were conducted using productive, clonally independent sequences across clones.
