Nucleic Acids Res. 2025 Jun 20;53(12):gkaf551.

doi: 10.1093/nar/gkaf551.

Building a neural network model to define DNA sequence specificity in V(D)J recombination

Justin C Harris¹, Jennifer N Byrum², Cooper B McKinney², Victoria Fairchild², Dee H Wu³, Andrew H Fagg⁴, Karla K Rodgers¹,²

Affiliations

¹ Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences, Oklahoma City, OK 73104, United States.
² Department of Microbiology and Immunology, University of Oklahoma Health Sciences, Oklahoma City, OK 73104, United States.
³ Department of Radiological Sciences, University of Oklahoma Health Sciences, Oklahoma City, OK 73104, United States.
⁴ Department of Computer Science, University of Oklahoma, Norman, OK 73019, United States.

PMID: 40548941
PMCID: PMC12205992
DOI: 10.1093/nar/gkaf551

Justin C Harris et al. Nucleic Acids Res. 2025.

Abstract

In developing lymphocytes, V(D)J recombination assembles functional antigen receptor (AgR) genes through rearrangement of the AgR loci to adjoin component gene segments. Each candidate gene segment for recombination is flanked by a recombination signal sequence (RSS), composed of heptamer and nonamer motifs separated by 12 or 23 base pairs. To initiate V(D)J recombination, the recombination activating proteins RAG1 and RAG2 create DNA double-stranded breaks between a 12/23-RSS pair and their adjoining gene segments. The basis for selection of individual RSSs during each V(D)J recombination event is not well understood due, in part, to the widespread distribution of the semi-conserved RSSs across the AgR loci. Using publicly available data for V(D)J recombination efficiencies on randomized 12-RSSs, we first built a neural network model that delineates how changes in sequence at certain positions in the RSS affect recombination efficiency. Second, to interpret the model's decision-making process, we repurposed the game-theoretic SHapley Additive exPlanations (SHAP) approach, with the results illustrating how nucleotides at pairwise positions in the heptamer provide synergistic contributions to recombination efficiency. Third, we trained a nonamer-informed neural network model with varied nonamer RSS substrates, and subsequently identified interdependent effects between the heptamer and nonamer regions on recombination efficiency.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
(A) Flowchart illustrating an overview of the steps taken throughout this study. (B) Diagram of canonical 12-RSS DNA sequence (above white line) and nomenclature of nucleotide position (below white line). The canonical 23-RSS heptamer and nonamer DNA sequence is the same as the 12-RSS but separated by 23 bp. (C) Sequence logo of the partially randomized N(H4-S2) 12-RSS substrate used in the SARP-seq experiment. (D) High, low, or no read counts of the 12-RSSs in the complete iSeq1 N(H4-S2) input dataset.
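
The position nomenclature in Figure 1B (H4 to H7 in the heptamer, S1 and S2 in the spacer) and the feature names used in later captions (e.g. H6_T, H7_G) suggest a one-hot encoding with four binary features per position. The following is a minimal sketch of such an encoding, assuming a feature order of position first and then A, C, G, T; that ordering and the function name `one_hot_h4s2` are illustrative assumptions, not the authors' code.

```python
# Sketch only: feature order and names are assumptions for illustration.
POSITIONS = ["H4", "H5", "H6", "H7", "S1", "S2"]
NUCLEOTIDES = "ACGT"

# 24 feature names such as "H4_A", "H6_T", "S2_G".
FEATURE_NAMES = [f"{pos}_{nt}" for pos in POSITIONS for nt in NUCLEOTIDES]


def one_hot_h4s2(seq):
    """Encode a 6-nt H4-S2 sequence as 24 binary features."""
    seq = seq.upper()
    if len(seq) != len(POSITIONS) or any(b not in NUCLEOTIDES for b in seq):
        raise ValueError("expected a 6-nt sequence over A, C, G, T")
    return [1 if base == nt else 0 for base in seq for nt in NUCLEOTIDES]


# Example: the AGTG|AT sequence (heptamer H4-H7 followed by S1-S2).
encoded = dict(zip(FEATURE_NAMES, one_hot_h4s2("AGTGAT")))
```

With this layout, exactly one of the four features at each position is set to 1, so every encoded sequence has six active features.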

**Figure 2.**
H4S2 model development and testing. (A) Flowchart illustrating H4S2 model development, training, and testing. In multiple rounds of k-fold cross-validation, each round attempted to optimize one hyperparameter, and stratified sampling ensured a consistent distribution of input data for each of the 20 models. The optimized hyperparameters were used to train the H4S2 model. (B) Predictions of the training, validation, and test datasets from H4S2 model training. (C and D) H4S2 model predictions for the SARP-seq N(H4-S2) dataset’s experimental replicates miSeq1 (C) and iSeq2 (D). (E) FVAF score [formula image], (F) Spearman rank-order correlation, and (G) RMSE for the SARP-seq N(H4-S2) dataset’s training experimental replicate iSeq1 (left bar, orange) and testing experimental replicates miSeq1 and iSeq2 (middle and right bars, blue).

**Figure 3.**
(A) A scatter plot of the H4S2 model prediction and the RIC score, with a Spearman rank-order correlation of ρ = 0.74. (B) A scatter plot of the log-transformed H4S2 model prediction and the RIC score, with a Spearman rank-order correlation of ρ = 0.74 and a Pearson correlation of r = 0.76. Four vertical striations are highlighted, sampled near RIC scores of −17 (Group1), −20 (Group2), −26 (Group3), and −32 (Group4). (C) Descriptive information on the sampled vertical striations, including the top and bottom H4-S2 sequences within each striation, the range of the H4S2 model prediction, the range of the RIC score, and the sequence logo for the striation.

**Figure 4.**
Analysis of the effect of sequence alignment on melting temperature for all 4-nt k-mers. All possible 4-mers were generated (n = 256). Each 4-mer was then positioned within the H4-S2 region to give three different alignments, referred to as alignment −1 (4-mer at H4-H7, colored yellow, ‘4-mer’NN), alignment 0 (4-mer at H5-S1, colored green, N‘4-mer’N), and alignment +1 (4-mer at H6-S2, colored lavender, NN‘4-mer’). Independent linear regressions were performed on all three alignments for each 4-mer sequence, where the regression was conducted between the log-transformed H4S2 model prediction and the melting temperature. (A) An example of T_m analysis for different alignment motifs is shown using the 4-mer ‘ACGT’. Sequences that match each alignment (as noted in the plot’s legend) were pulled. Each alignment set contains 16 different sequences. Linear regression was performed on each alignment set, and the slope, R², and centroid for each set were determined. The resulting values for the ACGT alignments and the log(H4S2 prediction) are listed in the table. (B) A box plot of the T_m centroid for the distribution of samples that match a given 4-mer sequence and alignment motif. (C) A box plot of the log-transformed H4S2 model prediction centroid for the distribution of samples that match a given 4-mer sequence and alignment motif. (D) A box plot of the slope of each regression. Purple (B and C) and cyan (D) lines connect the same 4-mer sequence through the three different alignments (−1, 0, +1). Together, all 4-mer sequences, regardless of sequence content or alignment, showed a consistently negative relationship with melting temperature within the H4-S2 region (see Supplementary Dataset S2 for individual regression results). (E) A scatter plot of each regression line’s slope and R² fit metric for all possible 4-mer sequences and their three alignments. The red dashed circle highlights the +1 alignment values with the higher T_m dependence.
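
The enumeration described in this caption (256 possible 4-mers, three alignments within the 6-nt H4-S2 window, 16 sequences per alignment set) can be sketched as follows, together with an ordinary least-squares fit that returns slope, intercept, and R². The function names are illustrative. Which variable served as x and which as y in the published regressions is not stated here, so the fit is written generically.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# All 4**4 = 256 possible 4-mers.
ALL_4MERS = ["".join(p) for p in product(NUCLEOTIDES, repeat=4)]


def alignment_set(kmer, alignment):
    """Return the 16 H4-S2 sequences that contain `kmer` at an alignment.

    alignment -1: kmer at H4-H7 (kmer + NN)
    alignment  0: kmer at H5-S1 (N + kmer + N)
    alignment +1: kmer at H6-S2 (NN + kmer)
    """
    if alignment not in (-1, 0, 1):
        raise ValueError("alignment must be -1, 0, or +1")
    left = alignment + 1  # number of free positions left of the 4-mer
    seqs = []
    for fill in product(NUCLEOTIDES, repeat=2):
        seqs.append("".join(fill[:left]) + kmer + "".join(fill[left:]))
    return seqs


def linear_fit(x, y):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)
```

The centroid reported for each alignment set corresponds to the pair of means (`mx`, `my`) computed inside `linear_fit`.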

**Figure 5.**
(A) The contribution of each feature’s SHAP value for every 12-RSS in the complete experimental iSeq1 dataset (Supplementary Dataset S1) to the H4S2 model’s prediction. Nucleotides present in the 12-RSS sequence at a particular position are shown as red symbols, while nucleotides absent from the sequence are shown as blue symbols. (B) The mean of the absolute value of the SHAP values, representing feature importance and total contribution or nondirectional change to the H4S2 model’s prediction. (C) The sequence of the 12-RSS heptamer and two adjoining spacer positions used in the plasmid substrates in the fluorescence-based V(D)J recombination assay, where black identifies sequence identity with the canonical sequence and teal indicates nucleotides that varied from the canonical 12-RSS (shown in Fig. 1B). (D) V(D)J recombination on the plasmid substrates containing the indicated 12-RSSs, measured by the fluorescence-based V(D)J recombination assay. V(D)J recombination activity is plotted as %GFP-positive cells for each 12-RSS-containing plasmid. The sequences shown on the horizontal axis are the H4-S2 sequences in each 12-RSS. The negative control used the AGTG|AT plasmid substrate in the absence of a RAG1 expression vector in the recombination assay. (E) Relationship between H4S2 model prediction and the independently measured recombination activity of the five separate 12-RSSs (Pearson r = 0.97). (F) Cumulative SHAP values corresponding to the H4S2 model’s prediction for independently measured 12-RSSs. The asterisks denote the P-value range for an ordinary one-way ANOVA with Dunnett’s multiple comparisons test, where ns is not significant; *P < 0.05; **P < 0.01; and ****P < 0.0001.
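
The two SHAP summaries in panels B and F can be sketched from a matrix of SHAP values (one row per 12-RSS, one column per feature). The numbers in the example are made up for illustration, and the use of a base value plus the row sum relies on the additive property of SHAP explanations rather than on anything stated in the caption.

```python
def mean_abs_shap(shap_rows):
    """Per-feature mean of |SHAP value| across samples (as in panel B)."""
    n = len(shap_rows)
    n_features = len(shap_rows[0])
    return [sum(abs(row[j]) for row in shap_rows) / n for j in range(n_features)]


def cumulative_shap(shap_row, base_value):
    """Base value plus the sum of one sample's SHAP values.

    For an additive explanation this reconstructs the model prediction
    for that sample.
    """
    return base_value + sum(shap_row)


# Hypothetical SHAP values for two samples and two features.
example_rows = [[0.2, -0.1], [-0.4, 0.3]]
importance = mean_abs_shap(example_rows)
```

Taking the absolute value before averaging is what makes the measure nondirectional: a feature that pushes predictions up in some sequences and down in others still scores as important.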

**Figure 6.**
(A) A multimodal pairwise comparison of two features’ SHAP values (Feature 1: position H6_T and Feature 2: position H7_G). The modes are explained by the combination of binary feature encodings, where green is 1,1; light green is 1,0; pink is 0,1; and yellow is 0,0. Each black arrow is a cooperative relationship vector (CRV) fit by PCA and characterizes a single modality of the multimodal joint distribution. Histograms along each axis display each feature’s SHAP values. (B) Polar plot projection of all CRVs between every combination of two features and their four combinations of binary feature encodings (0,0; 0,1; 1,0; and 1,1) (blue). Vectors from the Feature 1: position H7_G and Feature 2: position H6_T comparison are colored yellow. (C) The length and angle plot of all CRVs between each feature and every other feature. CRVs are colored by their feature combination, where green is 1,1; light green is 1,0; pink is 0,1; and yellow is 0,0. The histograms along each axis display the distributions of CRV length and CRV angle. The table above the plot quantifies the number of CRVs present in each quadrant by each feature combination. The points with a red outline are CRVs that correspond to Feature 1: position H6_T and Feature 2: position H7_G. The points with a blue outline are CRVs with the reverse ordering, Feature 1: position H7_G and Feature 2: position H6_T. (D–I) CRVs grouped by their shared Feature 1 (e.g. H4_A, ##_N). Each plot shows the cumulative summation of the vector length as the vector angle sweeps from 225° counterclockwise around the circle, for Feature 1 and all its pairwise comparisons. The cumulative summation distributions are shown for (D) position H4, (E) position H5, (F) position H6, (G) position H7, (H) position S1, and (I) position S2, with the four features at each position derived from one-hot encoding. Supplementary Figure S11 shows the vector length and angle plots for each position and its four nucleotide features prior to cumulative summation. Plot regions with vertical-line hatching highlight Q1 and CRVs with positive cooperation, and regions with dotted hatching indicate Q3 and CRVs with negative cooperation.
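
One way to picture the CRV construction is to split the paired SHAP values by the four binary encoding combinations, fit the first principal axis of each mode, and then accumulate vector lengths while sweeping the angle from 225°. The sketch below makes several assumptions that the caption does not specify: the CRV length is taken as the extent of the points along the principal axis, and the axis is oriented toward the mode's centroid so that a mode with two positive SHAP values lands in Q1. All function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_crv(points):
    """Fit the first principal axis of 2-D points; return (length, angle_deg).

    Assumptions: length = spread of the projections onto the axis;
    direction = the side of the axis on which the centroid lies.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form angle of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    proj = [(p[0] - mx) * ux + (p[1] - my) * uy for p in points]
    length = max(proj) - min(proj)
    if mx * ux + my * uy < 0:  # orient toward the centroid (assumption)
        ux, uy = -ux, -uy
    return length, math.degrees(math.atan2(uy, ux)) % 360.0


def crvs_by_mode(shap1, shap2, enc1, enc2):
    """Group paired SHAP values by (enc1, enc2) and fit one CRV per mode."""
    groups = defaultdict(list)
    for s1, s2, e1, e2 in zip(shap1, shap2, enc1, enc2):
        groups[(e1, e2)].append((s1, s2))
    return {m: fit_crv(pts) for m, pts in groups.items() if len(pts) >= 2}


def cumulative_length_sweep(crvs, start_deg=225.0):
    """Cumulative CRV length as the angle sweeps counterclockwise."""
    ordered = sorted(crvs, key=lambda c: (c[1] - start_deg) % 360.0)
    total, curve = 0.0, []
    for length, angle in ordered:
        total += length
        curve.append(((angle - start_deg) % 360.0, total))
    return curve
```

Starting the sweep at 225° places the middle of Q3 at the origin of the cumulative curve, so contributions from Q1 appear half a turn later.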

**Figure 7.**
(A) Flowchart of training and testing the H4S2_cNon model. The table shows the degenerate 12-RSS sequences in the CF1-, Pax3-, and LMO2- SARP-seq N(H4-H7)K(S2) dataset. (B) Recombination activity level of the 12-RSSs in the cryptic nonamer dataset, where the 12-RSSs ranked by average read count across the three nonamer datasets are plotted on the x-axis, and the y-axis is the min–max normalized read count for each nonamer subset. (C) Cumulative summation of the nonamer-wise normalized read count of each nonamer in the cryptic nonamer datasets. (D) Left: The relationship between the H4S2 model predictions and the cryptic nonamer dataset’s dataset-wide normalized read counts, where the entire dataset is min–max normalized without prior grouping. Right: FVAF score for the scatter plot. (E) Left: The relationship between the H4S2 model prediction and the cryptic nonamer dataset’s nonamer-wise normalized read counts. Right: FVAF score for the scatter plot. (F) Left: The relationship between the H4S2_cNon model prediction and the cryptic nonamer dataset’s dataset-wide normalized read counts. Right: FVAF score for the scatter plot. (G) Externally tested 12-RSS sequences. Black boxes are nucleotide positions that were constant across all training samples and were not encoded as inputs for the model. Light blue boxes are “informed nucleotide” positions that were included in the training samples and were encoded as inputs for the model. Gold boxes are “uninformed nucleotide” positions that were not included in the training datasets, highlighting nucleotides the H4S2_cNon model had not previously seen during training. (H) %GFP-positive cells measured by the fluorescence-based V(D)J recombination assay, with each 12-RSS measured individually. The asterisks denote the P-value range for an ordinary one-way ANOVA with Dunnett’s multiple comparisons test, where ns is not significant; **P < 0.01; ***P < 0.001; and ****P < 0.0001. (I) Relationship between H4S2_cNon model prediction and the independently measured recombination activity of seven different 12-RSSs (Pearson r = 0.70).
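
The difference between dataset-wide and nonamer-wise min–max normalization (panels D to F) can be illustrated with a short sketch. The read counts below are invented, and the function names are hypothetical; only the two normalization schemes themselves come from the caption.

```python
def min_max(values, lo=None, hi=None):
    """Min-max scale values to [0, 1], optionally with external bounds."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_dataset_wide(counts_by_nonamer):
    """Scale all read counts with one global min and max (no grouping)."""
    pooled = [v for vals in counts_by_nonamer.values() for v in vals]
    lo, hi = min(pooled), max(pooled)
    return {k: min_max(vals, lo, hi) for k, vals in counts_by_nonamer.items()}


def normalize_nonamer_wise(counts_by_nonamer):
    """Scale each nonamer subset with its own min and max."""
    return {k: min_max(vals) for k, vals in counts_by_nonamer.items()}


# Invented read counts for two nonamer subsets.
example_counts = {"CF1": [0, 50, 100], "Pax3": [0, 5, 10]}
dataset_wide = normalize_dataset_wide(example_counts)
nonamer_wise = normalize_nonamer_wise(example_counts)
```

Dataset-wide scaling preserves differences in overall activity between nonamers, whereas nonamer-wise scaling removes them and leaves only the relative ranking of heptamer variants within each nonamer.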

**Figure 8.**
(A) SHAP values for the H4-S2 features of the H4S2_cNon model predictions for the CF1 nonamer sequences in the dataset. (B) The mean absolute SHAP value of each H4-S2 feature for the CF1 nonamer sequences in the dataset, i.e. the average nondirectional change in the model’s prediction. The dashed red line shows the CF1 global mean of the absolute SHAP values across all H4-S2 features. (C) SHAP values for the H4-S2 features of the H4S2_cNon model predictions for the Pax3 nonamer sequences in the dataset. (D) The mean absolute SHAP value of each H4-S2 feature for the Pax3 nonamer sequences in the dataset. The dashed red line shows the Pax3 global mean of the absolute SHAP values across all H4-S2 features. (E) SHAP values for the H4-S2 features of the H4S2_cNon model predictions for the LMO2 nonamer sequences in the dataset. (F) The mean absolute SHAP value of each H4-S2 feature for the LMO2 nonamer sequences in the dataset. The dashed red line shows the LMO2 global mean of the absolute SHAP values across all H4-S2 features. (G) Kolmogorov–Smirnov test statistics for pairwise comparisons among the three nonamers’ SHAP values, each rescaled by dividing by that nonamer’s global mean.
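
The comparison in panel G can be sketched as a rescaling step followed by a two-sample Kolmogorov–Smirnov statistic. This is a plain-Python illustration with hypothetical function names; it computes only the KS statistic (the maximum gap between the two empirical distribution functions), not a P-value, and it assumes the global mean is the mean of absolute SHAP values as described for the dashed red lines.

```python
import bisect


def rescale_by_global_mean(shap_values):
    """Divide SHAP values by the mean of their absolute values."""
    global_mean = sum(abs(v) for v in shap_values) / len(shap_values)
    return [v / global_mean for v in shap_values]


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def compare_nonamers(shap_a, shap_b):
    """KS statistic between two nonamers' rescaled SHAP distributions."""
    return ks_statistic(rescale_by_global_mean(shap_a),
                        rescale_by_global_mean(shap_b))
```

Rescaling first means that two nonamers whose SHAP values differ only by an overall scale factor yield a KS statistic of 0, so the statistic reflects differences in the shape of the distributions.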

**Figure 9.**
(A) Graphical summary of the overall results. (i) The top panel summarizes the major functional contributions of three consecutive dinucleotides (H4-H5, H6-H7, and S1-S2) to V(D)J recombination activity, as described in the text. (ii) The four nucleotides are displayed on a grid for every position. The FILL COLOR of each nucleotide and the fill level of its associated bar correspond to the measured strength of the first-order interactions, given by the average absolute change Mean(|SHAP value|), which represents the impact that a nucleotide has on the H4S2 model prediction (see Fig. 5B). The OUTLINE COLOR that frames each nucleotide:position indicates which nucleotides share similar profiles of pairwise interactions with all other input features, and which the model therefore does not differentiate from one another. The groupings were derived by comparing every position’s four nucleotides and their CRVs by cumulative summation (see Fig. 6D–I). The green asterisk indicates potential candidates for local long-distance interactions between the heptamer and nonamer regions (see Fig. 8G; where the KS metric is ≥0.20). (iii) The CRV groupings of computationally similar nucleotides, represented by their corresponding IUPAC ambiguous DNA codes. (iv) A legend marking the location and meaning of each nucleotide:position’s fill color, outline color, and green asterisk. (B) A set of heatmaps for each position in the H4-S2 region (see Supplementary Fig. S7A), reduced into minimal categories using CRV second-order interaction groupings (grouped by outline color in Fig. 9A, iv). Each heatmap is the average change in the H4S2 model prediction for the reduced category of possible point mutations at that position, obtained by simulated mutation of the vertical axis’s degenerate nucleotide identity to the horizontal axis’s degenerate nucleotide identity.
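
The simulated point mutation behind the panel B heatmaps can be sketched as follows: for every sequence carrying a given nucleotide at a position, swap in the target nucleotide and average the change in the model's prediction. The sketch works with single nucleotides rather than the degenerate categories used in the figure, and `toy_predict` is a made-up stand-in for the trained H4S2 model, included only so that the example runs.

```python
def mean_mutation_effect(predict, sequences, position, from_nt, to_nt):
    """Average change in prediction when from_nt -> to_nt at a position.

    Only sequences that carry `from_nt` at `position` are mutated.
    Returns None if no sequence qualifies.
    """
    deltas = []
    for seq in sequences:
        if seq[position] != from_nt:
            continue
        mutated = seq[:position] + to_nt + seq[position + 1:]
        deltas.append(predict(mutated) - predict(seq))
    return sum(deltas) / len(deltas) if deltas else None


def toy_predict(seq):
    """Hypothetical additive scorer; NOT the H4S2 model."""
    weights = {"A": 1.0, "C": 0.5, "G": 0.25, "T": 0.0}
    return sum(weights[base] for base in seq)


# Invented H4-S2 sequences; position 0 corresponds to H4.
example_seqs = ["AGTGAT", "AGTGCC", "CGTGAT"]
effect_h4_a_to_c = mean_mutation_effect(toy_predict, example_seqs, 0, "A", "C")
```

Repeating this for every ordered pair of nucleotides at one position fills a 4 × 4 grid, which could then be collapsed into the degenerate categories shown in the figure.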

Grants and funding

HR21-142/Oklahoma Center for Advancement in Science and Technology
