PLoS Comput Biol. 2018 Oct 17;14(10):e1006388.
doi: 10.1371/journal.pcbi.1006388. eCollection 2018 Oct.

Predicting B cell receptor substitution profiles using public repertoire data

Amrit Dhar et al. PLoS Comput Biol.

Abstract

B cells develop high affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e. in the same "clonal family") are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-family-specific frequency profiles, called "substitution profiles", are useful for studying the course of affinity maturation as well as for antibody engineering purposes. However, most often only a single sequence is recovered from each clonal family in a sequencing experiment, making it impossible to construct a clonal-family-specific substitution profile. Given the public release of many high-quality large B cell receptor datasets, one may ask whether it is possible to use such data in a prediction model for clonal-family-specific substitution profiles. In this paper, we present the method "Substitution Profiles Using Related Families" (SPURF), a penalized tensor regression framework that integrates information from a rich assemblage of datasets to predict the clonal-family-specific substitution profile for any single input sequence. Using this framework, we show that substitution profiles from similar clonal families can be leveraged together with simulated substitution profiles and germline gene sequence information to improve prediction. We fit this model on a large public dataset and validate the robustness of our approach on two external datasets. Furthermore, we provide a command-line tool in an open-source software package (https://github.com/krdav/SPURF) implementing these ideas and providing easy prediction using our pre-fit models.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Amino acid substitution profiles viewed from three different perspectives.
High-throughput sequencing data (HTS data) yields large amounts of VDJ sequences, but because of uneven sampling many CFs will be sampled just once, resulting in poor representations of the amino acid substitution profiles of those true CFs. “Substitution Profiles Using Related Families” (SPURF) is a statistical framework that integrates large-scale Rep-Seq data to predict amino acid substitution profiles for singleton CFs. In vivo affinity maturation tests many different mutations, and the resulting CFs reflect the amino acid substitution profiles that we attempt to predict.
Fig 2. Model overview figure.
SPURF uses a per-site linear combination of substitution profiles from diverse sources to predict complete substitution profiles from a single member of a CF. At the top are the different profiles that serve as inputs to the model: some are directly related to the naive sequence (X^naiveAA and X^neut), and others are partitions of the public Rep-Seq datasets (X^vgene and X^vsubgrp). To predict a substitution profile, a weighted average is taken over the input sequence X and external profiles X*={X^naiveAA,X^vgene,X^neut,X^vsubgrp} (see the dashed line bubble). The vertical blue arrow indicates that the weighted average (in the dashed line bubble) occurs at each of the 149 AHo positions. Once a predicted profile is generated, it is compared to the ground truth using either L2 error or Jaccard similarity as a performance metric. The α vectors are estimated by optimizing the objective function, which also includes a statistical regularization term to prevent overfitting.
Fig 3. A stacked barplot of the estimated parameter values of α from the best regularized L2 model.
For convenience, we aggregate the estimates of α associated with X^vgene and X^vsubgrp (blue) and with X^naiveAA and X^neut (red). The black vertical lines represent the boundaries between the different CDRs and FWKs.
Fig 4. The model performance results across the different antibody regions on the model fitting test dataset and the Briggs validation dataset.
In these plots, we compare the performances from our best models to the baseline predictive performances using only the input sequence (i.e. model predictions with all parameter values of α set to 0). The error bars show bootstrap standard errors.
Fig 5. Positional profile weights α mapped to an antibody protein structure (PDB: 5X8L).
The antigen (PD-L1) appears as a purple surface at the top of the images, the light chain appears in white cartoon, and the heavy chain is displayed using a blue to red color gradient; the grey dashed lines mark the CDR loops. The color gradient represents the possible values of profile weights in α and goes from blue at a zero weight to red at the maximum weight for the profile. The display in panels B and C is rotated relative to panel A to better show results for CDR1 and CDR3; as a consequence, the CDR2 loop is hidden behind the CDR1. Panel A shows that the input sequence has high weight at the CDR1 and CDR2, panel B illustrates that the naive sequence and the neutral substitution profile have high weight at the CDR3 and FWK4, and panel C demonstrates that the V gene and V subgroup profiles are highly weighted in parts of the CDR1 but more generally in the FWKs, especially at the heavy and light chain interface.
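
The per-site weighted average described in the Fig 2 caption, and the two performance metrics it names, can be illustrated with a small sketch. This is not the SPURF implementation: the function names, the toy data, the 20-letter alphabet, the 0.05 frequency cutoff used for the Jaccard sets, and the choice to give the input sequence the weight 1 minus the sum of the α weights at each site are all assumptions made for illustration. The last assumption is chosen so that setting every α to 0 reproduces the input-only baseline mentioned in the Fig 4 caption.

```python
import random

N_SITES = 149                          # AHo positions, per the Fig 2 caption
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed 20-letter alphabet, gaps ignored


def predict_profile(x_input, external, alpha):
    """Per-site weighted average of the input profile and K external profiles.

    x_input:  list of N_SITES rows, each a list of amino acid frequencies
    external: list of K profiles with the same shape as x_input
    alpha:    list of K weight vectors, one weight per site
    Assumption: the input sequence gets weight 1 - sum_k alpha[k][s] at site s.
    """
    predicted = []
    for s, input_row in enumerate(x_input):
        w_input = 1.0 - sum(a[s] for a in alpha)
        row = []
        for j, x in enumerate(input_row):
            ext = sum(a[s] * prof[s][j] for a, prof in zip(alpha, external))
            row.append(w_input * x + ext)
        predicted.append(row)
    return predicted


def l2_error(pred, truth):
    """Sum of squared differences over all sites and amino acids (assumed form)."""
    return sum((p - t) ** 2
               for pred_row, truth_row in zip(pred, truth)
               for p, t in zip(pred_row, truth_row))


def jaccard_similarity(pred, truth, cutoff=0.05):
    """Jaccard similarity of the (site, amino acid) pairs above a cutoff.

    The cutoff value is hypothetical; it is not taken from the source.
    """
    def support(profile):
        return {(s, j)
                for s, row in enumerate(profile)
                for j, v in enumerate(row) if v > cutoff}

    p, t = support(pred), support(truth)
    union = p | t
    return len(p & t) / len(union) if union else 1.0


def one_hot(residue):
    """Profile row for a single observed residue."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]


# Toy data: a random single input sequence and four uniform external profiles
# standing in for X^naiveAA, X^vgene, X^neut and X^vsubgrp.
random.seed(0)
x_input = [one_hot(random.choice(AMINO_ACIDS)) for _ in range(N_SITES)]
uniform = [[1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS) for _ in range(N_SITES)]
external = [uniform] * 4

alpha = [[0.1] * N_SITES for _ in range(4)]        # arbitrary example weights
zero_alpha = [[0.0] * N_SITES for _ in range(4)]   # input-only baseline

pred = predict_profile(x_input, external, alpha)
baseline = predict_profile(x_input, external, zero_alpha)
```

In SPURF the α vectors are estimated by optimizing a regularized objective, as the caption states; here they are fixed by hand only to show how a prediction is formed and scored against a reference profile.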
