Proc Natl Acad Sci U S A. 2015 Feb 24;112(8):E862-70.

doi: 10.1073/pnas.1417683112. Epub 2015 Feb 9.

Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles

Daniel Gadala-Maria¹, Gur Yaari², Mohamed Uduman³, Steven H Kleinstein⁴

Affiliations

¹ Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511;
² Department of Pathology and Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan 5290002, Israel.
³ Department of Pathology and.
⁴ Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511; Department of Pathology and Department of Immunobiology, Yale University School of Medicine, Yale University, New Haven, CT 06511; and steven.kleinstein@yale.edu.

PMID: 25675496
PMCID: PMC4345584
DOI: 10.1073/pnas.1417683112

Daniel Gadala-Maria et al. Proc Natl Acad Sci U S A. 2015.

Abstract

Individual variation in germline and expressed B-cell immunoglobulin (Ig) repertoires has been associated with aging, disease susceptibility, and differential response to infection and vaccination. Repertoire properties can now be studied at large scale through next-generation sequencing of rearranged Ig genes. Accurate analysis of these repertoire-sequencing (Rep-Seq) data requires identifying the germline variable (V), diversity (D), and joining (J) gene segments used by each Ig sequence. Current V(D)J assignment methods work by aligning sequences to a database of known germline V(D)J segment alleles. However, existing databases are likely to be incomplete, and novel polymorphisms are hard to differentiate from the frequent somatic hypermutations in Ig sequences. Here we develop a Tool for Ig Genotype Elucidation via Rep-Seq (TIgGER). TIgGER analyzes mutation patterns in Rep-Seq data to identify novel V segment alleles, and also constructs a personalized germline database containing the specific set of alleles carried by a subject. This information is then used to improve the initial V segment assignments from existing tools, such as IMGT/HighV-QUEST. The application of TIgGER to Rep-Seq data from seven subjects identified 11 novel V segment alleles, including at least one in every subject examined. These novel alleles constituted 13% of the total number of unique alleles in these subjects, and impacted 3% of V(D)J segment assignments. These results reinforce the highly polymorphic nature of human Ig V genes, and suggest that many novel alleles remain to be discovered. The integration of TIgGER into Rep-Seq processing pipelines will increase the accuracy of V segment assignments, thus improving B-cell repertoire analyses.

Keywords: B-cell repertoire; adaptive immunity; next-generation sequencing; somatic hypermutation; variable gene segment.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
Overview of the TIgGER workflow. IMGT/HighV-QUEST is used to determine initial V(D)J assignments (step 1). TIgGER uses these initial gene segment assignments to analyze mutation patterns and detect a putative set of novel alleles (step 2). The germline gene segment database is then extended by adding these novel alleles to improve the initial V(D)J assignments (step 3). The extended V(D)J assignments are then analyzed to determine the genotype of a subject, and generate their personalized germline database (step 4). A final set of V allele assignments is then made (step 5).

**Fig. 2.**
Mutation frequencies of IGHV positions. The mutation frequency of each IMGT-numbered nucleotide position was determined for sequences that best aligned to *IGHV1-2\*02* in subject hu420143. Somatic mutations were determined through comparison with the germline sequence reported by IMGT/HighV-QUEST, and sequences that were assigned to multiple alleles including *IGHV1-2\*02* were included in the analysis. (*Left*) The mutation frequency plotted as a function of IMGT-numbered nucleotide position. (*Right*) The mutation frequency plotted as a function of predicted mutability under the S5F targeting model.

**Fig. 3.**
Mutation patterns for polymorphic positions in IGHV sequences. The pattern of mutation accumulation in *IGHV1-2\*02* and a hypothetical unknown allele of *IGHV1-2\*02* (containing a polymorphism at position 163) was simulated as described in *Methods*. The fraction of sequences carrying a mutation at each IMGT position (lines) was then determined for groups of sequences sharing the same total IGHV mutation count, assuming that the subject was homozygous (*Upper Left*) or heterozygous (*Lower Left*) for the unknown allele. Equal allele use was assumed for the heterozygous case. The same analysis was performed on experimentally observed sequences that aligned to *IGHV1-2\*02* from subjects M5 (*Upper Center*) and hu420143 (*Lower Center*). For these experimental data, the fraction of sequences carrying a mutation at each IMGT position (points), irrespective of total IGHV mutation count, was also analyzed (*Right*). IMGT position 163 is indicated by an arrow in all panels.
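
The grouping described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, sequences are assumed to be pre-aligned strings of the same length as the germline, and a 1-based string index stands in for IMGT position numbering.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

BASES = set("ACGT")


def mutated_positions(seq: str, germline: str) -> Set[int]:
    """Return 1-based positions where seq differs from germline (unambiguous bases only)."""
    return {
        i + 1
        for i, (s, g) in enumerate(zip(seq.upper(), germline.upper()))
        if s in BASES and g in BASES and s != g
    }


def fraction_mutated_by_count(
    seqs: Iterable[str], germline: str, position: int
) -> Dict[int, float]:
    """For each total mutation count, the fraction of sequences mutated at `position`."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for seq in seqs:
        muts = mutated_positions(seq, germline)
        totals[len(muts)] += 1
        if position in muts:
            hits[len(muts)] += 1
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy data (invented for illustration, not real IGHV sequences).
toy_germline = "AAAAAAAA"
toy_seqs = ["AAAAAAAA", "AACAAAAA", "AACAAAAA", "AAAAAGAA", "AACAAGAA"]
toy_result = fraction_mutated_by_count(toy_seqs, toy_germline, position=3)
```

Plotting the returned fractions against mutation count for every position would give one line per position, comparable in spirit to the left and center panels.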

**Fig. 4.**
Distances between known alleles. For each IGHV germline allele sequence in the IMGT database, the Hamming distance to every other germline allele was calculated to determine the nearest allele. Gaps and degenerate alleles were excluded from the distance calculation, and pairs of alleles with distance zero were excluded altogether.
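
A nearest-allele calculation of this kind might look like the following Python sketch. It assumes the alleles are given as equal-length aligned strings, treats any character other than A, C, G, or T as a gap or degenerate base to be skipped, and reads "excluded altogether" as skipping zero-distance pairs; the allele names and sequences are invented.

```python
from typing import Dict, Tuple

BASES = set("ACGT")


def hamming_distance(a: str, b: str) -> int:
    """Count mismatches, skipping positions where either base is a gap or degenerate."""
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in BASES and y in BASES and x != y
    )


def nearest_alleles(alleles: Dict[str, str]) -> Dict[str, Tuple[str, int]]:
    """Map each allele to (nearest other allele, distance), ignoring zero-distance pairs."""
    nearest: Dict[str, Tuple[str, int]] = {}
    for name, seq in alleles.items():
        best = None
        for other, other_seq in alleles.items():
            if other == name:
                continue
            d = hamming_distance(seq, other_seq)
            if d == 0:
                continue  # pair is indistinguishable once gaps/degenerate bases are skipped
            if best is None or d < best[1]:
                best = (other, d)
        if best is not None:
            nearest[name] = best
    return nearest


# Toy alignment (hypothetical names and sequences).
toy_alleles = {
    "GENE1*01": "ACGTACGT",
    "GENE1*02": "ACGTACGA",
    "GENE1*03": "ACGTNCGT",  # differs from *01 only at a degenerate base
    "GENE2*01": "TTGTAC.T",  # contains a gap character
}
toy_nearest = nearest_alleles(toy_alleles)
```

Ties are broken by dictionary order here, which is an arbitrary choice; only the distance itself matters for a histogram like the one the caption describes.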

**Fig. 5.**
Sensitivity of polymorphism detection method. For each allele assignment given to at least 500 sequences in a subject, matching sequences were all reassigned to another allele of that gene. The TIgGER polymorphism detection method was then applied, to test whether the positions required to recreate the artificially excluded known allele could be detected. This analysis was performed for all alleles in samples derived from the 454 sequencing platform, excluding those in which TIgGER had previously identified polymorphisms. Horizontal bars indicate mean sensitivity across the three subjects tested. For each subject, the number of alleles falling into each distance group is indicated along the bottom.

**Fig. 6.**
The influence of allele assignment frequency cut-offs on IGHV genotype zygosity. TIgGER was used to determine subject-specific IGHV genotypes using different values for the allele assignment frequency cut-off (i.e., fraction of assignments to a gene segment that are required to be composed of a single allele to be included in the genotype). For each of the three 454 datasets, the number of alleles included in the inferred IGHV genotype is shown as a function of the allele assignment frequency cut-off (*Upper*). For PGP1, the distribution of allele assignments is shown for each gene included in the inferred genotype (*Lower*). For each bar, the lightest gray represents the most common allele, darkest gray the second most common, medium gray the third, and white for all others.
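
One way to read the cut-off is sketched below in Python. This is an assumption about the procedure, not a statement of the authors' method: alleles of a gene are added in order of decreasing assignment frequency until together they account for the cut-off fraction, so a gene whose top allele alone reaches the cut-off is called with a single allele. The function name, the `GENE*allele` call format, and the counts are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def infer_genotype(assignments: Iterable[str], cutoff: float) -> Dict[str, List[str]]:
    """Per gene, keep the most frequent alleles until `cutoff` of its assignments is covered."""
    per_gene: Dict[str, Counter] = {}
    for call in assignments:
        gene = call.partition("*")[0]  # text before '*' is taken as the gene name
        per_gene.setdefault(gene, Counter())[call] += 1

    genotype: Dict[str, List[str]] = {}
    for gene, counts in per_gene.items():
        total = sum(counts.values())
        kept: List[str] = []
        covered = 0
        for call, n in counts.most_common():
            if covered / total >= cutoff:
                break
            kept.append(call)
            covered += n
        genotype[gene] = kept
    return genotype


# Toy assignment counts (invented).
toy_calls = (
    ["GENEA*01"] * 90 + ["GENEA*02"] * 10
    + ["GENEB*01"] * 55 + ["GENEB*02"] * 40 + ["GENEB*03"] * 5
)
toy_strict = infer_genotype(toy_calls, cutoff=0.875)
toy_loose = infer_genotype(toy_calls, cutoff=0.99)
```

Under this reading, raising the cut-off can only add alleles to a gene's genotype, which is one way the number of inferred alleles could vary with the cut-off as the upper panel shows.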

**Fig. 7.**
IGHV genotyping greatly improves allele assignments. The percentage of sequences with multiple IGHV allele assignments before and after genotype-based allele reassignment was determined for the three subjects sequenced by 454 (*Upper*). The percentage of sequences that could not be assigned to genotype alleles before and after genotype-based allele reassignment for the same three subjects was also determined (*Lower*).
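
The two percentages in this caption can be computed as in the Python sketch below. The reassignment step here is a deliberately simplified stand-in (it only filters each sequence's candidate alleles down to those in the genotype) and should not be taken as the reassignment TIgGER performs; all names and data are hypothetical.

```python
from typing import List, Set


def percent_multiple(calls: List[Set[str]]) -> float:
    """Percentage of sequences assigned to more than one allele."""
    return 100.0 * sum(len(c) > 1 for c in calls) / len(calls)


def percent_outside_genotype(calls: List[Set[str]], genotype: Set[str]) -> float:
    """Percentage of sequences none of whose assigned alleles is in the genotype."""
    return 100.0 * sum(not (c & genotype) for c in calls) / len(calls)


def restrict_to_genotype(calls: List[Set[str]], genotype: Set[str]) -> List[Set[str]]:
    """Simplified reassignment: drop candidate alleles absent from the genotype.

    Sequences with no genotype allele among their candidates are left unchanged.
    """
    return [(c & genotype) or c for c in calls]


# Toy calls for four sequences (invented).
toy_calls = [{"G*01", "G*02"}, {"G*01"}, {"G*03"}, {"G*01", "G*03"}]
toy_genotype = {"G*01"}

before_multi = percent_multiple(toy_calls)
after_calls = restrict_to_genotype(toy_calls, toy_genotype)
after_multi = percent_multiple(after_calls)
outside = percent_outside_genotype(toy_calls, toy_genotype)
```

Because this stand-in never changes which sequences lack a genotype allele, it can only illustrate the upper-panel metric improving; the lower-panel metric would need an actual re-alignment against the personalized germline database.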


