Targeted genotyping of variable number tandem repeats with adVNTR

Mehrdad Bakhtiari¹, Sharona Shleizer-Burko², Melissa Gymrek¹,², Vikas Bansal³, Vineet Bafna¹

Affiliations

¹ Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, USA.
² Department of Medicine, University of California, San Diego, La Jolla, California 92093, USA.
³ Department of Pediatrics, University of California, San Diego, La Jolla, California 92093, USA.

PMID: 30352806
PMCID: PMC6211647
DOI: 10.1101/gr.235119.118

Mehrdad Bakhtiari et al. Genome Res. 2018 Nov;28(11):1709-1719. doi: 10.1101/gr.235119.118. Epub 2018 Oct 23.

Abstract

Whole-genome sequencing is increasingly used to identify Mendelian variants in clinical pipelines. These pipelines focus on single-nucleotide variants (SNVs) and structural variants, while ignoring more complex repeat sequence variants. Here, we consider the problem of genotyping Variable Number Tandem Repeats (VNTRs), which are composed of inexact tandem duplications of short (6–100 bp) repeating units. VNTRs span 3% of the human genome, are frequently present in coding regions, and have been implicated in multiple Mendelian disorders. Although existing tools recognize VNTR-carrying sequence, genotyping VNTRs (determining repeat unit count and sequence variation) from whole-genome sequencing reads remains challenging. We describe a method, adVNTR, that uses hidden Markov models to model each VNTR, count repeat units, and detect sequence variation. adVNTR models can be developed for short-read (Illumina) and single-molecule (Pacific Biosciences [PacBio]) whole-genome and whole-exome sequencing, and show good results on multiple simulated and real data sets.

Figures

**Figure 1.**
Read recruitment quality on Illumina reads. (A) Comparison of the recall (number of true recruited reads/number of true reads) of adVNTR read recruitment against BWA-MEM and Bowtie 2, as a function of VNTR length for 1775 VNTRs with different counts (31,788 tests). Each dot corresponds to a separate test. (B) Precision of read recruitment (number of true recruited reads/number of recruited reads).
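
The recall and precision definitions in this caption can be illustrated with a minimal sketch. The function name and the read identifiers are hypothetical and do not come from adVNTR; only the two ratios follow the caption.

```python
def recruitment_metrics(true_reads, recruited_reads):
    """Return (recall, precision) for a single read-recruitment test.

    recall    = true recruited reads / true reads
    precision = true recruited reads / recruited reads
    """
    true_reads = set(true_reads)
    recruited_reads = set(recruited_reads)
    true_recruited = true_reads & recruited_reads

    # Guard against empty sets so the ratios stay well defined.
    recall = len(true_recruited) / len(true_reads) if true_reads else float("nan")
    precision = (
        len(true_recruited) / len(recruited_reads) if recruited_reads else float("nan")
    )
    return recall, precision


# Hypothetical read IDs: four reads truly come from the VNTR, five were recruited.
recall, precision = recruitment_metrics(
    {"r1", "r2", "r3", "r4"},
    {"r2", "r3", "r4", "r9", "r10"},
)
print(recall, precision)
```

Here three of the four true reads are recruited, and three of the five recruited reads are true.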

**Figure 2.**
VNTR genotyping using PacBio data. (A) RU count estimation on simulated PacBio reads as a function of RU count and coverage for three medically relevant VNTRs: *INS* (RU length 14 bp), *CSTB* (12 bp), and *HIC1* (70 bp). adVNTR performance is compared to a naïve method. (B) The effect of RU length on count accuracy over 2944 VNTRs (30,418 tests). (C) Mendelian consistency of genotypes at four VNTR loci in the Chinese Han and Ashkenazi trios. Note that *MAOA* results are consistent with its location on Chr X. (D) LR-PCR–based validation of genotypes at five disease-linked VNTRs in NA12878. Red arrows correspond to VNTR lengths estimated by multiplying predicted RU counts with RU lengths. (E) Fraction of consistent calls and number of calls across 2944 VNTRs in Ashkenazi Jew (AJ) and Chinese trios from GIAB and NCBI SRA. (F) Fraction of consistent calls allowing for off-by-one errors.
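
As an illustration of the trio checks in panels C, E, and F, the following is a minimal sketch of a Mendelian consistency test on RU-count genotypes. It assumes an autosomal diploid locus, so it does not cover the Chr X case noted for *MAOA*. The function name and the `tolerance` parameter are hypothetical, and treating "off-by-one" as a per-allele tolerance of one RU is an assumption rather than the paper's stated procedure.

```python
def mendelian_consistent(child, mother, father, tolerance=0):
    """Check whether a child's RU-count genotype is compatible with the parents.

    Each genotype is a pair of RU counts. The child is consistent if one
    allele can be matched to the mother and the other to the father, where a
    match allows a difference of at most `tolerance` repeat units.
    """
    def matches(allele, parent):
        return any(abs(allele - p) <= tolerance for p in parent)

    a, b = child
    return (matches(a, mother) and matches(b, father)) or (
        matches(b, mother) and matches(a, father)
    )


# Hypothetical genotypes (RU counts per haplotype).
strict = mendelian_consistent((4, 5), mother=(3, 3), father=(5, 6))
relaxed = mendelian_consistent((4, 5), mother=(3, 3), father=(5, 6), tolerance=1)
print(strict, relaxed)
```

With `tolerance=0` the child allele of 4 RUs has no exact parental match, while `tolerance=1` accepts it as an off-by-one call against the maternal allele of 3 RUs.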

**Figure 3.**
VNTR genotyping using Illumina data. (A–D) Correctness of RU count prediction for 1775 coding VNTRs in the IlluminaSim data set, described by RU count discrepancy (A), haplotypes with correct estimates (B), correctness as a function of VNTR length (C), and RU length (D). (E) Consistency of adVNTR calls on the AJ trio WGS data from GIAB. The red line describes the cumulative number of calls made at specific posterior probability cutoffs. (F) Gel electrophoresis–based validation of adVNTR calls on five short VNTRs using WGS of individual NA12878 from GIAB. The red arrows correspond to VNTR lengths estimated by multiplying the RU lengths with the estimated RU counts.
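
The length estimate behind the red arrows (RU count multiplied by RU length) is simple arithmetic, sketched below. The RU lengths are those given in the Figure 2 caption; the RU counts are hypothetical values chosen for illustration, and the optional `flank_bp` parameter is an assumption added to account for any non-repeat sequence in an amplified product.

```python
def estimated_vntr_length(ru_count, ru_length_bp, flank_bp=0):
    """Estimate the VNTR length in bp as RU count x RU length.

    `flank_bp` is a hypothetical extra term for flanking sequence; it
    defaults to 0 so the result is the plain product described in the caption.
    """
    return ru_count * ru_length_bp + flank_bp


# RU lengths from the Figure 2 caption; RU counts are made up for the example.
ru_length_bp = {"INS": 14, "CSTB": 12, "HIC1": 70}
example_counts = {"INS": 10, "CSTB": 3, "HIC1": 5}

lengths = {
    gene: estimated_vntr_length(example_counts[gene], ru_length_bp[gene])
    for gene in ru_length_bp
}
print(lengths)
```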

**Figure 4.**
Population-scale genotyping of VNTRs. (A) RU count frequencies for the VNTR in the *CCDC66* gene; (B) *CSTB* in African, Asian, and European population samples from The 1000 Genomes Project. RU counts of 4 and higher in *CSTB* are associated with myoclonal epilepsy.

**Figure 5.**
The VNTR HMM. The HMM is composed of three profile HMMs, one each for the left and right flanking unique regions, and one in the middle to match multiple and partial numbers of RUs. The special states U_s (“Unit-Start”) and U_e (“Unit-End”) are used for RU counting. Dotted lines refer to special transitions for partial reads that do not span the entire region.
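
One way the U_s and U_e states could support RU counting is by scanning a decoded state path and counting each U_s that is later closed by a U_e. The sketch below assumes such a path is already available (for example from Viterbi decoding, which is not shown) and is not adVNTR's actual implementation. All state labels other than `U_s` and `U_e` are hypothetical.

```python
def count_repeat_units(state_path, unit_start="U_s", unit_end="U_e"):
    """Count complete repeat units in a decoded HMM state path.

    A unit is counted when a Unit-Start state is followed, before the next
    Unit-Start, by a Unit-End state. A unit that is opened but never closed
    (a read ending mid-repeat) is not counted.
    """
    complete = 0
    inside_unit = False
    for state in state_path:
        if state == unit_start:
            inside_unit = True
        elif state == unit_end and inside_unit:
            complete += 1
            inside_unit = False
    return complete


# Hypothetical path: left flank, two full units, one truncated unit.
path = [
    "L1", "L2",
    "U_s", "M1", "M2", "M3", "U_e",
    "U_s", "M1", "M2", "M3", "U_e",
    "U_s", "M1",
]
n_units = count_repeat_units(path)
print(n_units)
```

In this example the final unit is opened but the path ends before U_e, so only the two complete units are counted; how partial units should contribute to a genotype is left open here.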

**Figure 6.**
Estimates of RU counts using recruited reads. (A) (k₁, k₂, k₃) = (1,3,1); RU count ≥5. (B) (k₁, k₂, k₃) = (0,3,1); RU count ≥4. (C) (k₁, k₂, k₃) = (0,3,0); RU count = 3.
