Genome Res. 2018 Nov;28(11):1709-1719.
doi: 10.1101/gr.235119.118. Epub 2018 Oct 23.

Targeted genotyping of variable number tandem repeats with adVNTR

Mehrdad Bakhtiari et al. Genome Res. 2018 Nov.

Abstract

Whole-genome sequencing is increasingly used to identify Mendelian variants in clinical pipelines. These pipelines focus on single-nucleotide variants (SNVs) and structural variants, while ignoring more complex repeat sequence variants. Here, we consider the problem of genotyping variable number tandem repeats (VNTRs), which are composed of inexact tandem duplications of short (6-100 bp) repeating units. VNTRs span 3% of the human genome, are frequently present in coding regions, and have been implicated in multiple Mendelian disorders. Although existing tools recognize VNTR-carrying sequence, genotyping VNTRs (determining repeat unit count and sequence variation) from whole-genome sequencing reads remains challenging. We describe a method, adVNTR, that uses hidden Markov models to model each VNTR, count repeat units, and detect sequence variation. adVNTR models can be developed for short-read (Illumina) and single-molecule (Pacific Biosciences [PacBio]) whole-genome and whole-exome sequencing, and show good results on multiple simulated and real data sets.

Figures

Figure 1.
Read recruitment quality on Illumina reads. (A) Comparison of the recall (number of true recruited reads/number of true reads) of adVNTR read recruitment against BWA-MEM and Bowtie 2, as a function of VNTR length for 1775 VNTRs with different counts (31,788 tests). Each dot corresponds to a separate test. (B) Precision of read recruitment (number of true recruited reads/number of recruited reads).
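The two ratios defined in this caption can be computed directly from read identifiers. The following is a minimal sketch, not adVNTR's own evaluation code; the function name and the read IDs are hypothetical and used only for illustration.

```python
def recruitment_metrics(true_reads, recruited_reads):
    """Return (recall, precision) for one read-recruitment test.

    recall    = true recruited reads / true reads
    precision = true recruited reads / recruited reads
    Both inputs are iterables of read identifiers (hypothetical here).
    """
    true_set = set(true_reads)
    recruited_set = set(recruited_reads)
    true_recruited = true_set & recruited_set

    # Guard against empty sets so the ratios stay defined.
    recall = len(true_recruited) / len(true_set) if true_set else 0.0
    precision = len(true_recruited) / len(recruited_set) if recruited_set else 0.0
    return recall, precision


# Four reads truly come from the VNTR; five reads were recruited, three correctly.
recall, precision = recruitment_metrics(
    ["r1", "r2", "r3", "r4"],
    ["r1", "r2", "r3", "r7", "r8"],
)
print(recall, precision)
```

In this toy case, recall is 3/4 and precision is 3/5.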
Figure 2.
VNTR genotyping using PacBio data. (A) RU count estimation on simulated PacBio reads as a function of RU count and coverage for three medically relevant VNTRs: INS (RU length 14 bp), CSTB (12 bp), and HIC1 (70 bp). adVNTR performance is compared to a naïve method. (B) The effect of RU length on count accuracy over 2944 VNTRs (30,418 tests). (C) Mendelian consistency of genotypes at four VNTR loci in the Chinese Han and Ashkenazi trios. Note that MAOA results are consistent with its location on Chr X. (D) LR-PCR–based validation of genotypes at five disease-linked VNTRs in NA12878. Red arrows correspond to VNTR lengths estimated by multiplying predicted RU counts by RU lengths. (E) Fraction of consistent calls and number of calls across 2944 VNTRs in Ashkenazi Jewish (AJ) and Chinese trios from GIAB and NCBI SRA. (F) Fraction of consistent calls allowing for off-by-one errors.
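A Mendelian consistency check of the kind summarized in panels C, E, and F could look like the sketch below. It assumes an autosomal, diploid locus with each genotype given as a pair of RU counts, and it treats the off-by-one allowance as a per-allele tolerance. Both are illustrative assumptions rather than the paper's exact definition, and the function names are hypothetical. A Chr X locus such as MAOA would need separate handling for male children.

```python
def is_mendelian_consistent(child, mother, father, tolerance=0):
    """Check whether a child's RU-count genotype fits its parents' genotypes.

    Each genotype is a pair of RU counts. The child is consistent if one
    allele can be assigned to the mother and the other to the father, with
    each match allowed to differ by at most `tolerance` repeat units.
    """
    def matches(allele, parent):
        return any(abs(allele - p) <= tolerance for p in parent)

    c1, c2 = child
    return (matches(c1, mother) and matches(c2, father)) or (
        matches(c1, father) and matches(c2, mother)
    )


def consistent_fraction(trios, tolerance=0):
    """Fraction of (child, mother, father) genotype triples that are consistent."""
    calls = [is_mendelian_consistent(*trio, tolerance=tolerance) for trio in trios]
    return sum(calls) / len(calls) if calls else 0.0


# Hypothetical genotypes: the second trio is consistent only with off-by-one slack.
trios = [
    ((3, 5), (3, 4), (5, 5)),
    ((3, 6), (3, 4), (5, 5)),
]
print(consistent_fraction(trios))
print(consistent_fraction(trios, tolerance=1))
```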
Figure 3.
VNTR genotyping using Illumina data. (A–D) Correctness of RU count prediction for 1775 coding VNTRs in the IlluminaSim data set, described by RU count discrepancy (A), haplotypes with correct estimates (B), correctness as a function of VNTR length (C), and RU length (D). (E) Consistency of adVNTR calls on the AJ trio WGS data from GIAB. The red line describes the cumulative number of calls made at specific posterior probability cutoffs. (F) Gel electrophoresis–based validation of adVNTR calls on five short VNTRs using WGS of individual NA12878 from GIAB. The red arrows correspond to VNTR lengths estimated by multiplying the RU lengths by the estimated RU counts.
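The length estimate behind the red arrows is a simple product of RU length and RU count. The sketch below shows that arithmetic for a diploid genotype. The function name is hypothetical, the 12 bp RU length is the value quoted for CSTB in Figure 2, and the RU counts are made up for illustration. Flanking sequence is ignored.

```python
def estimated_vntr_lengths(ru_counts, ru_length_bp):
    """Estimate VNTR lengths (bp) as RU count times RU length, per allele."""
    if ru_length_bp <= 0:
        raise ValueError("RU length must be positive")
    return tuple(count * ru_length_bp for count in ru_counts)


# Hypothetical genotype of 2 and 3 repeat units at a locus with a 12 bp RU.
print(estimated_vntr_lengths((2, 3), 12))
```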
Figure 4.
Population-scale genotyping of VNTRs. RU count frequencies for the VNTRs in (A) the CCDC66 gene and (B) CSTB, in African, Asian, and European population samples from The 1000 Genomes Project. RU counts of 4 and higher in CSTB are associated with myoclonal epilepsy.
Figure 5.
The VNTR HMM. The HMM is composed of three profile HMMs: one each for the left and right flanking unique regions, and one in the middle to match multiple and partial numbers of RUs. The special states Us (“Unit-Start”) and Ue (“Unit-End”) are used for RU counting. Dotted lines refer to special transitions for partial reads that do not span the entire region.
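The role of the Us and Ue states can be illustrated with a heavily simplified topology. The sketch below is not the adVNTR model: it collapses each flank into a single state, uses match-only states for the repeat unit (no insertions or deletions), omits the partial-read transitions, and carries no probabilities. The state names other than Us and Ue, and both function names, are hypothetical. It shows only how counting visits to Ue along a decoded state path yields an RU count.

```python
def build_vntr_topology(ru_len):
    """Allowed transitions of a simplified, match-only VNTR model.

    LEFT and RIGHT stand in for the flanking profile HMMs; M1..M<ru_len>
    are the match states of one repeat unit, bracketed by Us and Ue.
    """
    transitions = {
        "LEFT": {"LEFT", "Us"},
        "Us": {"M1"},
        "Ue": {"Us", "RIGHT"},  # loop back for another unit, or exit
        "RIGHT": {"RIGHT"},
    }
    for i in range(1, ru_len + 1):
        next_state = f"M{i + 1}" if i < ru_len else "Ue"
        transitions[f"M{i}"] = {next_state}
    return transitions


def count_repeat_units(path, transitions):
    """Validate a decoded state path and count complete repeat units."""
    for current, following in zip(path, path[1:]):
        if following not in transitions.get(current, set()):
            raise ValueError(f"illegal transition {current} -> {following}")
    return path.count("Ue")


topology = build_vntr_topology(ru_len=3)
unit = ["Us", "M1", "M2", "M3", "Ue"]
path = ["LEFT", "LEFT"] + unit + unit + ["RIGHT"]
print(count_repeat_units(path, topology))
```

In a full model the path would come from Viterbi decoding of a read against the HMM. Here it is written out by hand to keep the example short.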
Figure 6.
Estimates of RU counts using recruited reads. (A) (k1, k2, k3) = (1,3,1); RU count = 5. (B) (k1, k2, k3) = (0,3,1); RU count = 4. (C) (k1, k2, k3) = (0,3,0); RU count = 3.
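In all three captioned cases the RU count equals k1 + k2 + k3. The sketch below only reproduces that arithmetic. The caption does not define what each component represents, so the code makes no assumption beyond the sum, and the function name is hypothetical.

```python
def ru_count_from_components(k1, k2, k3):
    """Combine the three per-read components into a single RU count."""
    return k1 + k2 + k3


# The three cases listed in the caption.
for components in [(1, 3, 1), (0, 3, 1), (0, 3, 0)]:
    print(components, ru_count_from_components(*components))
```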
