Nucleic Acids Res. 2019 Sep 5;47(15):e90.
doi: 10.1093/nar/gkz501.

Profiling the genome-wide landscape of tandem repeat expansions

Nima Mousavi et al. Nucleic Acids Res. 2019.

Abstract

Tandem repeat (TR) expansions have been implicated in dozens of genetic diseases, including Huntington's Disease, Fragile X Syndrome, and hereditary ataxias. Furthermore, TRs have recently been implicated in a range of complex traits, including gene expression and cancer risk. While the human genome harbors hundreds of thousands of TRs, analysis of TR expansions has been mainly limited to known pathogenic loci. A major challenge is that expanded repeats are beyond the read length of most next-generation sequencing (NGS) datasets and are not profiled by existing genome-wide tools. We present GangSTR, a novel algorithm for genome-wide genotyping of both short and expanded TRs. GangSTR extracts information from paired-end reads into a unified model to estimate maximum likelihood TR lengths. We validate GangSTR on real and simulated data and show that GangSTR outperforms alternative methods in both accuracy and speed. We apply GangSTR to a deeply sequenced trio to profile the landscape of TR expansions in a healthy family and validate novel expansions using orthogonal technologies. Our analysis reveals that healthy individuals harbor dozens of long TR alleles not captured by current genome-wide methods. GangSTR will likely enable discovery of novel disease-associated variants not currently accessible from NGS.

Figures

Figure 1.
Schematic of GangSTR method. Paired-end reads from an input set of alignments are separated into various read classes, each of which provides information about the length of the TR in the region. This information is used to find the maximum likelihood diploid genotype and confidence interval on the repeat length. Results are reported in a VCF file.
Figure 2.
Four classes of informative read pairs. (A) Enclosing class: characteristic n corresponds to the number of repeat copies enclosed in the read. (B) n is modeled for different repeat lengths, accounting for errors introduced during PCR. (C) Spanning class: characteristic Δ denotes the observed fragment length for a read pair. (D) Δ is modeled for different repeat lengths. Longer repeats give shorter observed fragment lengths. The red vertical dashed line gives the mean actual fragment length. (E) Fully Repetitive Read (FRR) class: characteristic Ω is the distance of the non-repetitive read from the repeat region. (F) Ω is modeled for different repeat lengths. Longer repeats give shorter observed Ω values. (G) Flanking class: characteristic k shows the number of copies extracted from the flanking read. (H) k is modeled for different repeat lengths. S1 and S2 give the start coordinates of each read in the pair relative to the beginning of the first flanking region. In panels A, C, E and G, the symbol F denotes the length (bp) of the flanking region, and the repeat is L bp long (A copies of a repeat unit of length m). In panels B, D, F and H, each color denotes a different underlying repeat length (blue = 10 copies, green = 20 copies, red = 40 copies, purple = 60 copies, gold = 80 copies, light blue = 200 copies).
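
The spanning-class relationship in panels C and D can be illustrated with a small numerical sketch. This is not GangSTR's implementation: it is a single-allele simplification that assumes observed fragment lengths are normally distributed around the library mean minus the extra repeat sequence relative to the reference, and it finds the copy number by grid search. All function names, parameter names and numeric values are hypothetical.

```python
import math


def spanning_log_likelihood(deltas, copies, ref_copies, motif_len, frag_mean, frag_sd):
    """Log-likelihood of observed fragment lengths for one allele.

    Assumption (for illustration only): each observed length is drawn from
    Normal(frag_mean - motif_len * (copies - ref_copies), frag_sd).
    """
    # Extra copies relative to the reference shrink the observed fragment length.
    expected = frag_mean - motif_len * (copies - ref_copies)
    norm_const = math.log(frag_sd * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((d - expected) / frag_sd) ** 2 - norm_const for d in deltas)


def ml_copies(deltas, ref_copies, motif_len, frag_mean, frag_sd, max_copies=200):
    """Grid search for the copy number that maximizes the log-likelihood."""
    return max(
        range(1, max_copies + 1),
        key=lambda a: spanning_log_likelihood(
            deltas, a, ref_copies, motif_len, frag_mean, frag_sd
        ),
    )


# Hypothetical library: mean fragment 500 bp, sd 50 bp.
# Hypothetical locus: the reference carries 20 copies of a 3 bp motif.
observed = [440, 436, 444, 441, 439]
best = ml_copies(observed, ref_copies=20, motif_len=3, frag_mean=500, frag_sd=50)
```

With these made-up inputs the observed lengths average 440 bp, 60 bp short of the library mean, so the grid search settles on 40 copies (20 more than the reference). A diploid version would mix two such allele-specific distributions, and the full method described in the caption combines all four read classes rather than this one alone.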
Figure 3.
Evaluation of TR genotypers on real and simulated data at pathogenic repeat expansions. (A) Root mean square error (RMSE) for each simulated locus. HTT = Huntington’s Disease; SCA = spinocerebellar ataxia; DM = Myotonic Dystrophy; C9ORF72 = amyotrophic lateral sclerosis/frontotemporal dementia; FMR1 = Fragile X Syndrome. TRs are sorted from left to right by ascending length of the pathogenic allele. The motif for each locus is specified in parentheses. (B) Comparison of true versus estimated repeat number for each simulated genotype for SCA1. Gray dashed line gives the diagonal. (C) Comparison of true versus estimated repeat number for each simulated genotype for SCA8. (D) Comparison of true versus estimated repeat number for HTT using real WGS data. (E) Comparison of true versus estimated repeat number for FMR1 using real WGS data. In all panels, red = GangSTR; blue = ExpansionHunter; black = Tredparse.
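
Panel A summarizes accuracy as RMSE (root mean square error) between true and estimated repeat numbers. A minimal sketch of that metric follows; the function name and the example values are hypothetical and are not taken from the paper's data.

```python
import math


def rmse(true_copies, estimated_copies):
    """Root mean square error between true and estimated repeat copy numbers."""
    if len(true_copies) != len(estimated_copies) or not true_copies:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - e) ** 2 for t, e in zip(true_copies, estimated_copies)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: four simulated alleles of 10 copies, two estimated exactly.
example_rmse = rmse([10, 10, 10, 10], [13, 14, 10, 10])
```

Because errors are squared before averaging, one badly missed expansion raises the RMSE more than several small misses of the same total size.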
Figure 4.
Genome-wide TR genotyping. (A) Composition of TRs in the hg19 reference genome. The x-axis gives the motif length and the y-axis (log10 scale) gives the number of TRs in the genome. Colored bars represent TRs overlapping various genomic annotations (blue = coding, orange = 5’ UTR, green = 3’ UTR, red = intronic, purple = intergenic). (B) Mendelian inheritance of GangSTR genotypes in a CEU trio as a function of the number of informative read pairs. Colors denote repeat lengths. Solid lines give mean Mendelian inheritance rate across all TRs, computed based on 95% confidence intervals as described in Methods. Dashed lines are computed after excluding loci where all three samples were homozygous for the reference allele. (C) Overlap between TRs genotyped by HipSTR and GangSTR. (D) Comparison of HipSTR and GangSTR genotypes. The x-axis and y-axis show the sum of the two allele lengths genotyped by HipSTR and GangSTR in bp relative to the hg19 reference genome (dosage), respectively. The size of the bubble represents the number of points at that coordinate.
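
Two quantities in this caption reduce to simple arithmetic: Mendelian consistency in a trio (panel B) and dosage (panel D). The sketch below is illustrative only. It checks consistency on point-estimate genotypes, whereas the caption states that the paper's rate is computed from 95% confidence intervals, and it reads dosage as the summed bp difference of both alleles from the reference. Function names and example genotypes are hypothetical.

```python
def is_mendelian(child, mother, father):
    """True if one child allele can come from the mother and the other from the father.

    Genotypes are pairs of repeat copy numbers, e.g. (10, 12).
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def dosage_bp(genotype, ref_copies, motif_len):
    """Sum of both allele lengths in bp, relative to the reference allele length."""
    return sum((copies - ref_copies) * motif_len for copies in genotype)


# Made-up trio at a locus with a 3 bp motif and 10 reference copies.
consistent = is_mendelian(child=(10, 12), mother=(10, 11), father=(12, 12))
inconsistent = is_mendelian(child=(10, 13), mother=(10, 11), father=(12, 12))
child_dosage = dosage_bp((12, 15), ref_copies=10, motif_len=3)
```

In this example the first child genotype is consistent, the second is not because neither parent carries a 13-copy allele, and the dosage of (12, 15) is (2 + 5) × 3 = 21 bp.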
Figure 5.
Discovery and validation of genome-wide TR expansions. (A) Comparison of STRetch and GangSTR estimated repeat lengths. The x-axis shows the estimated repeat number returned by STRetch. The y-axis shows the estimated repeat number of the longer of the two alleles reported as the maximum likelihood genotype by GangSTR. Only TRs called by both tools and passing all GangSTR filters are shown. The gray dashed line shows the diagonal. (B) Example sequence at a candidate TR expansion. The reference sequence and representative reads from PacBio (top) and ONT (bottom) for NA12878 are shown for a locus where GangSTR predicted a 48 bp expansion from the reference genome. Instances of the repeat motif are shown in red. (C, D) For each of the TRs shown, left plots compare GangSTR genotypes to those predicted by long reads. Red dots give the maximum likelihood repeat lengths predicted by GangSTR and red lines give the 95% confidence intervals for each allele. Black histograms give the distribution of repeat lengths supported by PacBio (top) and ONT (bottom) reads. The black arrow denotes the length in hg19. The right plots show PCR product sizes for each TR as estimated using capillary electrophoresis. Left bands show the ladder and right bands show product sizes in NA12878. Green and purple bands show the lower and upper limits of the ladder, respectively. Red arrows and numbers give product sizes expected for the two alleles called by GangSTR.
