Nat Biotechnol. 2024 Oct;42(10):1606-1614.
doi: 10.1038/s41587-023-02057-3. Epub 2024 Jan 2.

Characterization and visualization of tandem repeats at genome scale

Egor Dolzhenko et al. Nat Biotechnol. 2024 Oct.

Abstract

Tandem repeat (TR) variation is associated with gene expression changes and numerous rare monogenic diseases. Although long-read sequencing provides accurate full-length sequences and methylation of TRs, there is still a need for computational methods to profile TRs across the genome. Here we introduce the Tandem Repeat Genotyping Tool (TRGT) and an accompanying TR database. TRGT determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It also reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool. Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. In six samples with known repeat expansions, TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution than existing methods. Additionally, we released a database with allele sequences and methylation levels for 937,122 TRs across 100 genomes.
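The Mendelian concordance figure above can be made concrete with a small sketch. This is not TRGT's implementation: it assumes each genotype is reduced to a pair of allele lengths in repeat units, ignores sex chromosomes, and treats a child as consistent when one allele matches the father and the other matches the mother within a tolerance of one repeat unit. The names `is_mendelian_consistent` and `mendelian_concordance` are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed representation: a diploid genotype as two allele lengths in repeat units.
Genotype = Tuple[int, int]


def is_mendelian_consistent(child: Genotype, father: Genotype,
                            mother: Genotype, tolerance: int = 1) -> bool:
    """Return True if the child's alleles can be explained by the parents.

    One child allele must lie within `tolerance` repeat units of some paternal
    allele and the other within `tolerance` of some maternal allele.
    """
    def inherited(allele: int, parent: Genotype) -> bool:
        return any(abs(allele - p) <= tolerance for p in parent)

    c1, c2 = child
    # Try both assignments of the child's alleles to the two parents.
    return (inherited(c1, father) and inherited(c2, mother)) or (
        inherited(c2, father) and inherited(c1, mother)
    )


def mendelian_concordance(trios: Iterable[Tuple[Genotype, Genotype, Genotype]],
                          tolerance: int = 1) -> float:
    """Fraction of (child, father, mother) trios that are consistent."""
    trios = list(trios)
    if not trios:
        raise ValueError("no trios given")
    consistent = sum(
        is_mendelian_consistent(c, f, m, tolerance) for c, f, m in trios
    )
    return consistent / len(trios)
```

With `tolerance=0` the check demands exact length matches; the default of 1 mirrors the "single repeat unit difference" allowance described in the abstract.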

Conflict of interest statement

Competing interests

E.D., G.D.S.B., T.M., W.J.R., C.K., Z.K., K.P.C., A.W. and M.A.E. are employees and shareholders of Pacific Biosciences. F.J.S. received research support from Illumina, Pacific Biosciences, Nanopore and Genentech. The remaining authors declare no competing interests.

Figures

Fig. 1. An overview of TRGT and TRVZ.
a, Input to TRGT consists of HiFi reads and a list of repeat definitions. b, TRGT determines consensus repeat alleles. c, TRGT uses the pre-specified structure of the TR region to locate individual motif copies in each repeat allele. d, More complex repeat regions are specified with HMMs. e, Overview of key fields in TRGT’s output. f, TRVZ generates plots that display repeat alleles and reads aligning to them, with optional methylation.
Fig. 2. TRGT benchmarks.
a, Examples of a consistent genotype, an off-by-one error and a larger error. b, A histogram stratifying the distribution of Mendelian errors by motif length. c, Edit distances between repeat alleles estimated by TRGT and an HG002 genome assembly. d, The proportion of the expanded FMR1 repeat distribution captured by TRGT’s size intervals from subsampled 500-fold depth NoAmp targeted sequence data, using n = 100 replicates for each depth (the center line is at the median; the box extends from the first to the third quartile; and the whiskers extend to the farthest data point within 1.5× the interquartile range from the box). e, Density of TRGT’s size intervals.
Fig. 3. Genetic and epigenetic variation of n = 937,122 TR regions across 100 HPRC samples.
a, Distribution of length polymorphism scores defined as the number of alleles of distinct length per 100 samples. b, Distribution of allele CPSs. c, Length and composition z-scores for known pathogenic repeats. d, Distribution of allele mean methylation levels stratified by CpG density (the center line is at the median; the box extends from the first to the third quartile; and the whiskers extend to the farthest data point within 1.5× of the interquartile range from the box). e, Mean methylation levels of TRs overlapping CpG islands.
Fig. 4. Genetic variation of RFC1 repeat alleles.
a, An HMM representing the population structure of the RFC1 TR derived from a priori known motifs. b, A TRVZ plot depicting both alleles of the RFC1 repeat in the HG04228 sample. c, A heat map depicting the span of each motif (columns) on each allele (rows); each cluster of alleles is associated with the color of its dominant motif. d, An example allele from each cluster. e, Lengths of alleles belonging to each cluster.
Fig. 5. Genetic and epigenetic variation of FMR1 repeat.
a, Distribution of FMR1 allele sizes in 100 HPRC samples. b,c, TRVZ plots of FMR1 repeat in the HG04184 (b) and HG00438 (c) samples, respectively, showing premutation alleles. d, TRVZ plot of FMR1 repeat in the HG01099 male sample displaying CpG methylation. e, Distribution of median methylation levels for HG01099 reads spanning FMR1 repeat. f, Distributions of median methylation levels for FMR1 reads across all male samples. g, TRVZ plot of FMR1 repeat in HG03831 female sample displaying CpG methylation. h, Distribution of median methylation levels for HG03831 reads spanning FMR1 repeat. i, Distributions of median methylation levels for FMR1 reads across all female samples. j, Premutation repeat allele from a prefrontal cortex sample from a female donor (short allele not shown). k, Premutation repeat allele from a prefrontal cortex sample from a male donor. l, Fully expanded repeat allele from a prefrontal cortex sample from a male donor. m, Methylation profile of prefrontal cortex samples.
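
The length polymorphism score in Fig. 3a is described as the number of alleles of distinct length per 100 samples. A minimal sketch of that definition follows. It assumes allele lengths are given in base pairs, pooled over all samples at one TR locus, and that the count is scaled to 100 samples when the cohort has a different size; the scaling and the name `length_polymorphism_score` are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def length_polymorphism_score(allele_lengths: Sequence[int],
                              n_samples: int) -> float:
    """Distinct allele lengths at one locus, expressed per 100 samples.

    allele_lengths: lengths of all alleles observed at the locus, pooled
        across samples (two per diploid sample on autosomes).
    n_samples: number of samples contributing alleles.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    distinct = len(set(allele_lengths))
    # Assumed normalization: for a 100-sample cohort this is the raw count.
    return 100.0 * distinct / n_samples


# Four hypothetical samples carrying three distinct allele lengths.
lengths = [30, 30, 30, 33, 30, 36, 33, 30]
score = length_polymorphism_score(lengths, n_samples=4)  # 75.0
```

For a cohort of exactly 100 samples, as in the figure, the score reduces to the plain count of distinct allele lengths.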
