Nat Biotechnol. 2025 Mar;43(3):431-442.
doi: 10.1038/s41587-024-02225-z. Epub 2024 Apr 26.

Analysis and benchmarking of small and large genomic variants across tandem repeats

Adam C English et al. Nat Biotechnol. 2025 Mar.

Abstract

Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.

Conflict of interest statement

Competing interests: F.J.S. receives research support from Illumina, Genentech, PacBio and ONT. E.D. and M.A.E. are employees and shareholders of PacBio. S.K.M. is an employee and shareholder of ONT. W.D.C. has received free consumables from ONT. The other authors declare no competing interests.

Figures

Fig. 1 | Sequence contexts of TR catalog.
a, Self-organizing map (SOM) built from 4-mer frequencies per region, with hue indicating mean G+C percentage. Dense known pathogenic neighborhoods are annotated with their most common motif using International Union of Pure and Applied Chemistry (IUPAC) codes (S = G|C, W = A|T). b, Number of TR regions per neuron. c, Average percentage of TR region sequence annotated as a homopolymer. d, Intersection of TR regions with the UCSC microsatellite track exposes a neighborhood of microsatellites (top left). e, Visualization of the UCSC segmental duplications track shows clustering in sequence contexts similar to those of SINEs in panel a (top middle). f, Map of TRs intersecting genes. g, Map of TRs overlapping promoters.
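The per-region features named in panel a (4-mer frequencies and G+C percentage) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `kmer_order`, `kmer_frequencies` and `gc_percent` are hypothetical, and the choices to count overlapping 4-mers on one strand and to ignore non-ACGT bases are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_order(k=4):
    """Fixed ordering of all 4**k k-mers over A, C, G, T (hypothetical helper)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_frequencies(seq, k=4):
    """Relative frequency of each overlapping k-mer in seq.

    Assumption: k-mers containing non-ACGT characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    order = kmer_order(k)
    total = sum(counts[m] for m in order)
    return [counts[m] / total if total else 0.0 for m in order]


def gc_percent(seq):
    """G+C percentage over the A/C/G/T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


region = "CAGCAGCAGCAG"  # toy CAG repeat, not taken from the catalog
freqs = kmer_frequencies(region)
print(len(freqs), round(gc_percent(region), 1))
```

A vector like `freqs` (256 values for k = 4) is the kind of input a self-organizing map could be trained on; the SOM training itself is not sketched here.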
Fig. 2 | Location, sequence and length properties of the benchmark’s TR regions.
a, Karyoplot of TR regions. Top: TR regions included in the benchmark (red); bottom: catalog TR regions with HG002 variants that are ≥5 bp in length (blue). b, SOM heatmap representing the percentage of TR catalog regions per neuron contained in the benchmark with respect to Fig. 1b. c, Boxplot of HG002 allele deltas (sum of absolute variant lengths in base pairs) for 93,693 TR regions as a function of motif length (lower quartile, 25th percentile; upper quartile, 75th percentile; center, median; extrema, 1.5 times the interquartile range). In heterozygous regions, the maximum delta is used. d, TR allele delta length per region as a function of motif length. Contractions have a negative delta and expansions have a positive delta. Copy numbers greater than 30 are binned at either end of the histogram.
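As a rough illustration of the allele delta described in panels c and d, the sketch below computes it from (REF, ALT) allele pairs. It assumes each haplotype's variants within a region are available as string pairs; the function names and the toy variants are hypothetical and are not drawn from the benchmark.

```python
def absolute_delta(variants):
    """Sum of absolute variant lengths (bp) on one haplotype, as in panel c.

    variants: list of (ref, alt) allele strings within one TR region.
    """
    return sum(abs(len(alt) - len(ref)) for ref, alt in variants)


def signed_delta(variants):
    """Net length change (bp): negative for contractions, positive for expansions."""
    return sum(len(alt) - len(ref) for ref, alt in variants)


def region_delta(hap1, hap2):
    """Per-region delta; where the haplotypes differ, the maximum is used."""
    return max(absolute_delta(hap1), absolute_delta(hap2))


# Toy example: one haplotype with a 6 bp insertion and a 3 bp deletion,
# the other with a single 3 bp insertion.
hap1 = [("A", "ACAGCAG"), ("TCAG", "T")]
hap2 = [("A", "ACAG")]
print(region_delta(hap1, hap2), signed_delta(hap1), signed_delta(hap2))
```

Both an absolute and a signed form are shown because the caption describes the delta as a sum of absolute lengths in panel c and as signed (contraction versus expansion) in panel d.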
Fig. 3 | Benchmarking pipeline performance.
a,b, Size regime performance metrics for comparison tools (RTG vcfeval, Truvari bench and Truvari refine) on the HG002 TR benchmark against the alignment replicate for Tier1 (a) and Tier2 (b) regions. c, Pipeline schematic of Truvari operations for comparing sequence-resolved variants to the TR benchmark. Top: three commands for creating a benchmarking result and stratification report; left: illustration of Truvari phab variant harmonization; right: cartoon of the Laytr stratification HTML report.
Fig. 4 | Diversity of TRs over 156 haplotypes at CODIS and known pathogenic loci.
a, Allele delta (sum of variant lengths) across four CODIS loci. HG002’s maternal and paternal alleles are indicated by orange data points. b, Allele delta of nine known pathogenic repeats. c–e, Distribution of haplotypes over CODIS PentaE locus (c) and known pathogenic JPH3 (d) and TCF4 (e) loci. For c–e, each row represents a unique-by-length haplotype (allele) and the y label represents the number of haplotypes with the allele. Blue squares indicate the TR motifs, gray squares indicate the non-motif sequences and white squares are gaps introduced by MSA. Gray squares upstream and downstream of the TR are the buffer sequences of the benchmark’s TR regions. Orange boxes indicate HG002’s maternal (M) and paternal (P) haplotypes (homozygous regions have one box) and green boxes indicate the GRCh38 reference allele (R). The allele count was determined by deduplicating haplotypes by length.
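The last sentence of the caption (allele counts obtained by deduplicating haplotypes by length) can be sketched as follows. This is a minimal sketch, assuming haplotype sequences for one locus are available as plain strings; `allele_counts_by_length` is a hypothetical name and the sequences are toy data.

```python
from collections import Counter


def allele_counts_by_length(haplotypes):
    """Group haplotype sequences by length and count haplotypes per length.

    Sequences of equal length are treated as the same allele even if their
    bases differ, matching the "unique-by-length" description in the caption.
    """
    return Counter(len(h) for h in haplotypes)


# Toy haplotypes for a single locus (not real data).
haps = ["CAGCAG", "CTGCTG", "CAGCAGCAG", "CAG"]
counts = allele_counts_by_length(haps)
for length, n in sorted(counts.items()):
    print(length, n)
```

Each (length, count) pair corresponds to one row of panels c–e, with the count playing the role of the y label.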
