Commun Biol. 2025 Oct 7;8(1):1437.
doi: 10.1038/s42003-025-08837-8.

Genotyping short tandem repeats across copy number alterations, aneuploidies, and polyploid organisms

Max A Verbiest et al. Commun Biol. 2025.

Abstract

Short tandem repeats (STRs) are a rich source of genetic variation, but are difficult to genotype. While specialized repeat variant callers exist, they typically assume a euploid human genome. This means recent findings regarding phenotypic effects of STR variants in human health and disease cannot be readily extended to polyploid organisms or cancer, which is characterised by copy number alterations (CNAs). Here we present ConSTRain, a novel STR variant caller that explicitly accounts for the copy number of loci in its genotyping approach. We benchmark ConSTRain using a euploid human 100X whole genome sequencing sample where it calls STR allele lengths for over 1.7 × 10^6 loci in under 20 minutes with an accuracy of 98.28%. Subsequently, we show that ConSTRain resolves complex STR genotypes in an artificial trisomy 21 sample and a polyploid Dwarf Cavendish banana harbouring a large duplication. Finally, we analyse a microsatellite instable colorectal cancer tumoroid, where ConSTRain tackles CNAs and whole-genome duplications. ConSTRain is the first STR variant caller that allows for the investigation of repeats affected by CNAs, aneuploidies, and polyploid genomes. This unlocks the investigation of STRs across a wide range of contexts and organisms where they previously could not be easily studied.

Conflict of interest statement

Competing interests: M.A. is an Editorial Board Member for Communications Biology, but was not involved in the editorial review of, nor the decision to publish this article. The other authors declare no competing interests.

Figures

Fig. 1. ConSTRain performance on Q100 benchmark.
A Distribution of normalised sequencing depth observed by ConSTRain across 167114 repeat loci in the 100X HG002 WGS sample. The x-axis shows the sequencing depth normalised by the copy number of repeat loci. The left y-axis shows the accuracy of allele length calls (blue line and dots). The right y-axis shows the proportion of loci (grey histogram). Note: only normalised depth values between 0 and 60 are shown for visual clarity. B Accuracy of unfiltered and filtered ConSTRain STR allele length calls for 100X WGS of HG002, as well as for the same sample downsampled to 30X and 10X depth of coverage. Note: y-axis starts at 0.75.
Fig. 2. Genotyping STRs in a triploid M. acuminata sample with a large duplication on chr02.
A Consistency of STR genotypes between the HiSeq1500 and NextSeq500 samples for different normalised depth filtering thresholds. X-axis: STR period, y-axis: proportion of loci for which the inferred genotype matched exactly between the two alignments. B Distributions of the depth of coverage for STR loci normalised by copy number for STRs in the alignment of combined HiSeq1500 and NextSeq500 reads. The blue distribution shows the normalised depths for loci not affected by CNAs. The orange distribution shows the normalised depth reported for loci in the chr02 duplication when CNA information was not provided to ConSTRain. The green distribution shows normalised depth for the loci in the chr02 duplication when CNA information was provided to ConSTRain. Vertical dashed lines indicate filtering bounds that exclude the 2.5% of loci with the highest and the 2.5% of loci with the lowest depth of coverage in the overall sample.
Fig. 3. Pairwise STR-based distances between four samples stemming from the same patient-derived tumoroid.
Each cell represents the comparison between two samples, with the colour and value of cells indicating the normalised distances between samples (average difference in allele length per locus).
Fig. 4. ConSTRain overview and example.
(1) An STR locus is loaded from the input files. The locus reference information is parsed from the STR panel. The STR copy number is set based on the karyotype, and optionally updated if the STR is affected by a CNA. (2) Reads that completely span the STR region are extracted from the alignment file, and the length of the STR region in each read is determined. (3) The observed distribution is sorted, and at most as many allele lengths as the STR copy number are kept. (4) This yields the final observed allele length distribution. (5) Next, all possible genotypes are generated for the STR copy number and stored in matrix G. (6) From G, the matrix D is generated by multiplying it with the total number of mapped reads (51 in the example) divided by the STR copy number (3 in the example). Each row in D corresponds to the expected allele length distribution of one of the genotypes in G. (7) The expected distribution with the lowest error to the observed distribution is found by taking the absolute difference between each row in D and the observed distribution, then (8) taking the sum of rows and finding the one with the lowest value. (9) The genotype in G with the lowest error is selected (10) and reported in the output. The inferred genotype of the STR locus in this example consists of an allele of 4 CAG units (present once), an allele of 5 CAG units (present once), and an allele of 8 CAG units (also present once).
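Steps (3) to (10) of this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `call_genotype` is hypothetical, the per-read allele lengths are passed in directly rather than extracted from an alignment file, the read total is assumed to be the number of reads left after step (3), and ties between equally good genotypes are resolved arbitrarily by taking the first one found.

```python
from collections import Counter
from itertools import combinations_with_replacement


def call_genotype(read_lengths, copy_number):
    """Pick the genotype whose expected read distribution is closest
    (sum of absolute differences) to the observed one."""
    if copy_number < 1 or not read_lengths:
        return None

    # Steps 3-4: keep at most `copy_number` allele lengths, best supported first.
    kept = dict(Counter(read_lengths).most_common(copy_number))
    alleles = sorted(kept)
    observed = [kept[a] for a in alleles]

    # Step 6: expected number of reads contributed by one copy of the locus.
    reads_per_copy = sum(observed) / copy_number

    best, best_error = None, float("inf")
    # Step 5: every multiset of `copy_number` alleles is a candidate genotype (a row of G).
    for genotype in combinations_with_replacement(alleles, copy_number):
        g_row = [genotype.count(a) for a in alleles]
        d_row = [g * reads_per_copy for g in g_row]  # corresponding row of D
        # Steps 7-8: absolute difference to the observed distribution, summed per row.
        error = sum(abs(d - o) for d, o in zip(d_row, observed))
        if error < best_error:  # Step 9: keep the lowest-error genotype
            best, best_error = genotype, error
    return best


# 51 spanning reads at a locus with copy number 3, as in the caption's example;
# the split of reads across the three allele lengths is made up for illustration.
reads = [4] * 16 + [5] * 18 + [8] * 17
genotype = call_genotype(reads, copy_number=3)
```

With these made-up read counts each allele is expected to contribute 17 reads, so the genotype with one copy each of 4, 5, and 8 repeat units has the lowest error, matching the outcome described in the caption.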

