This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Sep 17:2025.09.14.676153.

doi: 10.1101/2025.09.14.676153.

Systematic evaluation of the impact of promoter proximal short tandem repeats on expression

Affiliations

¹ Department of Medicine, University of California, San Diego, La Jolla, California 92093, USA.
² Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA.
³ HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, USA.
⁴ Department of Pediatrics, University of California, San Diego, La Jolla, California 92093, USA.

PMID: 41001006
PMCID: PMC12458280

Xuan Zhang et al. bioRxiv. 2025.

Abstract

Genetic variation at thousands of short tandem repeats (STRs), which consist of consecutive repeated sequences of 1-6 bp, has been statistically associated with gene expression and other molecular phenotypes in humans. However, the causality and regulatory mechanisms for most of these STRs remain unknown. Massively parallel reporter assays (MPRAs) enable testing the regulatory activity of a large number of synthesized variants, but have not been applied to STRs due to experimental and computational challenges. Here, we optimized an MPRA framework based on random barcoding to study the impact of variation in repeat copy number on expression. We first performed an MPRA on sequences derived from 30,516 promoter-proximal STR loci along with up to 152 bp of genomic context, testing 3-4 variants with differing repeat copy numbers for each locus in HEK293T cells. We identified 1,366 loci with significant associations between repeat copy number and expression, which were enriched for positive effect sizes (P=2.08e-110). We then designed a second MPRA in which we performed deeper perturbations, including systematic manipulation of the repeat unit sequence, orientation, and copy number, with 200-300 perturbations for each of the 300 loci with the strongest signals. Our results revealed that the repeat unit sequence is the primary driver of differences in the relationship between copy number and expression across loci, whereas orientation and flanking sequence have weaker effects, primarily for AT-rich repeat units. The high resolution of these perturbations enabled us to detect non-linear effects, most notably for AAAC/GTTT repeats, which emerge only beyond a certain copy number threshold. Finally, we observed that a subset of STRs in our library show expression levels that are tightly linked with predicted DNA secondary structure formation. We repeated our perturbation MPRA in HeLa S3 cells under wildtype and RNase H1 knockdown conditions; the knockdown is expected to hinder resolution of R-loops by reducing RNase H1 activity. This demonstrated that associations between copy number and expression at G-quadruplex-forming CCCCG/CGGGG repeats are particularly sensitive to loss of RNase H1, providing support for an R-loop mediated mechanism for these repeats. Altogether, we establish STRs as a critical component of the non-coding regulatory grammar and provide a framework for understanding how this dynamic form of genetic variation shapes gene expression.

Conflict of interest statement

Nothing to report.

Figures

**Figure 1 |. An MPRA framework to evaluate the effect of STRs on gene expression.**
**(a) Schematic overview of MPRA library construction and experimental workflow.** The upper left box illustrates how STR-containing loci were chosen within 3.6 kb of TSSs for human genes. A total of 3–4 variants with different STR copy numbers (each red box=1 repeat unit) plus flanking genomic regions (blue) at each locus were synthesized, with an additional filler region (orange) to ensure equal-length oligonucleotides. The workflow includes (1) synthesis of STR-containing oligonucleotides plus addition of random barcodes (BCs; colored boxes) via emulsion PCR, (2) cloning of the oligonucleotide library into the pGL4.23 backbone plasmid, which is sequenced (*Sequencing round 1*) to associate barcodes with variants, (3) replacing the filler region from the plasmid pool with a minimal promoter (minP; 90° arrow) and GFP reporter (green), (4) transfection of the resulting plasmid library into the target cell line followed by DNA and RNA extraction, and (5) quantification of expression based on RNA/DNA barcode counts (*Sequencing round 2*). This process was conducted separately for moderate (hSTR1) and high barcode complexity (hSTR2) libraries. **(b) Comparison of variant-BC pairs detected by NGS with Illumina and Element.** The Venn diagram depicts captured variant-BC associations using Illumina (orange) and Element (purple) for hSTR1. The Sankey diagram expands the Venn diagram to show the concordance of BC assignments by each platform at the STR locus or variant level, since in some cases we observed agreement on the STR locus but disagreement on the copy number of the variant. **(c) Proportion of detected variant-BC pairs classified by repeat unit length.** The percentages of all variants by repeat unit length in the original oligonucleotide pool design are shown in the top box. Subsequent rows show the breakdown by the categories in b. 
**(d) Percentage of homopolymers by copy number detected by each sequencing platform.** The x-axis shows the homopolymer copy number and the y-axis shows the percentage of all homopolymer variants with that length in the original design (light blue), vs. those captured by Illumina (orange) or Element (blue) in hSTR1. Similar analyses for hSTR2 are shown in Supplementary Fig. 1. **(e) Number of variants identified after merging Illumina and Element reads.** The line graph depicts the number of variants detected (y-axis) for the high (blue) and moderate-complexity (purple) association libraries as a function of the threshold used for the minimum number of supporting reads required for each variant-BC association (x-axis). The inset bar graph shows the number of variants detected (out of 100k synthesized) using our default threshold of ≥3 reads (dashed line in main panel). **(f) Distribution of the number of BCs detected per variant in the high-complexity (blue) and moderate-complexity (purple) libraries.** The inset bar plot shows the average per library. **(g) Correlation of RNA/DNA expression ratios between replicates.** The plot compares two replicates of the hSTR1 library. Plots comparing all replicate pairs for hSTR1–2 are shown in Supplementary Figs. 4–5.
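
The read-support threshold described in panel (e) can be illustrated with a minimal sketch. It assumes each sequencing read has already been reduced to a (barcode, variant) pair; the function name `merge_and_filter`, the input format, and the rule of discarding barcodes tied between two variants are illustrative assumptions, not the authors' pipeline. Only the default of ≥3 supporting reads comes from the caption.

```python
from collections import Counter, defaultdict

MIN_READS = 3  # default threshold stated in the caption (>=3 supporting reads)

def merge_and_filter(illumina_pairs, element_pairs, min_reads=MIN_READS):
    """Pool (barcode, variant) observations from both platforms and keep each
    barcode whose best-supported variant has at least `min_reads` reads."""
    # Adding Counters sums the read support across the two platforms.
    counts = Counter(illumina_pairs) + Counter(element_pairs)

    per_barcode = defaultdict(Counter)
    for (barcode, variant), n in counts.items():
        per_barcode[barcode][variant] = n

    assigned = {}
    for barcode, variants in per_barcode.items():
        ranked = variants.most_common(2)
        best_variant, best_n = ranked[0]
        # Assumption: a barcode tied between two variants is ambiguous and dropped.
        tied = len(ranked) > 1 and ranked[1][1] == best_n
        if best_n >= min_reads and not tied:
            assigned[barcode] = best_variant
    return assigned
```

Raising `min_reads` trades the number of recovered variants against confidence in each association, which is the trade-off the line graph in panel (e) depicts.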

**Figure 2 |. Characterization of STR variants with high vs. low expression in our MPRA framework.**
**(a) Schematic overview of expression results.** For each STR variant, we quantified DNA and RNA abundances (counts per million; CPM), the RNA/DNA ratio, Z-score of the ratio relative to all other variants, and repeat unit sequence. RNA abundances are represented as wavy lines. **(b) Comparison of RNA/DNA measurements across moderate- and high-complexity association libraries.** Each dot represents a single variant and is based on summing RNA and DNA read counts across each barcode associated with each variant merged across all replicates. The dashed line indicates the x=y diagonal. **(c) Distribution of expression values for each repeat.** The boxplot shows the distribution of expression Z-scores for all variants with each repeat unit. Pink and orange represent G/C rich repeat units while blue and purple represent A/T rich repeat units. **(d) Concordance of expression values for each repeat unit between the moderate complexity hSTR1 (x-axis) vs. high complexity hSTR2 (y-axis) libraries.** Expression is summarized for each repeat unit as the median Z-score across variants containing that unit. Color denotes the category of repeat unit. The dashed line indicates the x=y diagonal. Only repeat units with at least 200 variants included in the analysis are shown in **c-d**.
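
The quantities listed in panel (a) can be sketched as follows, assuming read counts have already been summed across the barcodes of each variant. The function names and the use of a population standard deviation on untransformed ratios are assumptions made for illustration; the normalization actually used is described in the preprint's Methods.

```python
from statistics import mean, pstdev

def cpm(counts):
    """Convert raw read counts per variant to counts per million (CPM)."""
    total = sum(counts.values())
    return {variant: 1e6 * n / total for variant, n in counts.items()}

def expression_summary(dna_counts, rna_counts):
    """Return (ratios, zscores): the RNA/DNA CPM ratio per variant and its
    Z-score relative to all variants in the library."""
    dna_cpm, rna_cpm = cpm(dna_counts), cpm(rna_counts)
    # Variants with no DNA reads cannot be normalized and are skipped.
    ratios = {
        v: rna_cpm.get(v, 0.0) / dna_cpm[v]
        for v in dna_cpm
        if dna_cpm[v] > 0
    }
    mu = mean(ratios.values())
    sigma = pstdev(ratios.values())  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all variants have the same ratio; Z-score undefined")
    zscores = {v: (r - mu) / sigma for v, r in ratios.items()}
    return ratios, zscores
```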

**Figure 3 |. Relationship between STR copy number and MPRA reporter activity across loci.**
**(a) Schematic representation of expression library barcode count analysis.** For each locus, we compute RNA/DNA ratios as well as the repeat copy number for the variant associated with each barcode across all three replicates. The left panel shows example data for a single locus. We perform a separate regression analysis between repeat copy number and RNA/DNA ratios for each locus, accounting for replicate as a covariate. **(b) Volcano plot summarizing associations between repeat copy number and reporter expression at each locus in hSTR1**. The x-axis shows the Pearson correlation between copy number and expression. The y-axis shows the log10 of the adjusted P-value. Selected loci are annotated with the hg38 locus coordinates and repeat unit. Purple=loci with a negative correlation between copy number and expression, green=loci with a positive association, grey=not significant. Representative plots are shown for two loci with positive and two with negative associations. For these, repeat copy number on the x-axes is shown relative to the reference genome (reference denoted by “0”). The y-axes denote RNA/DNA ratios. Each data point represents one barcode in one replicate and boxplots summarize distributions within each replicate. Regression slopes (β) are annotated for each locus. Blue=replicate 1, orange=replicate 2, green=replicate 3. **(c) Distribution of regression effect sizes (β) for significant loci, grouped by repeat unit type.** Only repeat units for which more than 50 loci were tested in the regression analysis are shown. Each dot represents a single locus. Boxplots summarize the distribution of values across loci for each repeat unit. Dashed orange boxes on the x-axis indicate repeat units with more than 15 loci with positive and more than 15 with negative effects. 
**(d) Summary of effect directions for each repeat unit.** Bars show the number of unique repeat units for loci with no significant effect (grey), negative effects (purple), positive effects (green), or both positive and negative effects (light green and orange). The inset Venn diagram indicates the overlap of repeat units across categories. Repeat units highlighted in orange in **(c)** are also emphasized in orange within the “both” bar category. **(e) Distribution of reference repeat length for selected repeat units that display both positive and negative associations (bivalent).** Boxplots show the total repeat length (in bp) in hg38 for loci with positive (green), not significant (grey), or negative (purple) effects on expression. Only repeat units for which at least 15 loci each with positive and negative effects were identified are shown. ns=not significant; *=1.00e-02 < p ≤ 5.00e-02; **=1.00e-03 < p ≤ 1.00e-02; ***=1.00e-04 < p ≤ 1.00e-03; ****=p ≤ 1.00e-04.
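
The per-locus analysis in panel (a), a regression of RNA/DNA ratio on repeat copy number with replicate as a covariate, can be sketched without external libraries. The sketch assumes replicate enters as a categorical covariate; in that case, centering copy number and expression within each replicate yields the same slope as ordinary least squares with replicate indicator variables (the Frisch-Waugh-Lovell result). The function name and input layout are hypothetical, and significance testing is omitted.

```python
from collections import defaultdict
from math import sqrt

def locus_regression(observations):
    """observations: (replicate, copy_number, rna_dna_ratio) tuples, one per
    barcode per replicate. Returns (beta, pearson_r) for a single locus."""
    by_rep = defaultdict(list)
    for rep, cn, ratio in observations:
        by_rep[rep].append((cn, ratio))

    # Slope adjusted for replicate: center x and y within each replicate.
    sxx = sxy = 0.0
    for points in by_rep.values():
        mx = sum(x for x, _ in points) / len(points)
        my = sum(y for _, y in points) / len(points)
        for x, y in points:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
    if sxx == 0:
        raise ValueError("copy number does not vary within any replicate")
    beta = sxy / sxx

    # Plain Pearson r over all observations, ignoring replicate.
    xs = [cn for _, cn, _ in observations]
    ys = [ratio for _, _, ratio in observations]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    r = cov / sqrt(var_x * var_y)
    return beta, r
```

Copy number can be passed relative to the reference allele (reference = 0), as in the example plots; shifting the x-axis this way does not change the slope.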

**Figure 4 |. Deep perturbation analysis of candidate regulatory STR loci.**
**(a) Overview of locus selection for the dpSTR (deep perturbation of STRs) library.** We selected 300 loci with strong effect sizes (200 positive, 100 negative) from the analysis of the moderate-complexity hSTR1 library. **(b) Overview of perturbation design.** For each locus, we modified repeat number (left), repeat unit sequence (middle), and strand orientation (top right). We also replaced each repeat with variable length random sequences (bottom right). For each locus, the repeat unit is represented in orange and flanking genomic context in grey. A list of all designed oligonucleotides for dpSTR is given in Supplementary Table 3. **(c) Overview of data derived from the dpSTR library.** After cloning and transfection, a barcode count table is generated as for hSTR1 and hSTR2, listing the locus and perturbation associated with each barcode and the corresponding RNA/DNA ratio. **(d) Comparison of correlation coefficients of original and perturbation libraries.** Correlations are measured between copy number and expression for each variant using Pearson r. The plot includes loci passing QC steps in both libraries. The locus circled in dark pink with reference repeat unit CGTG is shown in detail in **e-g**. **(e–g) Representative plots for one locus illustrating repeat length versus RNA/DNA ratio across moderate-complexity hSTR1 (e), high-complexity hSTR2 (f), and deep perturbation libraries (g).** Repeat copy number on the x-axes is shown relative to the reference genome (reference denoted by “0”). The y-axes denote RNA/DNA ratios. Each data point represents one barcode in one replicate and boxplots summarize distributions within each replicate. Regression effect sizes (β), nominal P-values, and Pearson r are annotated for each locus. Blue=replicate 1, orange=replicate 2, green=replicate 3. **(h) Quantile-quantile plot comparing the distribution of regression P-values between hSTR1 and the dpSTR libraries.** Blue=hSTR1, orange=dpSTR. 
**(i–k) Comparison of regression effect sizes for paired perturbations at each locus.** Plots compare effect sizes when replacing the repeat sequence of each locus with random sequences (i), random sequences with matched GC content (j), or flipping the orientation (k) of each locus. Each dot represents a single locus in the dpSTR library.
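
The perturbation classes in panel (b) and the paired comparisons in panels (i–k) can be illustrated with a small sequence-generation sketch. Everything here is a simplification under stated assumptions: the function names are hypothetical, only the repeat tract (not the flanks) is flipped, and the GC-matched random sequence is produced by shuffling the repeat's own bases, which preserves length and base composition. The actual designs are listed in the preprint's Supplementary Table 3.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement, e.g. AAAC -> GTTT."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_matched_random(seq, rng):
    """Shuffle the bases of `seq`; length and GC content are preserved."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def perturb_locus(left_flank, unit, copy_numbers, right_flank, seed=0):
    """Build copy-number, orientation-flipped, and GC-matched random variants
    for one locus. Keys are (perturbation_type, copy_number)."""
    rng = random.Random(seed)  # fixed seed so the design is reproducible
    variants = {}
    for n in copy_numbers:
        repeat = unit * n
        variants[("copy_number", n)] = left_flank + repeat + right_flank
        # Assumption: only the repeat tract is flipped, not the flanks.
        variants[("flipped", n)] = left_flank + reverse_complement(repeat) + right_flank
        variants[("gc_matched_random", n)] = (
            left_flank + gc_matched_random(repeat, rng) + right_flank
        )
    return variants
```

Because every perturbation type is generated at the same copy numbers, regression effect sizes can be compared pairwise per locus, as in panels (i–k).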

**Figure 5 |. Repeat unit and flanking GC-content impact expression dynamics**
**(a) Overview of repeat unit representation in the dpSTR library.** Purple bars show the number of variants in the dpSTR library design with each repeat unit. Green bars, which are overlaid on the purple bars, represent the number of variants that were detected after quality filtering. The x-axis is shown on a log scale. **(b) Distribution of normalized expression (RNA/DNA ratios) by copy number across AAAC repeats.** All loci for which AAAC repeats were included as either the reference repeat unit or one of the perturbations are included. Color indicates replicate number (blue=replicate 1, orange=replicate 2, green=replicate 3). The normalization procedure is described in Methods. Note that some loci may be tested with the same repeat unit twice if the reference repeat at that locus contains imperfections, since we tested versions with and without the imperfection (Methods). **(c) Representative relationship between copy number and expression for an example AAAC repeat.** The x-axis gives the total repeat copy number. The y-axis denotes RNA/DNA ratio. Each data point represents one barcode in one replicate. Locus id, regression slope (β), nominal P-value, and Pearson r are annotated. Blue=replicate 1, orange=replicate 2, green=replicate 3. The locus originates at chr13:27450197–27450212 (hg38). **(d-e)** are the same as **b-c** but for AAG repeats. The example locus is from chr6:143450810–143450837 (hg38). **(f) UMAP visualization of expression patterns in the dpSTR library.** Each dot represents a single repeat unit/locus pair. Points are colored by repeat unit category. The UMAP projection was manually partitioned into Group I (primarily GC-rich repeats) and Group II (primarily AT-rich repeats). **(g) Visualization of Group I by repeat unit.** Keeping the UMAP pattern of Group I from (f), the colors indicate the repeat unit sequence of each repeat unit/locus pair. Green=ACGC; blue=AAAC; pink=AGGCGC and ACGGG; yellow=AGCCG; other=grey. 
Points in the other category did not group by repeat unit. **(h) Expression distribution by copy number.** RNA/DNA ratios are plotted by copy number across the ACGC cluster. The boxplot shows un-normalized barcode-level ratios for variants highlighted in green in panel g. **(i) Visualization of Group II by flanking GC content.** Keeping the UMAP pattern of Group II from (f), the colors indicate the GC % of the flanking sequence. Higher flanking GC content is represented by red and lower flanking GC % is represented by blue. Subgroups are based on manual inspection of the UMAP projection. **(j) Median RNA/DNA expression ratio across repeat unit/locus pairs for each subgroup of Group II.** For each subgroup identified by flanking GC % (panel i), the plot shows the median RNA/DNA ratio for the repeat unit/locus pairs within that subgroup. Colors correspond to subgroup classification as in panel i. The inset boxplot depicts the distribution of flanking GC % for repeat unit/locus pairs within each subgroup.

**Figure 6 |. G-quadruplex formation drives R-loop sensitivity in STR regulation.**
**(a) Schematic illustrating the stranded G-quadruplex (G4) structure formed by CCCCG/CGGGG repeats in the STR region upstream of the reporter.** Green boxes denote GFP and arrows represent the promoter and transcription direction. **(b) Distribution of variant-level log(KD/WT) expression ratios.** Distributions are shown for all variants (grey), repeat unit/locus pairs with CCG/CGG repeats (blue) and repeat unit/locus pairs with CCCCG/CGGGG repeats (orange). KD=knockdown; WT=wild type. **(c) Distribution of variant-level log(KD/WT) expression ratios stratified by repeat copy number.** Orange=CCCCG/CGGGG repeat variants; grey=all detected variants. The y=0 line is represented in black, the median for all detected variants in dashed grey, and the median for CCCCG/CGGGG repeat variants in dashed orange. **(d) Distribution of predicted G4 scores for detected variants containing the CCCCG/CGGGG repeat unit, stratified by repeat copy number.** The G4 consensus sequence is annotated in the bottom right. The dashed line shows the median predicted sum of G4 scores for each repeat copy number. **(e) Comparison of regression effect sizes for association tests between repeat copy number and expression for the WT vs. RNase H1 knockdown (RNH) conditions.** Each dot represents one repeat unit/locus pair. Blue=CCG/CGG repeats; orange=CCCCG/CGGGG repeats; grey=other repeats. Data points with G4 score ≥ μ + σ are highlighted by a black circle. The *y=x* line is represented as a solid black line, the best fit line for all data points is represented as a dashed black line, and the best fit line for all data points with CCCCG/CGGGG repeat units is represented as a dashed orange line. Selected loci with the CCCCG/CGGGG repeat unit (also depicted in f) are annotated according to the proximal gene. The inset bar plot shows the Pearson correlation between effect sizes stratified by repeat unit. The red dashed line shows the correlation when considering all data points. 
**(f) Relationship between copy number and expression for examples with outlier changes in effect size in WT vs. KD conditions.** Outliers were identified by inspection of panel e. Schematics above each plot illustrate the position of the region that was studied in the MPRA (top; orange=STR, grey=flanking, the minP and GFP are annotated as in Fig. 1) and dark blue rectangles (bottom) show gene annotations at the corresponding locus. Each data point represents a single barcode/replicate pair. Green=WT, red=RNase H1 KD. Shaded regions indicate 95% confidence intervals.
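
Two quantities used in this figure, the variant-level log(KD/WT) ratio in panels (b–c) and the G4 score cutoff of μ + σ in panel (e), can be sketched as below. The caption does not state the base of the logarithm or whether σ is a population or sample standard deviation; base 2 and the population form are assumptions here, and the function names are hypothetical.

```python
from math import log2
from statistics import mean, pstdev

def log_kd_wt(wt_ratio, kd_ratio):
    """log2 of knockdown over wild-type RNA/DNA ratio for one variant.
    Positive values mean higher expression under RNase H1 knockdown."""
    if wt_ratio <= 0 or kd_ratio <= 0:
        raise ValueError("expression ratios must be positive")
    return log2(kd_ratio / wt_ratio)

def high_g4_pairs(g4_scores):
    """Return the repeat unit/locus pairs whose predicted G4 score is at
    least mean + one standard deviation of all scores."""
    mu = mean(g4_scores.values())
    sigma = pstdev(g4_scores.values())  # assumption: population form
    cutoff = mu + sigma
    return {pair for pair, score in g4_scores.items() if score >= cutoff}
```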
