Comparative Study
Proc Natl Acad Sci U S A. 2019 Nov 12;116(46):23243-23253.
doi: 10.1073/pnas.1912175116. Epub 2019 Oct 28.

Human-specific tandem repeat expansion and differential gene expression during primate evolution

Arvis Sulovari et al. Proc Natl Acad Sci U S A.

Abstract

Short tandem repeats (STRs) and variable number tandem repeats (VNTRs) are important sources of natural and disease-causing variation, yet they have been problematic to resolve in reference genomes and genotype with short-read technology. We created a framework to model the evolution and instability of STRs and VNTRs in apes. We phased and assembled 3 ape genomes (chimpanzee, gorilla, and orangutan) using long-read and 10x Genomics linked-read sequence data for 21,442 human tandem repeats discovered in 6 haplotype-resolved assemblies of Yoruban, Chinese, and Puerto Rican origin. We define a set of 1,584 STRs/VNTRs expanded specifically in humans, including large tandem repeats affecting coding and noncoding portions of genes (e.g., MUC3A, CACNA1C). We show that short interspersed nuclear element-VNTR-Alu (SVA) retrotransposition is the main mechanism for distributing GC-rich human-specific tandem repeat expansions throughout the genome but with a bias against genes. In contrast, we observe that VNTRs not originating from retrotransposons have a propensity to cluster near genes, especially in the subtelomere. Using tissue-specific expression from human and chimpanzee brains, we identify genes where transcript isoform usage differs significantly, likely caused by cryptic splicing variation within VNTRs. Using single-cell expression from cerebral organoids, we observe a strong effect for genes associated with transcription profiles analogous to intermediate progenitor cells. Finally, we compare the sequence composition of some of the largest human-specific repeat expansions and identify 52 STRs/VNTRs with at least 40 uninterrupted pure tracts as candidates for genetically unstable regions associated with disease.

Keywords: STR; VNTR; genome instability; tandem repeat; tandem repeat expansion.

Conflict of interest statement

Competing interest statement: E.E.E. is on the scientific advisory board of DNAnexus, Inc.

Figures

Fig. 1.
Phasing and assembly of STRs/VNTRs. The targeted phasing of tandem repeat sequences in 3 NHPs: chimpanzee, gorilla, and orangutan. 10x Genomics linked reads from each of the great apes were mapped to the human genome reference (GRCh38) followed by the identification of SNVs and phasing of their genotypes using Long Ranger. Next, the phased SNV genotypes were used to partition the PacBio reads of each individual into the 2 parental haplotypes followed by the assembly of 2 haplotype-partitioned contigs per locus. Materials and Methods has more details.
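The read-partitioning step described in this caption can be illustrated with a minimal sketch. It assumes each long read has already been reduced to the bases it shows at phased SNV positions; the function names, the input dictionaries, and the majority-vote rule with a `min_margin` parameter are all hypothetical and are not taken from the paper or from Long Ranger.

```python
from collections import Counter


def assign_haplotype(read_alleles, phased_snvs, min_margin=1):
    """Assign one read to haplotype 1 or 2 by majority vote.

    read_alleles: dict mapping position -> base observed in the read.
    phased_snvs:  dict mapping position -> (hap1_base, hap2_base).
    Returns 1, 2, or None when the read is uninformative or tied.
    (Illustrative rule only; not the authors' implementation.)
    """
    votes = Counter()
    for pos, base in read_alleles.items():
        if pos not in phased_snvs:
            continue  # position carries no phasing information
        hap1_base, hap2_base = phased_snvs[pos]
        if base == hap1_base:
            votes[1] += 1
        elif base == hap2_base:
            votes[2] += 1
    if abs(votes[1] - votes[2]) < min_margin:
        return None
    return 1 if votes[1] > votes[2] else 2


def partition_reads(reads, phased_snvs):
    """Split reads into haplotype bins; None holds unassigned reads."""
    bins = {1: [], 2: [], None: []}
    for name, alleles in reads.items():
        bins[assign_haplotype(alleles, phased_snvs)].append(name)
    return bins
```

Each haplotype bin would then be assembled separately, which is what yields the 2 haplotype-partitioned contigs per locus mentioned above.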
Fig. 2.
Human lineage-specific expansion. The largest human vs. NHP STR/VNTR copy number differences. The top 30 ab initio (no evidence of tandem repeats in other ape genomes) and HSE loci for intronic, intergenic, UTR, and exonic regions, if available, are shown in green (NHP) and blue (human) violin plots, while the solid red lines represent the number of tandem repeat copies for each locus in the human genome reference (GRCh38). GRCh38 carries a significantly shorter allele in 73, 57, 9, and 13% of the intronic, intergenic, 5′/3′ UTR, and exonic loci, respectively. In the case of exonic STRs/VNTRs, we selected additional protein-coding loci from the high-quality STR/VNTR set.
Fig. 3.
Sequence properties of human-expanded STRs/VNTRs. Ab initio and HSE tandem repeats have distinct sequence composition. (A) The relative abundance of STRs and VNTRs was broken down by position relative to a gene (i.e., intergenic, intronic, and protein coding). The lighter color corresponds to transposable element (TE)-associated tandem repeats, while the darker color corresponds to simple (i.e., non–TE-associated) tandem repeats. (B) Pie charts for the ab initio (Upper) and HSE (Lower) tandem repeats. The labels correspond to short interspersed nuclear element (SINE), long interspersed nuclear element (LINE), long terminal repeat (LTR), endogenous retrovirus (ERV), and transposons (DNA). (C) Boxplots of the percentage of GC content for all HSE and ab initio tandem repeats that are SVA elements. The total counts for the SVA elements are shown by each of the 6 subfamilies (A to F), and the ORs and P values on the right-hand side were calculated using Fisher’s exact test on the observed SVA counts in our call set compared with their relative abundance in GRCh38 (i.e., 1,128 SVA_A, 848 SVA_B, 501 SVA_C, 1,546 SVA_D, 701 SVA_E, and 1,026 SVA_F). An OR > 1 represents an enrichment. The boxplots use the GC content from all of the assembled human sequences. (D) A multiple expectation maximization for motif elicitation (MEME) analysis using the sequences categorized by functional annotation with respect to the gene body. (E) A comparison of average percentage of GC content for all STR/VNTR loci in our call set. The different characters in the scatterplot correspond to A = HSE tandem repeats that are not SVA associated, B = HSE tandem repeats that are associated with SVAs, C = ab initio tandem repeats that are not associated with SVAs, and D = ab initio tandem repeats that are associated with SVA elements. The MEME motifs are shown for each of the 4 categories.
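The odds ratios and P values in panel C come from Fisher's exact test on SVA subfamily counts. The sketch below shows one way such a test could be set up with only the Python standard library. The GRCh38 counts are those quoted in the caption; the 2×2 table layout (one subfamily versus all others, call set versus reference) is an assumption, since the caption does not spell it out, and the call-set counts must be supplied by the caller because they are not given here.

```python
from math import comb

# GRCh38 SVA subfamily counts as quoted in the caption.
GRCH38_SVA = {
    "SVA_A": 1128, "SVA_B": 848, "SVA_C": 501,
    "SVA_D": 1546, "SVA_E": 701, "SVA_F": 1026,
}


def fisher_exact_2x2(a, b, c, d):
    """Return (odds_ratio, two_sided_p) for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Two-sided P: sum over tables no more likely than the observed one.
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)


def sva_enrichment(observed, reference=GRCH38_SVA):
    """Test each subfamily against all others (assumed table layout)."""
    obs_total = sum(observed.values())
    ref_total = sum(reference.values())
    results = {}
    for name, ref_n in reference.items():
        a = observed.get(name, 0)
        results[name] = fisher_exact_2x2(
            a, obs_total - a, ref_n, ref_total - ref_n)
    return results
```

Calling `sva_enrichment` with a dictionary of per-subfamily counts from a call set returns an (OR, P) pair per subfamily, where an OR above 1 indicates enrichment relative to GRCh38, matching the caption's convention.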
Fig. 4.
The longest pure repeat tract length. Distributions are shown using boxplots for motif sizes of ≥2 bp. The motif sequence and the corresponding number of tandem repeats are shown for the longest pure repeat tract, and the gene name is shown for intronic STRs/VNTRs only. The dotted horizontal line corresponds to the 40-tandem repeat copies threshold used to identify longest pure tracts. The numbers (n) in parentheses on the x axis correspond to the total numbers of tandem repeats observed for a given motif size. Motifs ≥11 bp were binned into 1 boxplot.
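The quantity plotted here, the longest uninterrupted run of a motif, can be computed with a short dynamic-programming pass. This is a minimal sketch assuming the motif is already known and that matching is exact; the function names are hypothetical, and the paper's own pipeline may define motifs and tracts differently.

```python
PURE_TRACT_THRESHOLD = 40  # copy-number threshold named in the caption


def longest_pure_tract(seq, motif):
    """Return the largest number of back-to-back exact copies of motif.

    copies_ending[i] holds how many consecutive copies end at index i,
    so runs in every phase of the sequence are considered.
    """
    m = len(motif)
    if m == 0:
        raise ValueError("motif must be non-empty")
    copies_ending = [0] * (len(seq) + 1)
    best = 0
    for end in range(m, len(seq) + 1):
        if seq[end - m:end] == motif:
            copies_ending[end] = copies_ending[end - m] + 1
            best = max(best, copies_ending[end])
    return best


def is_unstable_candidate(seq, motif, threshold=PURE_TRACT_THRESHOLD):
    """Flag a locus whose longest pure tract reaches the threshold."""
    return longest_pure_tract(seq, motif) >= threshold
```

For example, an AAG tract with 39 pure copies followed by a single AGG interruption and 10 more AAG copies has a longest pure tract of 39 and would fall below the threshold, even though it contains 49 AAG copies in total.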
Fig. 5.
STR/VNTR sequence composition plots. The 4 loci represent STRs/VNTRs with ≥40 tandem repeat copies. The sequences from each human and NHP haplotype were colored according to their k-mer abundance (Materials and Methods). For the CHM13 sample, sequences from both CLR and HiFi assemblies have been included labeled as “CHM13” and “CHM13_HiFi,” respectively, which provide a replicate measure of sequence accuracy. (A) An STR located upstream of RNF219 is composed of 7 to 62 uninterrupted tandem copies of an AAAG expansion in humans. Three human haplotypes contain clustered AAGG interruptions, while 1 chimpanzee haplotype contains a clustered interruption of AG repeats. (B) A human-specific STR expansion is located in the intron of PHLDB2 and is composed of 18 to 70 uninterrupted repeats of AAG. Periodic interruptions of AGG exist in 3 human haplotypes and GRCh38. (C) An STR located in the intron of ADGRE2 contains 3 to 42 uninterrupted tandem repeat copies of TTTC. A cluster of a continuous tract of pure TC repeats interrupts the tetranucleotide repeat in 1 of the Puerto Rican haplotypes. (D) A human-specific VNTR expansion is located in the intron of CLCN5 and is composed of 21 to 80 tandem repeat copies of a 26-bp motif. A single interruption by a 30-bp motif that contains 3 additional adenines in position 23 occurs in the Puerto Rican and Yoruban haplotypes (gray bars).
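Coloring sequences by k-mer abundance requires a count of each k-mer across the haplotypes at a locus and a per-position lookup of that count. The sketch below shows a generic version of that idea. The actual procedure is in the paper's Materials and Methods and is not reproduced here, so the choice of k, the pooling of all haplotypes into one count table, and the function names are assumptions.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across all input sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def abundance_track(seq, counts, k):
    """Return, for each k-mer start position in seq, its pooled count."""
    return [counts[seq[i:i + k]] for i in range(len(seq) - k + 1)]
```

In such a track, positions inside a pure repeat carry high counts, while an interruption (such as the AAGG units within the AAAG tract in panel A) shows up as a stretch of low-abundance k-mers.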

