PLoS Genet. 2014 Jun 19;10(6):e1004418.
doi: 10.1371/journal.pgen.1004418. eCollection 2014 Jun.

Digital genotyping of macrosatellites and multicopy genes reveals novel biological functions associated with copy number variation of large tandem repeats

Manisha Brahmachary et al. PLoS Genet. 2014.

Abstract

Tandem repeats are common in eukaryotic genomes but remain poorly studied because they are difficult to assay. Here, we demonstrate the utility of Nanostring technology as a targeted approach for accurate measurement of tandem repeats, even at extremely high copy number, and apply this technology to genotype 165 HapMap samples from three different populations and five species of non-human primates. We observed extreme variability in copy number of tandemly repeated genes, with many loci showing 5-10 fold variation in copy number among humans. Many of these loci show hallmarks of genome assembly errors, and the true copy number of many large tandem repeats is significantly under-represented even in the high quality 'finished' human reference assembly. Importantly, we demonstrate that most large tandem repeat variations are not tagged by nearby SNPs, and are therefore essentially invisible to SNP-based GWAS approaches. Using association analysis we identify many cis correlations of large tandem repeat variants with nearby gene expression and DNA methylation levels, indicating that variations of tandem repeat length are associated with functional effects on the local genomic environment. This includes an example where expansion of a macrosatellite repeat is associated with increased DNA methylation and suppression of nearby gene expression, suggesting a mechanism termed "repeat induced gene silencing", which has previously been observed only in transgenic organisms. We also observed multiple signatures consistent with altered selective pressures at tandemly repeated loci, suggesting important biological functions. Our studies show that tandemly repeated loci represent a highly variable fraction of the genome that has been systematically ignored by most previous studies, and that copy number variation at these loci can exert functionally significant effects. We suggest that future studies of tandem repeat loci will lead to many novel insights into their role in modulating both genomic and phenotypic diversity.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Structure and measurement of tandemly repeated genes using Nanostring assays.
(a) Multiple copies of the REXO1L1 gene occur as a tandemly repeated cluster in 8q21.2. Although just 4 copies of REXO1L1 are annotated in hg18, at least seven copies of a ∼12.2 kb repeat are visible, separated by a genome assembly gap (red arrows). Our studies show that this locus in fact varies from ∼110 to ∼250 diploid copies in normal humans. BLAT alignments show that one of the two probes used to assay this locus matches each of the annotated repeat copies, in addition to an unassembled copy on chr8_random (Table S1). Note the reduced mappability and almost complete absence of SNPs within the REXO1L1 locus. Screenshot shows a 300 kb region of hg18 (chr8:86,675,000–86,975,000). (b) CT47 is another gene present as a tandemly repeated cluster in Xq24. Measurements of CT47 copy number using two independent probes targeted to different parts of the gene show extremely high concordance (R² = 0.98), indicating that Nanostring probe counts provide accurate measurements that are directly proportional to copy number over a wide dynamic range. (c) Direct copy number estimation for CT47 by Pulsed Field Gel Electrophoresis (PFGE) in 12 individuals shows high concordance with Nanostring probe counts (R² = 0.88).
Figure 2. High frequency of population stratification for CNV of multicopy genes.
17 of 116 (14.7%) multicopy genes show high levels of differentiation in copy number (Vst>0.2) among European, African and Asian populations. Note that probe counts on the y-axis are shown on a log2 scale.
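The caption does not spell out how Vst is computed. As a minimal sketch, assuming the commonly used definition Vst = (V_T − V_S) / V_T, where V_T is the variance of the measurement across all individuals and V_S is the average within-population variance weighted by population size, the statistic could be computed as follows. The function name `vst` and the log2 probe counts are hypothetical and not taken from the study.

```python
from statistics import pvariance


def vst(groups):
    """Return Vst = (V_T - V_S) / V_T for a dict of population -> values.

    Assumed definition: V_T is the variance over all individuals pooled,
    V_S is the size-weighted mean of the within-population variances.
    """
    pooled = [value for values in groups.values() for value in values]
    v_total = pvariance(pooled)
    if v_total == 0:
        return 0.0  # no variation at all, so no differentiation
    v_within = sum(
        pvariance(values) * len(values) for values in groups.values()
    ) / len(pooled)
    return (v_total - v_within) / v_total


# Hypothetical log2 probe counts for one multicopy gene.
log2_counts = {
    "European": [5.0, 5.2, 5.1, 4.9],
    "African": [6.0, 6.1, 5.9, 6.2],
    "Asian": [5.0, 5.1, 4.8, 5.1],
}

score = vst(log2_counts)
print(f"Vst = {score:.2f}; exceeds the 0.2 cutoff: {score > 0.2}")
```

Values near 0 mean that copy number is distributed similarly in every population, while values near 1 mean that almost all of the variance lies between populations.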
Figure 3. Most multicopy genes show very low levels of linkage disequilibrium with nearby SNPs.
Correlation analysis for each of the 121 polymorphic probes targeting multicopy genes and macrosatellites with SNP markers within ±250 kb yielded a median R² = 0.18 between the highest ranked filtered SNP and probe count. Only 3 of 116 (∼3%) multicopy genes showed an R² ≥ 0.8 with any SNP in the three populations studied. Therefore the vast majority of tandem repeat variations lack informative tag SNPs, and association studies of multicopy loci require specific genotyping of each locus to obtain accurate copy number information for these regions.
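The tag-SNP search described in this caption can be sketched as follows: for one probe, compute the squared Pearson correlation between probe count and the genotype at every SNP within ±250 kb, keep the best SNP, and call the locus tagged if R² ≥ 0.8. This is an illustration only. The genotype coding (0, 1 or 2 copies of one allele), the SNP identifiers, positions and counts are all hypothetical, and the SNP filtering applied in the study is not reproduced.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP or a constant probe carries no signal
    return sxy * sxy / (sxx * syy)


def best_tag_snp(probe_pos, probe_counts, snps, window=250_000, threshold=0.8):
    """Return (snp_id, r2, is_tagged) for the best SNP inside the window."""
    best_id, best_r2 = None, 0.0
    for snp_id, (position, genotypes) in snps.items():
        if abs(position - probe_pos) > window:
            continue  # outside the +/-250 kb window
        r2 = r_squared(genotypes, probe_counts)
        if r2 > best_r2:
            best_id, best_r2 = snp_id, r2
    return best_id, best_r2, best_r2 >= threshold


# Hypothetical data: six individuals, genotypes coded as allele counts.
probe_counts = [10, 12, 14, 16, 18, 20]
snps = {
    "snp_a": (1_100_000, [0, 0, 1, 1, 2, 2]),
    "snp_b": (1_200_000, [1, 0, 2, 0, 1, 1]),
    "snp_far": (2_000_000, [0, 0, 1, 1, 2, 2]),  # ignored: outside the window
}

snp_id, r2, tagged = best_tag_snp(1_000_000, probe_counts, snps)
print(snp_id, round(r2, 3), tagged)
```

Under this scheme a locus whose best R² is near the reported median of 0.18 would be classed as untagged.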
Figure 4. Variation in copy number of tandem repeats and multicopy genes is associated with alterations of local DNA methylation.
(a and c) Correlation values between copy number of (a) MSat10 and (c) CCL4 and all methylation probes within ±500 kb in 118 CEU and YRI HapMap individuals. (b and d) Scatter plots showing individual-level data for the methylation probes most strongly correlated with copy number of MSat10 and CCL4. (b) Increasing copy number of MSat10 is associated with increased methylation levels of cg14316660 (R = 0.63, permutation p<0.001). This association was replicated using a Sequenom assay targeted to the MSat10 locus (Figure 5c), confirming that it is not simply a technical artifact related to CNV of the underlying probe binding sites. (d) Increasing copy number of CCL4 is associated with reduced methylation levels of cg11728928 (R = −0.59, permutation p<0.001). In (a) and (c), black bars indicate the interval to which each Nanostring probe maps and CpGs showing correlation p<0.01 are indicated in blue; the CpG showing the strongest correlation is shown as a filled blue circle and labeled with a grey arrow (with individual data plotted in (b) and (d), respectively).
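The caption reports permutation p-values without describing the permutation scheme. One simple way such a value can be obtained is sketched below, assuming that methylation values are shuffled across individuals and that the p-value is the fraction of shuffles whose absolute correlation is at least as large as the observed one. The function names, the number of permutations and the data are hypothetical.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p(xs, ys, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the correlation of xs and ys."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical repeat copy numbers and methylation fractions.
copy_number = [4, 8, 12, 16, 20, 24, 30, 42]
methylation = [0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.60, 0.70]

print("R =", round(pearson_r(copy_number, methylation), 2))
print("permutation p =", permutation_p(copy_number, methylation))
```

With 1,000 permutations the smallest attainable value is 1/1001, so resolving p<0.001 as in the caption would need more permutations than this sketch uses.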
Figure 5. Association of MSat10 copy number with neighboring gene expression and epigenetic marks.
(a) MSat10 is a 5.2 kb GC-rich tandem repeat that lies ∼4 kb distal to the gene ZFP37. Although 6 copies of this 5.2 kb repeat are present in the hg18 assembly, this macrosatellite is highly polymorphic in size, varying from 4–42 copies in HapMap. ChIP-seq analysis shows the presence of histone marks characteristic of heterochromatin, such as trimethylation of histone H3 at lysine 9 and trimethylation of histone H4 at lysine 20. Screenshot from the UCSC Genome Browser shows ZFP37 (Zinc Finger Protein 37), the adjacent MSat10 repeat (red arrows), and the results of ChIP-seq analysis. (b) In 58 unrelated CEU HapMap individuals we observed an inverse correlation between copy number of the MSat10 repeat and expression level of the adjacent gene ZFP37, demonstrating suppression of ZFP37 expression associated with larger repeat sizes. (c) Using a targeted Sequenom assay, we confirm that variable methylation of MSat10 is highly correlated with repeat number (R² = 0.76, p = 4.4×10⁻¹²), showing a strong relationship between repeat size and local epigenetic state. (d) Proposed model of repeat induced gene silencing at the MSat10 locus. At low repeat numbers the region is euchromatic and expression of the neighboring ZFP37 gene is high. However, expansions of the macrosatellite result in an accumulation of heterochromatic marks in the region, including repressive histone modifications and DNA methylation, resulting in the suppression of local gene expression. Although our model shows methylation on all MSat10 copies, our data do not exclude the possibility that on expanded MSat10 alleles DNA methylation is limited to a subset of the repeat units. Lollipops represent DNA methylation, with open circles indicating low and filled black circles high DNA methylation, and grey ‘Me’ bubbles represent repressive histone methylation.
Figure 6. REXO1L1 and TCEB3C show extreme variation in copy number among primate species.
(a) REXO1L1 is one of the most extreme examples of copy number variable genes in human, with 108–266 copies of the ∼12.2 kb repeat unit observed in the 165 HapMap individuals studied. However, even more extreme variation is observed among different primates. We observed ∼450 and ∼550 copies in bonobo and chimpanzee, respectively, and copy numbers of ∼400 and ∼860 in two different gorilla individuals. In contrast, macaque has an estimated 22 copies, while gibbon falls within the same range seen in human. (b) While TCEB3C ranges from 9–59 copies among HapMap individuals (mean 29 copies), all five species of primate studied show increased copy number, indicating a reduction of TCEB3C copy number specifically in the human lineage. As with REXO1L1, gorilla and chimpanzee showed the highest copy numbers, with 115 in chimpanzee and ∼270 copies in both gorillas studied.
Figure 7. Multicopy genes show evidence of altered selective pressures on amino acid sequence during recent primate evolution.
Density plots showing the distribution of dN/dS ratios for multicopy genes (green) compared to all RefSeq genes (red) for human versus chimpanzee. There is a significant enrichment for elevated rates of non-synonymous substitution in multicopy genes versus the genome average (p = 3.3×10⁻⁷, Kolmogorov-Smirnov test). This excess of non-synonymous amino-acid changes in recent primate evolution at multicopy genes is consistent with reduced selective constraint and/or selection for proteins with altered function. Similar results are obtained when comparing human with orangutan and macaque (Figure S5).
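The Kolmogorov-Smirnov comparison named in this caption rests on the largest vertical gap between the two empirical cumulative distributions. A minimal sketch of that statistic is given below. The dN/dS values are hypothetical, and converting the statistic to a p-value is left to a statistics library (for example SciPy's `scipy.stats.ks_2samp`).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two ECDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Hypothetical dN/dS ratios, not values from the study.
multicopy_genes = [0.9, 1.1, 1.3, 0.8, 1.5]
all_genes = [0.2, 0.3, 0.25, 0.4, 0.9, 0.15]

print("D =", round(ks_statistic(multicopy_genes, all_genes), 3))
```

A statistic near 0 means the two dN/dS distributions are nearly identical, while a value near 1 means they barely overlap.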
