Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2022 Feb 18:11:e76257.
doi: 10.7554/eLife.76257.

Mosaic cis-regulatory evolution drives transcriptional partitioning of HERVH endogenous retrovirus in the human embryo

Affiliations

Mosaic cis-regulatory evolution drives transcriptional partitioning of HERVH endogenous retrovirus in the human embryo

Thomas A Carter et al. Elife. .

Abstract

The human endogenous retrovirus type-H (HERVH) family is expressed in the preimplantation embryo. A subset of these elements are specifically transcribed in pluripotent stem cells where they appear to exert regulatory activities promoting self-renewal and pluripotency. How HERVH elements achieve such transcriptional specificity remains poorly understood. To uncover the sequence features underlying HERVH transcriptional activity, we performed a phyloregulatory analysis of the long terminal repeats (LTR7) of the HERVH family, which harbor its promoter, using a wealth of regulatory genomics data. We found that the family includes at least eight previously unrecognized subfamilies that have been active at different timepoints in primate evolution and display distinct expression patterns during human embryonic development. Notably, nearly all HERVH elements transcribed in ESCs belong to one of the youngest subfamilies we dubbed LTR7up. LTR7 sequence evolution was driven by a mixture of mutational processes, including point mutations, duplications, and multiple recombination events between subfamilies, that led to transcription factor binding motif modules characteristic of each subfamily. Using a reporter assay, we show that one such motif, a predicted SOX2/3 binding site unique to LTR7up, is essential for robust promoter activity in induced pluripotent stem cells. Together these findings illuminate the mechanisms by which HERVH diversified its expression pattern during evolution to colonize distinct cellular niches within the human embryo.

Keywords: HERVH; embryonic stem cells; endogenous retroviruses; genetics; genomics; human; primate evolution; transposable elements.

PubMed Disclaimer

Conflict of interest statement

TC, MS, GD, JC, JR, CF No competing interests declared

Figures

Figure 1.
Figure 1.. Phylogenetic analysis of LTR7 sequences.
(A) Unrooted phylogeny of all solo and 5’ LTR7 subfamilies from LTR7, 7b, 7c, and 7y. Colors denote clades consisting of previously annotated 7b, 7c, and 7y with >95% concordance. (B) Unrooted phylogeny of all solo and 5’ LTR7 sequences. All nodes with ultrafast bootstraps (UFbootstraps) > 0.95, >10 member insertions, and >1.5 substitutions/100 bp (~6 base pairs) are grouped and colored (see Materials and methods). Previously listed consensus sequences from 7b/c/y were included in the alignment and are shown in black. (C) Median-joining network analysis of all LTR7 and related majority rule consensus sequences. Ticks indicate the number of SNPs at non-gaps between consensus sequences. The size of circles is proportional to the number of members in each subfamily. Only LTR7 insertions that met filtering requirements (see Materials and methods) are included while 7b/c/y counts are from dfam.
Figure 1—figure supplement 1.
Figure 1—figure supplement 1.. Phylogenetic tree from LTR7 reverse transcriptase (RVT) domains.
The consensus RVT domain was used to extract RT from individual LTR7 insertions. Relative node ages are shown at the end of the terminal branch lengths and are colored corresponding to their subfamily of origin. Only nodes with ultrafast bootstrap support >0.95 are shown. All displayed nodes are considered high confidence.
Figure 2.
Figure 2.. Age analysis of LTR7 subfamilies.
(A) Proportion of a given subfamily that have 1:1 orthologous insertions between human and other primate species. LTR7 subfamilies are from the tree in Figure 1B; 7b/c/y subfamilies are from RepeatMasker annotations. Non-human primates are spaced out on the X axis in accordance with their approximate divergence times to the human lineage. (B) Terminal branch lengths of all LTR7 insertions from Figure 1B. Groups with similar liftover profiles were merged for statistical testing (see Materials and methods). Differences with padj <1e-15 are denoted with * (Wilcoxon rank-sum test with Bonferroni correction).
Figure 2—figure supplement 1.
Figure 2—figure supplement 1.. Gross reciprocal liftover from human to non-human primate (NHP).
The absolute number of human endogenous retrovirus type-H (HERVH) long terminal repeats (LTRs) that are in human is listed at MYA 0. Those that reciprocally liftover to chimpanzee, gorilla, orangutan, gibbon, and rhesus macaque are plotted as dots left to right with their relative divergence ages plotted as MYA (bottom). Points are connected with lines, where the slope approximates the rate of propagation of each subfamily between speciation events. This is a sister plot to main text Figure 2A; the only difference is instead of proportion of reciprocal liftover, this is the gross number.
Figure 3.
Figure 3.. Phyloregulatory analysis of LTR7.
(A) '“Phyloregulatory’ map of LTR7. The phylogenetic analysis to derive the circular tree is the same as for the tree in Figure 1B but rooted on the 7b consensus. Subfamilies defined in Figure 1 are denoted with dotted colored tips. Positive regulatory calls for each insertion are shown as tick marks of different colors and no tick mark indicates a negative call. All marks are derived from embryonic stem cell (ESC) except for ZNF90 and ZNF534, which are derived from ChIP-exo data after overexpression of these factors in HEK293 cells (see Materials and methods). (B) Heatmap of major activation and repression profiles. Proportions indicate the proportion of each group positive for a given characteristic. Trees group LTR7 subfamilies on regulatory signature, not sequence similarity. Asterisks denote statistical differences between given group and 7up1 (padj < 0.05 Wilcoxon rank-sum with Bonferroni correction). (C) Heatmap done in similar fashion to B but for repression marks. (D) Heatmap of transcribed (>2 fpkm) and untranscribed 7up1/2 (<2 fpkm) and all 7d1/2. Red asterisks denote statistical differences between 7d1/2 and 7up1 (padj < 0.05 chi-square Bonferroni correction). White asterisks denote differences between transcribed and untranscribed LTR7up.
Figure 3—figure supplement 1.
Figure 3—figure supplement 1.. Heatmap and aggregate signal of two replicates of whole-genome bisulfite sequencing (WGBS), GRO-seq (plus strand), H3K9me3, and SOX2 in H1 cells.
Visible window is 500 bp upstream of the start of 5’ (or solo) long terminal repeats (LTR) and extends to 8 kb after this point (~3 kb past 3’ LTR or ~7.5 kb past solo into genomic DNA). Loci are sorted on GRO-seq data in the visible window.
Figure 3—figure supplement 2.
Figure 3—figure supplement 2.. Violin plots visualize the density and distribution of ZNF534 in early embryogenesis and ESCs at passages 0 and 10.
Y axis is (log TPM [transcripts per million]).
Figure 4.
Figure 4.. Modular block evolution of LTR7 subfamilies.
(A) A multiple sequence alignment of LTR7 subfamily consensus sequences. The phylogenetic topology from Figure 1 is shown on the left. The MSA is broken down into sequence blocks (red lines) with differential patterns of relationships. (B) Parsimony trees from A sequence blocks. Subfamilies whose blocks do not match the overall phylogeny are highlighted in red. Bootstrap values > 0 are shown. (C) Blastn alignment of LTR7up1 block 2a and 2b. (D) A multiple sequence alignment of majority rule consensus sequences from each LTR7 subfamily detailing shared structure. Blocks show aligned sequence; gaps represent absent sequence. Colored sections identify putative phylogeny-breaking events. Recombination events whose directionality can be inferred (via aging) are shown with blocks and arrows on the cladogram. Recombination events with multiple possible routes are denoted with ‘?’. The deletion of 2b is denoted on the cladogram with a red ‘X’; the duplication of 2a is denoted with two red rectangles.
Figure 4—figure supplement 1.
Figure 4—figure supplement 1.. Parsimony tree of all consensus human endogenous retrovirus type-H (HERVH) long terminal repeats (LTR) blocks 2a and 2b.
Sequence blocks 2a and 2b for all subfamilies were aligned using MUSCLE and the output was used to generate a parsimony tree in SEAVIEW (see Materials and methods).
Figure 5.
Figure 5.. Expression profile of LTR7 subfamilies in human preimplantation embryonic lineages and embryonic stem cells (ESCs).
The solid dots and lines encompassing the violins represent the median and quartiles of single cellular RNA expression. The color scheme is based on embryonic stages, defined as maternal control of early embryos (oocytes, zygote, 2 cell and 4 cell stage), EGA (8 cell and morula), inner cell mass (ICM), trophectoderm (TE), epiblast (EPI), and primitive endoderm (PE) from the blastocyst, and ESCs at passages 0 and 10.
Figure 5—figure supplement 1.
Figure 5—figure supplement 1.. Majority rule consensus sequences used for remasking of human genome.
Consensus sequences for LTR7 subfamilies were generated from Figure 1 (main text – see Materials and methods). Here, we show which sequences were included for the LTR7B, LTR7Y, and LTR7C (LTR7C1+ LTR7C2) consensus sequence generations (purple circles, black labels) using the tree from Figure 1A. Previously annotated LTR7B/C/Y and LTR7 subfamily subdivision annotations from Figure 1 (main text) are shown with colored node ages. The LTR7u2 sister clade was not included in the LTR7Y consensus generation.
Figure 5—figure supplement 2.
Figure 5—figure supplement 2.. LTR7 subfamily expression in ‘primed’ and ‘naïve’ cell lines.
(Upper panel) Jittered boxplot representing the relative average expression difference (log2 FC) of LTR7 subgroups in naïve and primed cells. Every dot is the log2 FC of one replicate of naïve cells relative to primed cells of the corresponding subfamily. (Middle panel) Heatmap showing the differential expression of distinctive LTR7 subgroups (log2 FC) in individual naïve cells compared with their respective primed cells. (Lower panel) Bar plot showing the average expression across samples of each LTR7 subgroup.
Figure 6.
Figure 6.. An 8 bp insertion, SOX2/3 binding site necessary for LTR7up transcription.
(A) (log) p-values > 500 for HOMER motifs enriched in 7up1 insertion’s sequence blocks vs. the same blocks from other insertions from other human endogenous retrovirus type-H (HERVH) subfamilies are shown. (B) Line plots show SOX2 ChIP-seq signal at LTR7 subfamily loci in human embryonic stem cells (ESCs). Signal from genomic loci was compiled relative to position 0. The 7up/u1 8 bp insertion position is shown with a dotted line. Region 2b harboring SOX2/3 transcription factor binding site (TFBS) is detailed below. (C) Scheme of DNA fragments cloned into pGL3-basic vector driving luciferase gene expression (LUC) with identified SOX2/3 motifs. Three constructs were analyzed: Entire LTR7up (7up1), 7d1/2 consensus sequence (approximate ancestral sequence for all LTR7d) and LTR7up with eight nucleotides deleted (LTR7up (Δ8bp – AAAAGAAG)) (see panel B). (D) Normalized relative luciferase activity of tested fragments compared to LTR7 down; n = 4 measurements; bars, means across replicates; error bars, standard deviation of the mean, dots, individual replicates.
Figure 6—figure supplement 1.
Figure 6—figure supplement 1.. Transcription factor (TF) expression in the preimplantation embryo and TF binding in embryonic stem cells.
(A) Heatmap showing averaged and scaled expression (log of normalized CPM values) of human TFs with enriched motifs in LTR7-up1 subgroup and those are expressed ([log TPM (transcripts per million)] > 1) at least in one cell of early human embryos (A-C). Color scheme is based on z-score distribution, from –4 (light blue) to 4 (dark red). (B) The merged dotplot and heatmap demonstrates overrepresentation analysis of pluripotency TFs binding in distinctive LTR7 subgroups. Color intensity corresponds to log scaled enrichment of the analyzed ChIP-seq peaks within LTR7 subgroups compared to randomized 500 bps bins of human genome. Dots represent the significance (−log10 [FDR]) of ChIP-seq enrichment within a given LTR7 subgroup (see Materials and methods). (C) Violin plots detail the spread of expression of SOX2/3 in the preimplantation embryo. Y axis is log2 TPM.
Figure 6—figure supplement 2.
Figure 6—figure supplement 2.. Heatmap showing read pileup from GRO-seq plus strand and SOX2 ChIP-seq on 7up and 7u1 insertions.
Insertions are ranked from highest to lowest in SOX3 motif (HOMER – see Materials and methods) strength. Visible window is -500 bp to +8kb from start of 5' or solo LTR. SOX2 signal intensity over long terminal repeats (LTRs - not entire visible window) with this SOX3 motif and those without was not significantly different (Wilcoxon rank-sum test with Bonferroni correction).
Figure 6—figure supplement 3.
Figure 6—figure supplement 3.. Violin plots detailing the distribution of transcription factor (TF) expression shown in Figure 6—figure supplement 1 in early human embryos and embryonic stem cells.
Figure 7.
Figure 7.. Model of LTR7 subfamily evolution.
Estimated LTR7 subfamily transpositional activity in million years ago (mya) is listed with corresponding approximate primate divergence times (bottom). The positioning and duration of transpositional activity are based on analysis from Figure 3B. The gray connections between subfamilies indicate average tree topology which is driven by overall pairwise sequence similarity. Dashed lines indicate likely recombination events which led to the founding of new subfamilies. Stage-specific expression profiles from Figure 5A are detailed to the right of each corresponding branch.
Author response image 1.
Author response image 1.

References

    1. Babaian A, Mager DL. Endogenous retroviral promoter exaptation in human cancer. Mobile DNA. 2016;7:24. doi: 10.1186/s13100-016-0080-x. - DOI - PMC - PubMed
    1. Bailly-Bechet M, Haudry A, Lerat E. “One code to find them all”: a perl tool to conveniently parse RepeatMasker output files. Mobile DNA. 2014;5:13. doi: 10.1186/1759-8753-5-13. - DOI
    1. Bandelt HJ, Forster P, Röhl A. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution. 1999;16:37–48. doi: 10.1093/oxfordjournals.molbev.a026036. - DOI - PubMed
    1. Bannert N, Kurth R. Retroelements and the human genome: new perspectives on an old relation. PNAS. 2004;101 Suppl 2:14572–14579. doi: 10.1073/pnas.0404838101. - DOI - PMC - PubMed
    1. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9. - DOI - PMC - PubMed

Publication types