BMC Genomics. 2017 Feb 23;18(1):200.
doi: 10.1186/s12864-017-3566-0.

Exploratory bioinformatics investigation reveals importance of "junk" DNA in early embryo development


Steven Xijin Ge. BMC Genomics.

Abstract

Background: Instead of testing predefined hypotheses, the goal of exploratory data analysis (EDA) is to find what data can tell us. Following this strategy, we re-analyzed a large body of genomic data to study the complex gene regulation in mouse pre-implantation development (PD).

Results: Starting with a single-cell RNA-seq dataset consisting of 259 mouse embryonic cells derived from zygote to blastocyst stages, we reconstructed the temporal and spatial gene expression pattern during PD. The dynamics of gene expression can be partially explained by the enrichment of transposable elements in gene promoters and the similarity of expression profiles with those of corresponding transposons. Long Terminal Repeats (LTRs) are associated with transient, strong induction of many nearby genes at the 2-4 cell stages, probably by providing binding sites for Obox and other homeobox factors. B1 and B2 SINEs (Short Interspersed Nuclear Elements) are correlated with the upregulation of thousands of nearby genes during zygotic genome activation. Such enhancer-like effects are also found for human Alu and bovine tRNA SINEs. SINEs also seem to be predictive of gene expression in embryonic stem cells (ESCs), raising the possibility that they may also be involved in regulating pluripotency. We also identified many potential transcription factors underlying PD and discussed the evolutionary necessity of transposons in enhancing genetic diversity, especially for species with longer generation time.

Conclusions: Together with other recent studies, our results provide further evidence that many transposable elements may play a role in establishing the expression landscape in early embryos. They also demonstrate that exploratory bioinformatics investigation can pinpoint developmental pathways for further study and serve as a strategy to generate novel insights from big genomic data.

Keywords: Early embryogenesis; Exploratory data analysis; Pre-implantation development; Repetitive DNA; Single-cell RNA-seq; Transposons.


Figures

Fig. 1
a Exploratory bioinformatics investigation on gene regulation in mouse pre-implantation development. b Hierarchical clustering of gene expression during PD. Each of the 12,000 rows represents a gene. Columns correspond to samples labeled by developmental stages (E: early, M: middle, and L: late). Red indicates expression levels higher than average for the row. Expression lower than average is shown in green
Fig. 2
a Enrichment of repeat elements in the promoters of genes. The original data matrix represents the percentage of genes in a cluster that contain a repeat in their promoter. Red indicates that a repeat is enriched in the promoters of a gene cluster. b The Zfp352 gene contains an LTR element (MT2B1) around the translation start site. Another LTR element (RLTR26) is in the intron. These two positions correspond to known promoter regions marked as P1 and P2. Expression levels of Zfp352, Obox3, and retroelement MT2_Mm are shown in c, d, and e, respectively
Fig. 3
Enrichment of transcription factor binding motifs in genes highly induced at the mid 2-cell stage compared with the early 2-cell stage. a Enriched motifs in the promoters of genes with the highest fold-change at mid 2C. G1-G10 represent 10 groups of 500 genes sorted by fold-change. Red indicates enrichment of motifs. b The expression patterns of TFs that bind to the corresponding motifs shown in (a). c Binding motifs of Obox family factors in LTRs belonging to the ERVL family. d Expression patterns of Obox family homeobox factors are highly regulated during PD
Fig. 4
Enrichment of SINE elements in genes upregulated in the late 2-cell stage. a Average fold-change for 24 groups of 500 genes. b Relative enrichment (red) or depletion (green) of repeat elements in the promoters of genes in these groups. c FDR values derived from Chi-square tests of independence between repeat element frequency and gene groups. d Correlation of fold-change in 2C with the presence of B1 elements in gene groups. Each point in the plot represents one group of 500 genes. The vertical axis represents the percentage of genes in the group with B1_Mus2 in their promoters, while the horizontal axis represents average fold-change. e Association of repeats with fold-change during the late 2-cell stage. The average fold-change is shown for genes with one or more repeats within 2 kb on either the sense or antisense strand
Fig. 5
Presence of some repeats is strongly correlated with activation of gene expression during the 2C stage in a dosage- and distance-dependent manner. The heights of the bars indicate average fold-change in the late 2-cell stage compared to the early 2-cell stage. Significant deviations from zero are indicated by stars. The numbers on the bars represent the number of genes affected. The MT2 repeats in (a) and (b) can be MT2B, MT2B1, MT2B2, MT2_Mm, or MT2C_Mm. In (c) and (d), Alu family repeats mostly represent B1 elements. e Results from multiple linear regression show that SINE and LTR elements are correlated with gene expression change. f Correlation of the number of Alu family repeats in promoters and broad gene expression in various cells and tissues. Error bars show standard error calculated from replicates. R1 and J1 are mouse embryonic stem cell lines
Fig. 6
Genomic content of promoters of mESC-specific genes. a Results of k-means clustering of normal tissue gene expression show housekeeping genes (Cluster 24) and tissue-specific genes. b Gene clusters are plotted by average coverage of Alu/B1 repeats and CpG island coverage. Housekeeping genes (Cluster 24) are high in both Alu/B1 coverage and CpG island. mESC specific genes (Cluster 13) are high in Alu/B1, but low in CpG island. Gene Clusters specifically expressed in testis, intestine, and placenta are lower in both. c Devoid of CpG islands, the promoters of Pou5f1 and Nanog are enriched with Alu/B1 elements
Fig. 7
a Genes with more Alus in their promoters are upregulated in a dosage-dependent manner. Stars indicate significant difference from zero based on t-test. b Among gene groups defined by fold-change, highly expressed ones tend to contain more Alus in promoters in humans. c Association of gene expression with the tRNA family of SINE repeats in the bovine genome. Stars indicate significant difference from zero. d A DNA transposon, DNA11TA1_DR, is associated with gene upregulation during ZGA in zebrafish in a dosage-dependent manner. e Mouse genes with multiple Alu family repeats, mostly B1 elements, are associated with GO:0044224, intracellular part. Genes with L1 elements in their promoters, on the other hand, are depleted in genes related to intracellular part. f Genes with L1 elements are enriched with GPCR activity, while in genes containing Alu elements, such genes are depleted. g Mouse genes with SINE elements in their promoters are enriched in genes with yeast orthologs (>20% identity according to BioMart), compared to genes that do not contain such elements in their promoters. h Among SINE elements, Alu family, mostly B1 elements, are enriched with genes with yeast orthologs
Fig. 8
a Transcription factors (TFs) identified during PD. Upregulated and downregulated factors are shown in red and purple, respectively. Homeobox TFs are underlined. Genes coding for the TFs that are upregulated or downregulated at the same stages provide more confidence and are shown in bold. b A possible correlation between genome size and generation time across selected model organisms. c A possible correlation between genome size and longevity among animals. Taxonomical classes are represented by different shapes and colors. N = 939, R = 0.245, P = 2.4 × 10⁻¹⁴
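
The captions for Fig. 3a and Fig. 4a-d describe sorting genes by fold-change, splitting them into groups of 500, and testing whether repeat frequency is independent of gene group with a Chi-square test. The following is a minimal sketch of that kind of analysis on synthetic data, not the author's code. The field names (`log2fc`, `has_repeat`), the function names, and the simulated effect size are all assumptions made for illustration.

```python
import random

def group_by_fold_change(genes, group_size):
    """Sort genes by descending fold-change and split into consecutive groups."""
    ranked = sorted(genes, key=lambda g: g["log2fc"], reverse=True)
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]

def chi_square_independence(groups):
    """Chi-square statistic and degrees of freedom for a
    (gene group) x (repeat present / absent) contingency table.
    Assumes both column totals are non-zero."""
    table = []
    for grp in groups:
        with_repeat = sum(1 for g in grp if g["has_repeat"])
        table.append((with_repeat, len(grp) - with_repeat))
    total = sum(a + b for a, b in table)
    col_totals = (sum(a for a, _ in table), sum(b for _, b in table))
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (2 - 1)
    return stat, dof

# Synthetic genes (hypothetical): genes with a repeat in the promoter
# are simulated with a slightly higher fold-change on average.
random.seed(0)
genes = []
for i in range(3000):
    has_repeat = random.random() < 0.3
    log2fc = random.gauss(0.5 if has_repeat else 0.0, 1.0)
    genes.append({"name": f"gene{i}", "has_repeat": has_repeat, "log2fc": log2fc})

groups = group_by_fold_change(genes, 500)

# Per-group summaries, analogous to the two axes described for Fig. 4d.
avg_fold_change = [sum(g["log2fc"] for g in grp) / len(grp) for grp in groups]
percent_with_repeat = [
    100 * sum(1 for g in grp if g["has_repeat"]) / len(grp) for grp in groups
]

stat, dof = chi_square_independence(groups)
print(f"chi-square = {stat:.1f}, degrees of freedom = {dof}")
```

A p-value could be obtained from the Chi-square distribution with `dof` degrees of freedom, for example with `scipy.stats.chi2.sf(stat, dof)` if SciPy is available. The FDR values mentioned in Fig. 4c would additionally require a multiple-testing correction across repeat types, which this sketch does not cover.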
