BMC Biol. 2022 Dec 8;20(1):272.
doi: 10.1186/s12915-022-01470-5.

De novo emergence, existence, and demise of a protein-coding gene in murids

Jan Petrzilek et al. BMC Biol. 2022.

Abstract

Background: Genes, principal units of genetic information, vary in complexity and evolutionary history. Less-complex genes (e.g., long non-coding RNA (lncRNA) expressing genes) readily emerge de novo from non-genic sequences and have high evolutionary turnover. Genesis of a gene may be facilitated by adoption of functional genic sequences from retrotransposon insertions. However, protein-coding sequences in extant genomes rarely lack any connection to an ancestral protein-coding sequence.

Results: We describe the remarkable evolution of the murine gene D6Ertd527e and its orthologs in the rodent superfamily Muroidea. D6Ertd527e emerged in a common ancestor of mice and hamsters, most likely as a lncRNA-expressing gene. A major contributing factor was a long terminal repeat (LTR) retrotransposon insertion carrying an oocyte-specific promoter and a 5' terminal exon of the gene. The gene survived as an oocyte-specific lncRNA in several extant rodents, while in others the gene or its expression was lost. In the ancestral lineage of Mus musculus, the gene acquired protein-coding capacity, with the bulk of the coding sequence formed through CAG (AGC) trinucleotide repeat expansion and duplications. These events generated a cytoplasmic serine-rich maternal protein. Knock-out of D6Ertd527e in mice has a small but detectable effect on fertility and the maternal transcriptome.

Conclusions: Although this evolving gene shows no clear function in laboratory mice, its documented evolutionary history in Muroidea over the last ~40 million years provides a textbook example of how several common mutation events can support de novo gene formation, the evolution of protein-coding capacity, and a gene's demise.

Keywords: CAG; D6Ertd527e; De novo; Evolution; Gene; LTR; Oocyte; Polyserine; Retrotransposon.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Murine D6Ertd527e gene characterization. a Scheme of the gene and alternative transcripts in mouse oocytes. Shown is a modified UCSC genome browser snapshot depicting expression in the oocyte, three alternative 5′ exons, and the common 3′ exon. The red rectangle depicts position of the MTD LTR insert providing the dominating oocyte-specific promoter and the first exons. The blue rectangle depicts CDS overlapping with a CAG repeat expansion. Dashed line indicates level of 250 counts per million (CPM) in RNA-seq data from C57Bl/6 fully-grown GV oocytes [27]. b Conservation of selected D6Ertd527e sequences in species having the MTD LTR insertion. The left alignment of sequences shows an MTD region carrying the AUG codon, the putative coding part of the 5′ exon, and the splice donor. The right alignment of sequences shows the splice acceptor sequence and beginning of the 3′ exon. In bold font are sequences from species where RNA-seq data are available. Colored in brown and yellow are exon and intron sequences, which were validated by RNA-sequencing. In red are depicted two notable mutations. Homologous sequence of the splice acceptor in Ondatra was not reliably determined. c Adapted timetree [28] of selected rodent species showing basic taxonomic grouping and their phylogenetic relationship. Red lines depict a part of the phylogenetic tree associated with D6Ertd527e MTD LTR in extant genomes. Grey lines lead to species, where the MTD LTR insertion was lost by deletion. The timetree was generated by the TimeTree of Life 5 resource [29]. The timescale below the tree is in millions of years ago as approximated by the TimeTree 5 tool. For more precise phylogenetic analysis of muroid species and discussion of divergence dates see [30]. d D6Ertd527e expression in mouse and hamster oocytes and zygotes. Data were compiled from published RNA-seq data from mouse [31, 32] and hamster [33] samples. Replicates were available for hamster data, n = 3, error bars = SD
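The caption above reports expression in counts per million (CPM). As a minimal sketch of that normalization, assuming CPM is computed as raw gene counts divided by the total number of mapped reads in the library and scaled by one million (the caption does not state the authors' exact procedure), the calculation looks like this. The function name and the read counts are hypothetical and chosen only for illustration.

```python
def counts_per_million(gene_counts, total_mapped_reads):
    """Scale raw read counts to counts per million mapped reads (CPM).

    gene_counts: dict mapping gene name -> raw read count (hypothetical input).
    total_mapped_reads: library size used as the denominator (an assumption;
    some pipelines use the sum of counted reads instead).
    """
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return {
        gene: count * 1_000_000 / total_mapped_reads
        for gene, count in gene_counts.items()
    }


# Made-up numbers: 5,000 reads on a gene in a library of 20 million mapped reads.
example = counts_per_million({"D6Ertd527e": 5000, "other_gene": 20}, 20_000_000)
print(example)
```

With these made-up inputs, a gene with 5,000 reads in a 20-million-read library sits exactly at the 250 CPM level marked by the dashed line in panel a.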
Fig. 2
D6Ertd527e transcript variability. a Variability of exon–intron structure of D6Ertd527e transcripts in oocytes of five different rodent species. Shown are modified UCSC genome browser snapshots depicting distribution of RNA-seq reads, level of expression, and exon–intron structures inferred from analysis of individual spliced sequence reads. Position of the MTD LTR insert is indicated by red rectangles. Blue rectangles depict regions containing expanded CAG repeats. Full display of repetitive sequences from RepeatMasker is available in Additional file 1: Fig. S2. Dashed lines indicate normalized expression level in CPM. Rattus norvegicus analysis revealed a single spliced read among > 120 million mapped reads from four independent libraries. b Distribution of AGC codons in predicted D6Ertd527e transcripts in rodent species carrying the MTD LTR insertion. In the case of Cricetulus griseus, we used the most abundant transcript isoform transcribed from a promoter upstream of the MTD insert. In the case of Rattus norvegicus, where the locus seems silent, we show a hypothetical transcript spliced between the conserved splice sites (Fig. 1c) to demonstrate that the putative coding sequence starting from the AUG codon in MTD is soon terminated. The CPAT score [37] was calculated for predicted coding sequences, represented by the thicker part of a transcript scheme. The recommended cut-off for the mouse coding probability in CPAT release 3.00 was 0.44 [37]
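Panel b maps AGC codons along predicted coding sequences, and the abstract writes the repeat as "CAG (AGC)". The short sketch below illustrates why the two notations describe the same DNA: a CAG repeat read one nucleotide later is an AGC repeat, and AGC encodes serine. It is only an illustration on a toy sequence, not the authors' analysis; the function name and the toy CDS are hypothetical.

```python
def agc_codon_positions(cds):
    """Return 0-based codon indices of in-frame AGC (serine) codons.

    cds: coding sequence starting at the start codon (DNA or RNA letters).
    Trailing nucleotides that do not fill a complete codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [idx for idx, codon in enumerate(codons) if codon == "AGC"]


# Toy CDS (hypothetical): start codon, five AGC codons, stop codon.
toy_cds = "ATG" + "AGC" * 5 + "TAA"
positions = agc_codon_positions(toy_cds)

# The same repeat tract viewed in two frames: shifting a CAG repeat by one
# nucleotide yields an AGC repeat.
cag_tract = "CAG" * 4
shifted = cag_tract[1:10]

print(positions, shifted)
```

Plotting the returned codon indices along each transcript would give a distribution of the kind shown in panel b; whether an expanded tract contributes serine codons depends on which frame the upstream AUG sets.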
Fig. 3
Diversity and composition of predicted D6Ertd527e proteins. a Amino acid composition of predicted D6Ertd527e proteins in selected rodents, inferred from transcriptome data and predicted MTD-driven D6Ertd527e transcripts (Fig. 2b). From the previously analyzed Mus musculus inbred strains [12], examples were selected to illustrate variability among the strains. Hatching in Mus spicilegus and Mus caroli reflects presence of a block of Ns in their genomic DNA. b Composition of the coding sequence in the genus Mus shows that expansion of the coding sequence stems partially from CAG repeat expansion (one such repeat is indicated by black arrowheads) but mostly from sequence duplications of variable lengths (various duplicated segments are depicted by colored rectangles below the protein sequence)
Fig. 4
Murine D6Ertd527e protein expression and structure. a Ectopically expressed C-terminally tagged D6Ertd527e protein in NIH 3T3 and HeLa cells can be detected by Western blotting. b Expression of C-terminally HA-tagged D6Ertd527e in oocytes analyzed by immunofluorescent staining and confocal microscopy. Approximately 100,000 in vitro-transcribed mRNA molecules were microinjected into fully-grown mouse GV oocytes and the protein was visualized by immunofluorescent staining with α-HA antibody (green). DNA was stained with DAPI (blue). Scale bar = 20 μm. c Hypothetical folding of D6Ertd527e protein predicted by AlphaFold [43]
Fig. 5
D6Ertd527e knock-out analysis. a Schematic depiction of the positions of designed CRISPR cleavage points. b A UCSC browser snapshot of data from RNA-seq libraries from oocytes from three wild-type mice and three mutants. Low residual signal in the coding sequence in D6Ertd527e mutants can be explained by multimapping repetitive reads originating from other loci. c Breeding performance of matings with different combinations of genotypes. p-values were calculated with two-tailed t-tests. d PCA of RNA-seq libraries suggests higher variability of wild-type controls and clustering of mutant transcriptomes. e MA plot depicting differentially expressed genes in D6Ertd527e mutant oocytes. Significantly upregulated and downregulated transcripts are shown in red and blue, respectively. The most abundant significantly changed transcripts were Gm20763 and Ccnd2
Fig. 6
Phases of gene life-cycle during evolution. The upper scheme represents a locus, which does not produce any transcript. Such a locus can give rise to a lncRNA - emergence of a pol II promoter will be sufficient to produce a non-coding transcript. A promoter can emerge from a random sequence or through a solo LTR insertion. The initial transcript at the locus will likely be a lncRNA with variable exon–intron structure as mRNA processing mechanisms will recognize with variable efficiency cryptic splice donors and acceptors as well as poly(A) sites. Such a lncRNA can evolve into a protein-coding gene through recycling a protein-coding sequence from a processed pseudogene inserted into the locus

References

    1. Johannsen W. Elemente der exakten erblichkeitslehre. Deutsche wesentlich erweiterte ausgabe in fünfundzwanzig vorlesungen. Jena: G. Fischer; 1909. p. 534. https://www.archive.org/download/elementederexakt00joha/page/n4_w509.
    2. Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, et al. What is a gene, post-ENCODE? History and updated definition. Genome Res. 2007;17(6):669–681. doi: 10.1101/gr.6339607.
    3. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. doi: 10.1186/1471-2105-4-41.
    4. Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, et al. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res. 2013;41(Database issue):D348–52.
    5. Mushegian A. Gene content of LUCA, the last universal common ancestor. Front Biosci. 2008;13:4657–4666. doi: 10.2741/3031.