Mol Biol Evol. 2024 Apr 2;41(4):msae061.
doi: 10.1093/molbev/msae061.

The Diverse Evolutionary Histories of Domesticated Metaviral Capsid Genes in Mammals

William S Henriques et al. Mol Biol Evol. 2024.

Abstract

Selfish genetic elements comprise significant fractions of mammalian genomes. In rare instances, host genomes domesticate segments of these elements for function. Using a complete human genome assembly and 25 additional vertebrate genomes, we re-analyzed the evolutionary trajectories and functional potential of capsid (CA) genes domesticated from Metaviridae, a lineage of retrovirus-like retrotransposons. Our study expands on previous analyses to unearth several new insights about the evolutionary histories of these ancient genes. We find that at least five independent domestication events occurred from diverse Metaviridae, giving rise to three universally retained single-copy genes evolving under purifying selection and two gene families unique to placental mammals, with multiple members showing evidence of rapid evolution. In the SIRH/RTL family, we find diverse amino-terminal domains, widespread loss of protein-coding capacity in RTL10 despite its retention in several mammalian lineages, and differential utilization of an ancient programmed ribosomal frameshift in RTL3 between the domesticated CA and protease domains. Our analyses also reveal that most members of the PNMA family in mammalian genomes encode a conserved putative amino-terminal RNA-binding domain (RBD) both adjoining and independent from domesticated CA domains. Our analyses lead to a significant correction of previous annotations of the essential CCDC8 gene. We show that this putative RBD is also present in several extant Metaviridae, revealing a novel protein domain configuration in retrotransposons. Collectively, our study reveals the divergent outcomes of multiple domestication events from diverse Metaviridae in the common ancestor of placental mammals.

Keywords: LTR retrotransposon; PNMA; RNA-binding; SIRH; capsid; exaptation; gene conservation; positive selection.

Figures

Fig. 1.
Full-length CA-like ORFs in the human genome. We generated a maximum-likelihood phylogenetic tree of 212 full-length CA-like ORFs in the human genome from an alignment of the full-length CA domain (238 positions). Hidden Markov Model profile searches identified CA sequences from ERVs (gray branches), and metavirus retrotransposons (red branches). Maximum-likelihood-based support values at selected nodes were calculated in FastTree2 (Price et al. 2010). The multiple sequence alignment and Newick files are available in the Supplementary Data, Supplementary Material online.
Fig. 2.
Metaviral-derived CA genes show distinct evolutionary trajectories across placental mammals. Using phylogenetic trees (supplementary figs. S2 to S7, Supplementary Material online), we assigned each metaviral-derived CA gene to one of 19 orthologous groups (labeled in red and supplementary table S2, Supplementary Material online) that are restricted to placental mammals, or to two marsupial groups (PNMA-MS1 and SIRH12). Some full-length metaviral-derived CA genes are universally retained across placental mammals as intact genes (▪), whereas others have experienced lineage-specific loss (-), pseudogenization (⊠) or duplication events (a second ▪ or ⊠ within a column). Boxes containing an “X” (i.e. ⊠) represent sequences with obvious inactivating mutations (frameshifts and/or premature stops). Gray boxes represent sequences that are truncated by gaps in the assembly. The two pseudogenes depicted for RTL1 in opossum and wallaby (dashed squares with crosses) were previously reported as gene fragments (Edwards et al. 2008). Our analysis (performed using similar methods) did not reveal convincing RTL1 homology in marsupial genomes. Most sequences are represented by individual boxes, but in cases where pseudogenized duplicates are numerous, the number of pseudogene duplicates is represented as xN (e.g. PNMA6E/F has four pseudogenized duplicates in horse). The status of SIRH12 in wallaby is denoted with a “?” to indicate uncertainty. The previously reported wallaby SIRH12 ORF (Ono et al. 2011) (with an exact match to the macEug2 version of the reference genome assembly) encodes a 107 amino acid protein, but contains a frameshifting change in a newer assembly (mMacEug1). SIRH12's ortholog in opossum is a pseudogene. The previously reported PNMA-MS2 pseudogene (Iwasaki et al. 2013) is not shown, because there are no apparently intact representatives of this sequence. 
The multiple sequence alignments and Newick tree files used to assign orthologous groupings are available in the Supplementary Data, Supplementary Material online.
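The intact/pseudogene calls described in this caption rest on spotting "obvious inactivating mutations (frameshifts and/or premature stops)". The following is a minimal sketch of how such a call could be made from a pairwise codon alignment. It is not the authors' pipeline: the function name `classify_orf`, the gap-based frameshift rule, and the toy sequences are all assumptions made for illustration.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(ref_aln: str, qry_aln: str) -> str:
    """Classify an aligned query ORF as 'intact', 'pseudogene', or 'absent'.

    ref_aln and qry_aln are two rows of a pairwise nucleotide alignment
    ('-' marks a gap). Hypothetical rule set, assumed for illustration:
      * a gap run whose length is not a multiple of 3 counts as a frameshift
      * an in-frame stop codon before the final codon counts as premature
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("alignment rows must have equal length")

    cds = qry_aln.replace("-", "").upper()
    if not cds:
        return "absent"  # no matching sequence at all

    # Frameshift: any indel (in either row) that is not a whole number of codons.
    frameshift = any(
        len(match.group()) % 3 != 0
        for row in (ref_aln, qry_aln)
        for match in re.finditer(r"-+", row)
    )

    # Premature stop: a stop codon anywhere except the last complete codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    premature_stop = any(codon in STOP_CODONS for codon in codons[:-1])

    return "pseudogene" if frameshift or premature_stop else "intact"


if __name__ == "__main__":
    reference = "ATGAAACCCGGGTAA"  # toy sequence, not real data
    print(classify_orf(reference, "ATGAAACCCGGGTAA"))
    print(classify_orf(reference, "ATGAAA-CCGGGTAA"))
```

A real analysis would also need to handle assembly gaps (the gray boxes in the figure) separately, since a truncated sequence is neither clearly intact nor clearly inactivated.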
Fig. 3.
RTL3 and RTL10 exhibit domain-specific and lineage-specific patterns of conservation. A) 364 aa of the CA-containing RTL10 protein are conserved in human, some rodents, cetacea, and some bats, while the annotated mouse RTL10 protein is only 146 aa and the ORF has two frameshifting 1 bp deletions with respect to human RTL10, resulting in limited protein-coding homology and the absence of a predicted CA domain. B) Simplified depictions of RTL3 showing CA homology (magenta) and PR homology in the −1 reading frame (light green). In mouse and many other species, the two ORFs overlap and therefore likely encode a CA–PR fusion protein via a programmed ribosomal frameshift. In contrast, in human, the two ORFs do not overlap, and either encode separate proteins, or represent functional loss as seen in other simian primate genomes. Three-frame ORF analysis of RTL3 in human and mouse showing ATG start codons (short vertical bars) and stop codons (tall vertical bars). HMM homology is shown in magenta/green, and stop-free regions containing each domain are shown in gray. C) RTL3 ORFs are subject to purifying selection for protein-coding function, with dN/dS of <1, and maximum-likelihood tests (“M0 P-value”) that support the dN/dS shown over the null model of neutral evolution. dN/dS values reported here differ from those reported in Figs. 2 and 4 due to the use of additional sequences. D) Summary of RTL10 and RTL3 status in an expanded set of mammalian genomes. The species tree is a trimmed version from Upham et al. (2019). Branches depicted using dashed lines show lineages where the RTL3 CA–PR fusion protein has been lost. Filled squares at the terminus of each branch represent intact ORFs and squares containing a cross represent sequences with obvious inactivating mutations (frameshifts and/or premature stops). Gray boxes represent sequences that are truncated due to genome assembly gaps, and “-” symbols indicate that we found no matching sequence.
For RTL10, human and mouse encode different ORFs due to frameshifts: the first column (“Human”) shows that a CA-containing ORF is retained in a limited number of diverse mammalian clades, and is lost or pseudogenized in many other genomes. The second column (“Mouse”) shows that even closely related rodent genomes do not preserve the ORF found in mouse. Vertical lines with adjacent numbers show the dN/dS of these ORFs in selected clades. The multiple sequence alignments used for the PAML analyses are available in the Supplementary Data, Supplementary Material online.
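The three-frame ORF analysis in panel B marks ATG start codons and stop codons in each forward reading frame and highlights stop-free regions. The sketch below shows one way such a scan could be computed. It is an illustration rather than the authors' code: the function names and the toy sequence are hypothetical, and only the standard stop codons TAA, TAG, and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_frames(seq: str) -> dict:
    """For each forward frame (0, 1, 2), list 0-based positions of ATG and stop codons."""
    seq = seq.upper()
    frames = {}
    for frame in range(3):
        starts, stops = [], []
        # Step through complete codons only.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG":
                starts.append(i)
            elif codon in STOP_CODONS:
                stops.append(i)
        frames[frame] = {"starts": starts, "stops": stops}
    return frames


def longest_stop_free(seq: str, frame: int) -> tuple:
    """Return (start, end) of the longest stop-free stretch in one frame.

    Coordinates are 0-based nucleotide positions; end is exclusive.
    """
    seq = seq.upper()
    best = (frame, frame)
    run_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = i + 3  # next stretch begins after the stop codon
    # Close the final stretch at the end of the last complete codon.
    end = frame + 3 * ((len(seq) - frame) // 3)
    if end - run_start > best[1] - best[0]:
        best = (run_start, end)
    return best


if __name__ == "__main__":
    toy = "ATGAAATAGCCATGGGGTGA"  # toy sequence, not real RTL3 data
    for frame, hits in scan_frames(toy).items():
        print(frame, hits, longest_stop_free(toy, frame))
```

Two stop-free regions in adjacent frames that overlap, as the caption describes for mouse RTL3, would be compatible with a frameshift joining them; non-overlapping regions, as in human, would not.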
Fig. 4.
Evolutionary rates of domesticated metaviral genes across placental mammals and primates. We generated in-frame alignments for each gene across placental mammals, or across primates, and analyzed the evolutionary selective pressures on each using PAML's codeml algorithm. We used the placental mammal and primate alignments to assess “overall” selective pressure, assuming a single dN/dS ratio across sites and lineages (“M0 = model 0”). Lower dN/dS ratios indicate stronger purifying selection and dN/dS = 1 is neutral evolution. To test for positive selection, we analyzed primate alignments using PAML's codeml algorithm (codon model = 2, initial dN/dS = 0.4, cleandata = 0), but this time we compared the log likelihoods of an evolutionary model that allows for a subset of residues under positive selection (model 8) with a paired model that only allows purifying and neutral selection (model 8a). For genes where the maximum-likelihood test indicates positive selection, we report the proportion of sites estimated to be under positive selection, the dN/dS of this class of sites, and the identity of sites with a >90% posterior probability of being members of the positively selected class (codeml's “BEB” Bayes Empirical Bayes method). The multiple sequence alignments used for the PAML analysis are available in the Supplementary Data, Supplementary Material online.
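The comparison of model 8 against model 8a is a likelihood ratio test on the two log likelihoods reported by codeml. The sketch below shows the arithmetic under stated assumptions: the caption does not give the reference distribution, so a chi-square distribution with 1 degree of freedom is assumed here (a common, conservative choice for this model pair), and the log-likelihood values are invented for illustration.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> tuple:
    """Likelihood ratio test for nested models, assuming chi-square with 1 df.

    lnl_null: log likelihood of the null model (e.g. model 8a)
    lnl_alt:  log likelihood of the alternative model (e.g. model 8)
    Returns (test statistic, p-value).
    """
    # Test statistic is twice the log-likelihood difference; clamp at zero
    # in case numerical optimization leaves the alternative slightly worse.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For 1 degree of freedom, the chi-square survival function has the
    # closed form erfc(sqrt(x / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical log likelihoods, not values from the paper.
    stat, p = lrt_pvalue(lnl_null=-1000.0, lnl_alt=-995.0)
    print(f"2*deltaLnL = {stat:.3f}, p = {p:.4g}")
```

A small p-value favors the model that permits a class of sites with dN/dS > 1; site-level calls would then come from the BEB posterior probabilities that codeml reports, which this sketch does not attempt to reproduce.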
Fig. 5.
Four independent metavirus domestication events include structurally distinct N-terminal domains. A) A maximum-likelihood phylogenetic tree of 949 CA sequences from 24 vertebrate genomes and Repbase. The tree includes 119 domesticated metaviral CA genes in mammals (dark purple, highlighted) and 830 metaviral CA-like ORFs (light pink) from selected nonmammalian vertebrate genomes: chicken (n = 1), alligator (n = 13), painted turtle (n = 32), anole lizard (n = 310), African clawed frog (n = 291) and coelacanth (n = 15) and consensus vertebrate metaviral elements from the database Repbase (n = 175), aligned across 172 positions in the CA domain. Maximum-likelihood-based support values were calculated in FastTree2. Metaviral gag genes containing the PNMA N-terminal domain are phylogenetic neighbors to mammalian PNMA family genes (gray highlight, deep purple lines). The closest consensus Repbase sequence for each domestication is indicated (XT: Xenopus tropicalis, African clawed frog; Ano: Anolis carolinensis, anole lizard; Lch: Latimeria chalumnae, coelacanth; Ami: Alligator mississippiensis, American alligator). B to E) Domain architecture (not to scale) of human Metaviridae-derived CA genes, organized according to major clades in the tree shown in panel A. Colored boxes indicate domains within each ORF identified by HMM profile searches, structural prediction, and structural homology searches (mn, Metaviral N-terminus, numbered 1 to 4 to indicate four unique N-terminal domains in the visualized metaviruses; ntd, N-terminal domain; ca, capsid; nc, nucleocapsid; pr, protease; dutp, dUTPase; rt, reverse transcriptase; rnh, RNase H; int, integrase; chr, chromodomain; LTR, Long Terminal Repeat). The multiple sequence alignment and Newick files are available in the Supplementary Data, Supplementary Material online.
Fig. 6.
Predicted RBD in the PNMA family and related metaviruses. A) Illustration of a metavirus from the alligator genome colored by domain. The first ∼100 amino acids of related CAs in the human genome are predicted to form an RBD. This domain is also found independent of the CA. B) AlphaFold structural prediction of PNMA1 (purple), shown alone (left) as well as superimposed on an experimentally determined structure (gray, right) of the RBD (PDB: 7LMA) from telomerase p65 (RMSD between 41 pruned Cα atoms). C) Retention of putative RBD-only genes in placental mammals. Filled squares represent intact genes and squares containing a cross represent sequences with obvious inactivating mutations (frameshifts and/or premature stops). Gray boxes represent sequences that are truncated due to genome assembly gaps, and “-” symbols represent cases where we find no matching sequence. A known species tree is shown on the left and was obtained by pruning a whole-genome tree available via the UCSC genome browser. D) CCDC8 translation initiates at an upstream noncanonical CTG start codon (dark purple). Aggregated data from many ribosomal profiling studies displayed via the GWIPS-viz genome browser (Michel et al. 2014) show an accumulation of initiating ribosomes at the noncanonical CTG start site (dark purple) rather than the canonical start site (light purple). The additional 70 N-terminal amino acids are highly conserved across mammals. The extended N-terminus is predicted (AlphaFold) to encode a full-length RBD. The AlphaFold predictions, multiple sequence alignments, and Newick files used to assign orthologous groups are available in the Supplementary Data, Supplementary Material online.
