mBio. 2025 Feb 5;16(2):e0320024.
doi: 10.1128/mbio.03200-24. Epub 2024 Dec 23.

The protein structurome of Orthornavirae and its dark matter

Pascal Mutz et al. mBio.

Abstract

Metatranscriptomics is uncovering more and more diverse families of viruses with RNA genomes comprising the viral kingdom Orthornavirae in the realm Riboviria. Thorough protein annotation and comparison are essential to get insights into the functions of viral proteins and virus evolution. In addition to sequence- and HMM profile-based methods, protein structure comparison adds a powerful tool to uncover protein functions and relationships. We constructed an Orthornavirae "structurome" consisting of already annotated as well as unannotated ("dark matter") proteins and domains encoded in viral genomes. We used protein structure modeling and similarity searches to illuminate the remaining dark matter in hundreds of thousands of orthornavirus genomes. The vast majority of the dark matter domains showed either "generic" folds, such as single α-helices, or no high-confidence structure predictions. Nevertheless, a variety of lineage-specific globular domains that were new either to orthornaviruses in general or to particular virus families were identified within the proteomic dark matter of orthornaviruses, including several predicted nucleic acid-binding domains and nucleases. In addition, we identified a case of exaptation of a cellular nucleoside monophosphate kinase as an RNA-binding protein in several virus families. Notwithstanding the continuing discovery of numerous orthornaviruses, it appears that all the protein domains conserved in large groups of viruses have already been identified. The rest of the viral proteome seems to be dominated by poorly structured domains including intrinsically disordered ones that likely mediate specific virus-host interactions.

Importance: Advanced methods for protein structure prediction, such as AlphaFold2, greatly expand our capability to identify protein domains and infer their likely functions and evolutionary relationships. This is particularly pertinent for proteins encoded by viruses that are known to evolve rapidly and as a result often cannot be adequately characterized by analysis of the protein sequences. We performed an exhaustive structure prediction and comparative analysis for uncharacterized proteins and domains ("dark matter") encoded by viruses with RNA genomes. The results show that the dark matter of the RNA virus proteome consists mostly of disordered and all-α-helical domains that cannot be readily assigned a specific function and that likely mediate various interactions between viral proteins and between viral and host proteins. The great majority of globular proteins and domains of RNA viruses are already known, although we identified several unexpected domains represented in individual viral families.

Keywords: Orthornavirae; RNA virus; novel protein domains; protein structure prediction; proteome.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig 1
Frequencies of EMRV profile-annotated vs ICTV exemplar-annotated proteins across virus families. (A) Schematic of a prototype pangenome per virus family with 100 contigs which are either incomplete or complete but all contain the RdRp (blue), and some also contain helicase (Hel, green), a single jelly-roll capsid (SJR_capsid, pink), an unannotated domain (Dom_1, CUD, gray) or an ORFan (ORFan_1, white). (B) Number of unique annotated domains per virus family (“# of fcts/virfam”) for all virus families in the EMRV set (“EMRV: all”), of the 98 named virus families with a corresponding family in the ICTV exemplar set (“EMRV: 98 virfam”) and of the ICTV exemplar set (“ICTV: 98 virfam”). (C) Number of unique annotations based on profile comparison per virus family across all 498 families. (D) Same as panel C but with harmonized functions (e.g., combining all helicase-related labels as “Hel”). (E) Number of unique annotations within the ICTV exemplar virus families based on nvpc profile db comparison. (F) Harmonized functions (e.g., “capsid” represents the functional tags “nucleoprotein,” “SJR capsid,” “core,” and others assigned to capsid and nucleocapsid proteins) across the ICTV exemplar virus families as in panel E together with proteins which are annotated in GenBank but not in nvpc. (C–F) Inset shows frequencies for all functional domains that are present in at least two families.
Fig 2
Secondary structure assignments for CUDs and ORFans. Psique-based secondary structure assignments are shown for all CUDs (A) and ORFans (B) with a mean pLDDT ≥70. α-helical types are shown in the blue color range, β-strand and α-helical in the red color range, all-β in brown, and other in gray.
Fig 3
Overview of domains and ORFans of interest. (A) Number of representative COI (conserved unannotated domain of interest) structures binned as follows: (i) no overlap with present annotation in genome; (ii) conflict: there is a present annotation that slightly overlaps with the provisional CUD annotation; (iii) mixed: members of a CUD cluster had substantially different provisional PSI-BLAST annotations; (iv) extension: the provisional PSI-BLAST annotation of a CUD extended the annotation of an existing profile-based annotated domain; (v) generic fold: based on Dali results, the fold is a single helix, HTH, a beta-hairpin or disordered. Categories i–iii were analyzed further. (B) Schematic of the neighborhood analysis. Homologous multidomain proteins or polyproteins of neighboring genomes were aligned, and protein annotations were mapped on the alignment. If a putative COI region overlapped a confident annotation, it was considered annotated. (C) Number of COIs that were considered annotated as a result of the neighborhood analysis (bottom bar) and results of the semi-manual examination of COIs. New: a COI representative with a predicted structure not reported previously for the given virus family. Refine: Dali results pointed to a refinement of the annotation as the structure/function was already reported for other members in this virus family. Unclear: a high-confidence model was obtained for a COI but Dali hits were inconclusive (mainly, alpha/beta domains). Generic/low z: the structure is too generic to produce meaningful Dali hits (e.g., an alpha helix with a small beta-hairpin) or the Dali z-score was not significant (below 4). (D) Results of the semi-manual check of OOIs. Binning is as in C. (E) Rarefaction curve of distinct domains as a function of the number of sampled genome clusters (leaves). The blue line represents the mean of 30 bootstraps, and the gray area shows the range of unique domains at each sampling step (step size: 50 genome clusters).
Fig 4
Phytoreovirus core-P7 dsRNA-binding domain with a kinase fold in orthornaviruses. (A) Superposition of Picobirnaviridae 5′ ORFan (colored by pLDDT score, Walker A motif is shown in black, position of degraded Walker B motif is shown in gray) with adenylate kinase from Methanococcus igneus (6psp, green, Walker A motif shown in magenta, Walker B motif is shown in cyan, z-score 9.1). (B) Phylogenetic distribution of contigs encoding P7-dsRBD (red branches and asterisks), lysozyme (orange and asterisk), and capsid protein (purple) 5′ of the RdRp within Picobirnaviridae. Blue color indicates contigs containing less than 180 nt in front of the RdRp ORF (likely incomplete). (C) Phylogenetic tree based on a structure-guided alignment of viral P7-dsRBD domains found in different virus families by structure comparison (Picobirnaviridae) or profile comparison (EMRV set) with structurally similar kinases (z-scores 6–11; order as in tree: Dephospho-CoA kinase from Thermotoga maritima [2grjA]; chloramphenicol phosphotransferase from Streptomyces venezuelae [1qhyA]; adenylate kinase 3 from Homo sapiens [6zjdA]; adenylate kinase 5 from Homo sapiens [2bwjA]; uridylate kinase from Saccharomyces cerevisiae [1ukyA]; atypical mammalian nuclear adenylate kinase hCINAP from Homo sapiens [3iimA]; probable kinase from Leishmania major Friedlin [1y63A]; adenylate kinase from Methanococcus igneus [6pspA]; shikimate kinase from Arabidopsis thaliana [3nwjA]; shikimate kinase from Acinetobacter baumannii [4y0aA]; shikimate kinase from Erwinia chrysanthemi [1shkA]; APE1195 from Aeropyrum pernix K1 [2yvuA]; ATP sulfurylase from Penicillium chrysogenum [1i2dA]; and APS kinase CysC from Mycobacterium tuberculosis [4bzqA]). Branches of viral P7-dsRBD are colored in blue, and those of cellular kinases are colored in black.
Fig 5
Inactive RNA-binding domain fold in Hepeviridae. (A) Superposition of the likely inactive RNA-binding domain (RBD) fold found in Hepeviridae (colored by pLDDT score) with RBD 2 from A. thaliana protein HYL1 (pdb 3adj, green, z-score 7.5). (B) Phylogenetic tree of Hepeviridae RdRp. Branches containing the OOI with RBD-fold are colored in red. Branches with no coding capacity after the capsid are colored in blue. (C) Genome maps for representative Hepeviridae members that encode (or do not encode) the OOI. Annotations are based on profile analysis (1) and GenBank annotation (NC_018382). Protein domains: RdRp, RNA-directed RNA polymerase; Hel, helicase; CP, capsid protein; Pro, protease; ORF3, Hepeviridae ORF 3; Other, additional Hepeviridae domains; OOI, ORFan of interest.
Fig 6
Winged helix-turn-helix domain and movement protein ORFans in a novel viral family. (A) Superposition of wHTH domain identified in f.0008.base-Polycipi (colored by pLDDT score) with mouse HOP2 DNA-binding wHTH domain (2mh2, green, z-score: 11.1). (B) Superposition of predicted movement protein encoded by f.0008 members (colored by pLDDT score) with an annotated movement protein domain from Betaflexiviridae (AlphaFold2 modeled, green, MP_30K, z-score: 7). (C) Phylogenetic distribution of contigs encoding only the predicted movement protein (pink) or both movement protein and wHTH (red). Blue branches indicate contigs with less than 180 nt after the capsid-encoding ORF, which are likely incomplete. (D) Representative genome maps for members of f.0008 carrying the respective ORFs of interest (OOI). Protein domains: RdRp, RNA-directed RNA polymerase; Hel, helicase; CP, capsid protein; Pro, protease; OOI, ORF of interest (wHTH [red] or MP [pink]).
Fig 7
Endonuclease domain in Marnaviridae. (A) Superposition of representative Marnaviridae endonuclease domains (colored by pLDDT score) with the C-terminal domain of endonuclease EndoMS (pdb 5gkh, aa 127-end, protein: green; DNA: light sea green; z-score 8.2). Catalytic residues of the nuclease are highlighted in pink for EndoMS (K181, E179, and D156A from left to right) and in black for Marnaviridae endonuclease (K69, E67, and D54). D156A is experimentally mutated in 5gkh to obtain the structure with uncleaved DNA. (B) Structure-guided alignment between Marnaviridae endonuclease and the four Dali top hits: endonuclease EndoMS (5gkh, Archaea, Thermococcus kodakarensis KOD1, z-score 8.2), NucS (2vld, Archaea, Pyrococcus abyssi, z-score 7.3), Holliday junction resolvase Hjc (1ipi, Archaea, Pyrococcus furiosus, z-score 6.7) and nicking endonuclease Nt.BspD6I (5liq, Bacteria, Bacillus sp., z-score 6.1). (C) Genome maps of representative (nearly) complete Marnaviridae members from reference and ICTV exemplar (NC_007522) which either contain (top two) or lack (bottom two) the endonuclease domain. Annotations are based on profile analysis (1) and GenBank annotation (NC_007522). Protein domains: RdRp, RNA-directed RNA polymerase; Hel, helicase; CP, capsid protein; Pro, protease; Pro-Co, protease cofactor_calici-como32k-like; Zbd, Zn-binding domain; COI, unannotated domain of interest (Marnaviridae endonuclease); other, other unannotated domain. (D) Phylogenetic tree of Marnaviridae RdRps (1); clades in which each leaf represents at least one contig that encodes an endonuclease are shown in red. Blue branches indicate contigs with less than 60 aa left unannotated C-terminal of the RdRp domain in the polyprotein.
Fig 8
Hydrolase fold in Solemoviridae. (A) Superposition of putative hydrolase identified in Solemoviridae (colored by pLDDT score) with Arabidopsis thaliana SOBER1 deacetylase (pdb 6avw, green, z-score 7.0). (B) Phylogenetic tree in which leaves representing members encoding the putative hydrolase are colored red and leaves representing genomes with no coding capacity at the 3′ end for the putative hydrolase are colored blue (likely incomplete genomes). (C) Representative genome maps of Solemoviridae members. Annotations are based on profile analysis (1) and GenBank annotation (NC_002766; start of RdRp-encoding ORF at proposed frameshift leading to protease-RdRp polyprotein). Protein domains: RdRp, RNA-directed RNA polymerase; CP, capsid protein; Pro, protease; OOI, ORFan of interest; other, Solemoviridae-specific proteins p0 and p5.
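
The rarefaction analysis described in the Fig 3 caption (panel E: mean and range of distinct domains over 30 resamplings, evaluated every 50 genome clusters) can be illustrated with a minimal Python sketch. This is not the authors' code. The function name `rarefaction_curve`, the input layout (one set of domain labels per genome cluster), and the use of random permutations as the resampling scheme are all assumptions made for illustration; the caption does not specify how the bootstraps were drawn.

```python
import random
from statistics import mean


def rarefaction_curve(clusters, step=50, n_boot=30, seed=0):
    """Return (n_sampled, mean, min, max) of distinct domains per sampling step.

    clusters: list of sets, one set of domain labels per genome cluster
              (hypothetical input layout, assumed for this sketch).
    step:     number of genome clusters added per sampling step.
    n_boot:   number of random resamplings (here: random permutations).
    """
    rng = random.Random(seed)
    sizes = list(range(step, len(clusters) + 1, step))
    # Always finish on the full data set, even if it is not a multiple of step.
    if not sizes or sizes[-1] != len(clusters):
        sizes.append(len(clusters))

    counts = {n: [] for n in sizes}
    for _ in range(n_boot):
        order = list(clusters)
        rng.shuffle(order)
        seen = set()
        idx = 0
        for n in sizes:
            # Accumulate the domains of the next clusters up to sample size n.
            while idx < n:
                seen |= order[idx]
                idx += 1
            counts[n].append(len(seen))

    return [(n, mean(c), min(c), max(c)) for n, c in counts.items()]


if __name__ == "__main__":
    # Toy data: four genome clusters with made-up domain labels.
    toy = [{"RdRp"}, {"RdRp", "Hel"}, {"RdRp", "CP"}, {"RdRp"}]
    for row in rarefaction_curve(toy, step=2, n_boot=5):
        print(row)
```

A curve that flattens as more genome clusters are sampled would be consistent with the abstract's conclusion that most widely conserved domains have already been identified; the toy data above are far too small to show this and serve only to demonstrate the call.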

References

    1. Neri U, Wolf YI, Roux S, Camargo AP, Lee B, Kazlauskas D, Chen IM, Ivanova N, Zeigler Allen L, Paez-Espino D, et al. 2022. Expansion of the global RNA virome reveals diverse clades of bacteriophages. Cell 185:4023–4037. doi:10.1016/j.cell.2022.08.023
    2. Zayed AA, Wainaina JM, Dominguez-Huerta G, Pelletier E, Guo J, Mohssen M, Tian F, Pratama AA, Bolduc B, Zablocki O, et al. 2022. Cryptic and abundant marine viruses at the evolutionary origins of Earth’s RNA virome. Science 376:156–162. doi:10.1126/science.abm5847
    3. Edgar RC, Taylor B, Lin V, Altman T, Barbera P, Meleshko D, Lohr D, Novakovsky G, Buchfink B, Al-Shayeb B, Banfield JF, de la Peña M, Korobeynikov A, Chikhi R, Babaian A. 2022. Petabase-scale sequence alignment catalyses viral discovery. Nature 602:142–147. doi:10.1038/s41586-021-04332-2
    4. Lauber C, Seitz S. 2022. Opportunities and challenges of data-driven virus discovery. Biomolecules 12:1073. doi:10.3390/biom12081073
    5. Bukhari K, Mulley G, Gulyaeva AA, Zhao L, Shu G, Jiang J, Neuman BW. 2018. Description and initial characterization of metatranscriptomic nidovirus-like genomes from the proposed new family Abyssoviridae, and from a sister group to the Coronavirinae, the proposed genus Alphaletovirus. Virology 524:160–171. doi:10.1016/j.virol.2018.08.010