Mol Biol Evol. 2023 Apr 4;40(4):msad060.

doi: 10.1093/molbev/msad060.

Identification of RNA Virus-Derived RdRp Sequences in Publicly Available Transcriptomic Data Sets

Ingrida Olendraite¹, Katherine Brown¹, Andrew E Firth¹

PMID: 37014783
PMCID: PMC10101049
DOI: 10.1093/molbev/msad060

Affiliation

¹ Division of Virology, Department of Pathology, Addenbrooke's Hospital, University of Cambridge, Cambridge, United Kingdom.

Abstract

RNA viruses are abundant and highly diverse and infect all or most eukaryotic organisms. However, only a tiny fraction of the number and diversity of RNA virus species has been catalogued. To cost-effectively expand the diversity of known RNA virus sequences, we mined publicly available transcriptomic data sets. We developed 77 family-level profile Hidden Markov Models (pHMMs) for the viral RNA-dependent RNA polymerase (RdRp), the only universal "hallmark" gene of RNA viruses. By using these to search the National Center for Biotechnology Information Transcriptome Shotgun Assembly (TSA) database, we identified 5,867 contigs encoding RNA virus RdRps or fragments thereof and analyzed their diversity, taxonomic classification, phylogeny, and host associations. Our study expands the known diversity of RNA viruses, and the 77 curated RdRp pHMMs provide a useful resource for the virus discovery community.

Keywords: Orthomyxoviridae; RNA virus; RdRp; pHMM; splicing; virus discovery.

Figures

**Fig. 1.**
Various metrics of identified sequences. (A) Numbers of identified RdRp-encoding ORFs (ref, nr/nt, and TSA) and their lengths after trimming to the RdRp core (see main text) and removing duplicate identical sequences. (B) Percentage increase in the number of RdRp clusters as a function of trimmed RdRp core fragment length (x-axis) and clustering identity threshold, upon adding the TSA-derived sequences to the nr/nt and ref sequences. The y-axis shows the percentage increase in clusters after using different CDHIT (Li and Godzik 2006; Fu et al. 2012) identity thresholds (50%, 70%, 90%, and 100%, as indicated) for nr/nt + TSA sequences compared with nr/nt sequences alone. (C) Numbers of sequences identified in each cluster at different pairwise amino acid identity thresholds. Duplicate identical sequences were removed. Identities were calculated via pairwise alignment in Biopython (Cock et al. 2009, see Materials and Methods) and dividing the number of identical aligned residues by the shorter sequence length.
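The identity measure described for panel C can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the two sequences have already been pairwise aligned (for example with Biopython) and are supplied as equal-length gapped strings. The function name `identity_over_shorter` and the `-` gap character are assumptions made for this example.

```python
def identity_over_shorter(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical aligned residues divided by the shorter ungapped length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both rows carry the same residue; gap columns never count.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Ungapped lengths of the two original sequences.
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return identical / shorter
```

Dividing by the shorter length means that a fragment fully contained in a longer sequence scores 1.0, which suits data sets that mix full-length RdRps with partial contigs.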

**Fig. 2.**
Total number of sequences in each classified group containing 100 or more sequences (cluster numbers C1–C28; see supplementary fig. S2, Supplementary Material online for all clusters C1–C60). Blue (darker, on the left), nr/nt and ref sequences; pink (lighter, on the right), TSA sequences; +s, +ssRNA viruses; −s, −ssRNA viruses; and ds, dsRNA viruses.

**Fig. 3.**
Distributions of RdRps and host species across TSA data sets. Only nonidentical RdRp core sequences were used (i.e., discarding duplicate 100% identical RdRp sequences within each classified pHMM group, including any identical to nr/nt or ref sequences, leaving the longest representative). No RdRps were detected in the eight bacterial TSA data sets with our pHMMs. (A) RdRp counts per host type. (B) TSA data set counts per host type. (C) Mean number of RdRps per host species. Note that within metagenomics samples, the majority of “species” were named “gut metagenome.” (D) Numbers of unique TSA data set host species, grouped by host type.

**Fig. 4.**
Numbers of unique putative host species (A) and numbers of TSA-derived RdRps (B) for different classified pHMM clusters, separated by host species category as indicated in the key. Duplicate 100% identical RdRp core sequences were removed (as in fig. 5). Asterisk (*): note that the majority of the metagenomic data sets are labeled as “gut metagenome,” which is here counted as a single “species” name.

**Fig. 5.**
Conservation and diversity in RdRp motif C. (A) Sequence logo, produced with WebLogo (Crooks et al. 2004), showing overall amino acid frequencies in the core amino acid triplet and the five flanking amino acids on either side. All nonidentical trimmed RdRp sequences in our study were used (cdhit -c 1.0). (B) Schematic representation of motif C central triplet variability overlaid on the inferred evolutionary relationships of RNA viruses from Wolf et al. (2018, 2020).

**Fig. 6.**
Heatmap of pHMM match co-occurrences for each RdRp sequence. All classified group ref, nr/nt, and TSA RdRp ORFs were used (supplementary data set 1, Supplementary Material online). For each group on the y-axis (best match pHMM), the number of co-occurrences with each group on the x-axis (second best match pHMM) was determined, and the count was normalized by the maximum count for the group given on the y-axis. Thus 1.0 is the highest co-occurrence score, whereas 0.0 corresponds to pairs of pHMMs that were never matched by the same sequence.
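The row normalization described in this caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the input is assumed to be one `(best, second_best)` pair of pHMM group names per sequence, and the function name `normalized_cooccurrence` is hypothetical.

```python
from collections import Counter, defaultdict


def normalized_cooccurrence(pairs):
    """Map each best-match group to {second-best group: count / row maximum}.

    `pairs` is an iterable of (best_pHMM, second_best_pHMM) names,
    one pair per RdRp sequence.
    """
    counts = defaultdict(Counter)
    for best, second in pairs:
        counts[best][second] += 1

    normalized = {}
    for best, row in counts.items():
        # Scale each row (y-axis group) by its own maximum count,
        # so the most frequent partner in every row scores 1.0.
        row_max = max(row.values())
        normalized[best] = {second: n / row_max for second, n in row.items()}
    return normalized
```

Pairs that were never observed are simply absent from the result and correspond to the 0.0 cells of the heatmap. Because each row is scaled by its own maximum, the matrix is not symmetric.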

**Fig. 7.**
Phylogenetic tree of sequences (classified or unclassified) with best match to the orthomyxovirus-like pHMM. Sequences shorter than 100 a.a. were removed; sequences with >95% identity were then clustered, and only the longest sequence in each cluster was retained as a representative. NCBI accession numbers, virus names, TSA target organism names, and group-representative icons/colors are shown (key at left). Currently defined genera are identified with colored highlighting (key at right). Sequences labeled as “merged” derive from merged overlapping contigs. Sequences labeled as “glued” comprise multiple concatenated ORFs from a contig that was inferred either to have sequence quality issues that introduce stop codons (e.g., via frameshift errors, which are common with 454 sequencing) or to potentially derive from mutated endogenized viral elements (EVEs). Ferret icon has been added but it is a well-known host type.
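The filtering and representative selection described in this caption can be sketched as follows. The caption does not name the clustering tool used for this figure, so the greedy longest-first scheme below is an assumption, as are the function name `pick_representatives` and the caller-supplied `identity` function (any pairwise identity measure returning a value between 0 and 1).

```python
from typing import Callable, Dict, List


def pick_representatives(
    seqs: Dict[str, str],
    identity: Callable[[str, str], float],
    min_len: int = 100,
    threshold: float = 0.95,
) -> List[str]:
    """Return names of the longest sequence from each identity cluster."""
    # Drop sequences shorter than min_len residues.
    kept = {name: s for name, s in seqs.items() if len(s) >= min_len}
    # Visit longest first so each cluster is represented by its longest member.
    ordered = sorted(kept, key=lambda name: len(kept[name]), reverse=True)

    representatives: List[str] = []
    for name in ordered:
        # A sequence joins an existing cluster if its identity to that
        # cluster's representative exceeds the threshold.
        if any(identity(kept[name], kept[rep]) > threshold for rep in representatives):
            continue
        representatives.append(name)
    return representatives
```

With this greedy scheme the result depends on visiting order, which is why sorting by length first matters: a shorter near-duplicate can never displace a longer representative.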

**Fig. 8.**
Splicing in a new rhabdo-like virus sequence. (A) Genome map of the rhabdo-like virus derived from the GEZL01 TSA data set. The diagram illustrates ORFs in the antigenome after removal of the identified introns. The positions of the removed introns are indicated. Putative transcription stop–start (TSS) sequences were identified between the ORFs; the corresponding inferred mRNAs and their products (where identified) are indicated below, as are domains of the L protein. (B) Phylogenetic tree of Mononegavirales L protein sequences showing the placement of the GEZL01-derived rhabdo-like virus. For visual convenience, some clades are collapsed into isosceles triangles. Names of sequences/clades with known splicing are written in a different color (green; the GEZL01 sequence is marked with an arrow). See supplementary figure S14, Supplementary Material online for the complete tree. (C) Sequence logo generated from the three identified copies of the putative TSS sequence (shown in the antigenome sense), using CIAlign v1.1.0 (Tumescheit et al. 2022).
