Microbiome. 2021 Feb 1;9(1):37.

doi: 10.1186/s40168-020-00990-y.

VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses

Jiarong Guo¹, Ben Bolduc¹, Ahmed A Zayed¹, Arvind Varsani²,³, Guillermo Dominguez-Huerta¹, Tom O Delmont⁴, Akbar Adjie Pratama¹, M Consuelo Gazitúa⁵, Dean Vik¹, Matthew B Sullivan⁶,⁷,⁸, Simon Roux⁹

Affiliations

¹ Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA.
² The Biodesign Center for Fundamental and Applied Microbiomics, Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, 85287, USA.
³ Structural Biology Research Unit, Department of Integrative Biomedical Sciences, University of Cape Town, Observatory, Cape Town, 7701, South Africa.
⁴ Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057, Evry, France.
⁵ Viromica, 7870582, Santiago, Chile.
⁶ Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA. sullivan.948@osu.edu.
⁷ Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, OH, 43210, USA. sullivan.948@osu.edu.
⁸ Center of Microbiome Science, Ohio State University, Columbus, OH, 43210, USA. sullivan.948@osu.edu.
⁹ DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA. sroux@lbl.gov.

PMID: 33522966
PMCID: PMC7852108
DOI: 10.1186/s40168-020-00990-y

Jiarong Guo et al. Microbiome. 2021.

Abstract

Background: Viruses are significant players in many biosphere and human ecosystems, but most viral signals remain "hidden" in metagenomic/metatranscriptomic sequence datasets because viruses lack universal gene markers, database representatives are scarce, and identification tools are insufficiently advanced.

Results: Here, we introduce VirSorter2, a DNA and RNA virus identification tool that leverages genome-informed database advances across a collection of customized automatic classifiers to improve the accuracy and range of virus sequence detection. When benchmarked against genomes from both isolated and uncultivated viruses, VirSorter2 was the only tool that performed consistently with high accuracy (F1-score > 0.8) across viral diversity, while all other tools under-detected viruses outside of the group most represented in reference databases (i.e., those in the order Caudovirales). Among the tools evaluated, VirSorter2 was also uniquely able to minimize errors associated with atypical cellular sequences, including eukaryotic genomes and plasmids. Finally, as exploration of the virosphere uncovers novel viral sequences, VirSorter2's modular design makes it inherently able to expand to new types of viruses via the design of new classifiers to maintain maximal sensitivity and specificity.

Conclusion: With its multi-classifier, modular design, VirSorter2 demonstrates higher overall accuracy across major viral groups and will advance our knowledge of virus evolution, diversity, and virus-microbe interactions in various ecosystems. The source code of VirSorter2 is freely available (https://bitbucket.org/MAVERICLab/virsorter2), and VirSorter2 is also available on bioconda and as an iVirus app on CyVerse (https://de.cyverse.org/de).

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Overview of the VirSorter2 framework. Schematic of the viral prediction pipeline used in VirSorter2. “hmmDB” represents databases of HMM profiles including viral HMMs from Xfam (described in the “Methods” section) and viral protein families (VPF) from JGI Earth’s Virome [17], and cellular HMMs (archaeal, bacterial, eukaryotic) as well as “mixed” HMMs (not specific to either virus or cellular organisms) from Pfam [43]. A default cutoff of 30 is used for the HMM searches. “Classifiers” refers to random forest classifiers trained on known viral genomes and cellular genomes from different viral groups (see “Training classifiers” section in “Methods”). The default max score cutoff is set to 0.5
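The final decision step described in this caption (several classifiers, one max score, a default cutoff of 0.5) can be illustrated with a minimal sketch. This is not VirSorter2's own code: the function name, the dictionary layout, the group labels used in the usage note, and the choice of `>=` rather than `>` at the cutoff are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Default max score cutoff as stated in the Fig. 1 caption.
DEFAULT_MAX_SCORE_CUTOFF = 0.5


def call_viral(
    scores: Dict[str, float],
    cutoff: float = DEFAULT_MAX_SCORE_CUTOFF,
) -> Tuple[bool, Optional[str], float]:
    """Return (is_viral, best_group, max_score) for one sequence.

    `scores` maps a (hypothetical) viral-group label to the score that
    group's classifier assigned to the sequence.
    """
    if not scores:
        # No classifier produced a score: treat as non-viral.
        return False, None, 0.0
    best_group = max(scores, key=scores.get)
    max_score = scores[best_group]
    # Assumption: a score exactly at the cutoff counts as viral.
    return max_score >= cutoff, best_group, max_score
```

For example, `call_viral({"groupA": 0.92, "groupB": 0.10})` keeps the sequence and reports `groupA` as the best-scoring group, whereas a sequence whose highest score is 0.3 is rejected under the default cutoff.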

**Fig. 2**
Boxplot of different features across non-viral and viral groups. “Nonviral” includes bacteria and archaea, fungi and protozoa, and plasmids. A subset of 100 random genome fragments were used for each group. “% of viral gene” is calculated as the percent of genes annotated as viral (best hit to viral HMMs) of all genes; “% of bacterial gene” is calculated as the percent of genes annotated as bacterial (best hit to bacterial HMMs) of all genes; “Strand switch frequency” is the percent of genes located on a different strand from the upstream gene (scanning from 5′ to 3′ in the + strand); “Gene density” is the average number of genes in every 1000 bp sequence (total number of genes divided by contig length and then multiplied by 1000); “Average GC content of genes” is the mean of GC content of all genes in a contig; “TATATA_3-6 motif frequency” is the percent of ribosomal binding sites (RBS) with “TATATA_3-6” motif
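The feature definitions in this caption translate fairly directly into code. The following is a minimal sketch rather than VirSorter2's implementation: the `Gene` record and the function names are hypothetical, genes are assumed to be listed in 5′ to 3′ order along the + strand, and using the total gene count as the denominator for strand switching is an assumption (the first gene has no upstream neighbor and is never counted as a switch).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gene:
    strand: str              # "+" or "-"
    best_hit: Optional[str]  # e.g. "viral", "bacterial", or None if no HMM hit
    gc: float                # GC fraction of this gene, between 0 and 1


def percent_with_hit(genes: List[Gene], category: str) -> float:
    """Percent of all genes whose best HMM hit falls in `category`."""
    if not genes:
        return 0.0
    return 100.0 * sum(g.best_hit == category for g in genes) / len(genes)


def strand_switch_frequency(genes: List[Gene]) -> float:
    """Percent of genes on a different strand from the upstream gene."""
    if not genes:
        return 0.0
    switches = sum(prev.strand != cur.strand for prev, cur in zip(genes, genes[1:]))
    return 100.0 * switches / len(genes)


def gene_density(genes: List[Gene], contig_length: int) -> float:
    """Average number of genes per 1000 bp of contig."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return len(genes) * 1000.0 / contig_length


def average_gene_gc(genes: List[Gene]) -> float:
    """Mean GC fraction over all genes in the contig."""
    if not genes:
        return 0.0
    return sum(g.gc for g in genes) / len(genes)
```

With four genes on strands `+ - - +` in a 4000 bp contig, two genes sit on a different strand from their upstream neighbor, so the strand switch frequency under these assumptions is 50% and the gene density is 1 gene per 1000 bp.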

**Fig. 3**
Tool performances on dsDNA phages from different data sources. VirSorter2 consistently has comparable or better performance than existing tools in identifying dsDNA phages. Genome fragments of different lengths (x-axis) are generated from genomes in the order *Caudovirales* in NCBI Viral RefSeq (a), proviruses extracted from microbial genomes in NCBI RefSeq (b) [48], and other sources (c) [15, 16]. An equal number (50) of viral and non-viral (archaea and bacteria, fungi and protozoa, and plasmids) genome fragments were combined as an input for the tested tools. Error bars show 95% confidence intervals over five replicates (100 sequences each as described above). F1 score is used as the metric (y-axis) to compare tools, while detailed recall and precision results are available in Figs. S1 and S2. The dotted line is y = 0.8
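The F1 score used as the benchmark metric is the harmonic mean of precision and recall. A minimal sketch of the standard formula follows; the function name and the counts in the usage note are hypothetical and are not results from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts.

    true_pos:  viral fragments correctly called viral
    false_pos: non-viral fragments called viral
    false_neg: viral fragments that were missed
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    if predicted == 0 or actual == 0:
        # Precision or recall is undefined; report 0 by convention.
        return 0.0
    precision = true_pos / predicted
    recall = true_pos / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In a balanced test set like the one described (50 viral and 50 non-viral fragments), a tool that recovers 45 viral fragments while miscalling 5 non-viral ones would have precision and recall of 0.9 each, which puts its F1 score above the 0.8 reference line.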

**Fig. 4**
Tool performances on different viral groups (other than dsDNA phage) from different data sources. VirSorter2 consistently outperforms existing tools in identifying viral groups outside dsDNA phages. Genome fragments of different lengths (x-axis) are generated from NCBI RefSeq (“refseq”) genomes in each viral group and other sources (“non-refseq”) [, –47, 53]. “RNA-non-refseq*” is a collection of ssRNA phage genomes [45]. An equal number (50) of viral and non-viral (archaea and bacteria, fungi and protozoa, and plasmids) genome fragments were combined as an input for the tested tools. F1 score is used as the metric (y-axis) to compare tools. The dotted horizontal line is y = 0.8. *vs2* VirSorter2, *vs1* VirSorter, *vf* VirFinder, *dvf* DeepVirFinder, *mv* MARVEL, *vb* VIBRANT

**Fig. 5**
False positives comparison of tools on eukaryotes and plasmids. Genome fragments (50) of different lengths (x-axis) were generated from eukaryotic genomes (fungi and protozoa) in NCBI RefSeq, and plasmids. Percent of genome fragments classified as viral is used as the metric (y-axis) to compare tools. *vs2* VirSorter2, *vs1* VirSorter, *vf* VirFinder, *dvf* DeepVirFinder, *mv* MARVEL, *vb* VIBRANT

References

1. Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320:1034–1039. doi: 10.1126/science.1153213.
2. Fierer N. Embracing the unknown: disentangling the complexities of the soil microbiome. Nat Rev Microbiol. 2017;15:579–590. doi: 10.1038/nrmicro.2017.87.
3. Sonnenburg ED, Sonnenburg JL. The ancestral and industrialized gut microbiota and implications for human health. Nat Rev Microbiol. 2019;17:383–390. doi: 10.1038/s41579-019-0191-8.
4. Wang J, Jia H. Metagenome-wide association studies: fine-mining the microbiome. Nat Rev Microbiol. 2016;14:508–522. doi: 10.1038/nrmicro.2016.83.
5. Quince C, Walker AW, Simpson JT, Loman NJ, Segata N. Shotgun metagenomics, from sampling to analysis. Nat Biotechnol. 2017;35:833–844. doi: 10.1038/nbt.3935.
