Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods
- PMID: 38376167
- PMCID: PMC10949488
- DOI: 10.1128/msystems.01105-23
Abstract
Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, Padj ≥ 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%-46%) than in cellular metagenomes (7%-19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets.
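The accuracy metric reported above, the Matthews Correlation Coefficient, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `matthews_corrcoef_from_counts` and the example counts are hypothetical, chosen only to show how MCC is computed from a confusion matrix of viral versus non-viral calls on a labeled mock metagenome.

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the Matthews Correlation Coefficient for a binary classifier.

    tp: viral sequences called viral      fn: viral sequences missed
    tn: non-viral sequences rejected      fp: non-viral sequences called viral
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention (an assumption here): return 0 when any marginal is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for one ruleset on one mock metagenome.
    score = matthews_corrcoef_from_counts(tp=90, tn=880, fp=20, fn=10)
    print(f"MCC = {score:.2f}")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it remains informative when viral sequences are a small minority of a metagenome, a situation in which plain accuracy can look high even for a tool that recovers few viruses.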
Importance: The identification of viruses from environmental metagenomes using informatics tools has offered critical insights in microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies are increasingly combining the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should only be combined cautiously. Combinations of two to four tools maximized viral recovery and minimized non-viral contamination compared with either single-tool rulesets or five- to six-tool combinations. By providing a rigorous overview of the behavior of in silico viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
Keywords: bacteriophages; metagenomics; microbial ecology; viral discovery.
Conflict of interest statement
The authors declare no conflict of interest.