Microbiome. 2023 Apr 21;11(1):84.
doi: 10.1186/s40168-023-01533-x.

Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data

Siu Fung Stanley Ho et al. Microbiome.

Abstract

Background: The prediction of bacteriophage sequences in metagenomic datasets has become a topic of considerable interest, leading to the development of many novel bioinformatic tools. A comparative analysis of ten state-of-the-art phage identification tools was performed to inform their usage in microbiome research.

Methods: Artificial contigs generated from complete RefSeq genomes representing phages, plasmids, and chromosomes, and a previously sequenced mock community containing four phage species, were used to evaluate the precision, recall, and F1 scores of the tools. We also generated a dataset of randomly shuffled sequences to quantify false-positive calls. In addition, a set of previously simulated viromes was used to assess diversity bias in each tool's output.
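The three metrics named in the Methods follow from counts of true positives, false positives, and false negatives. The following is a minimal sketch using the standard definitions; the function name and the example counts are illustrative and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    tp: phage contigs correctly called viral
    fp: non-phage contigs wrongly called viral
    fn: phage contigs the tool missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, a tool cannot score well by maximising only one of precision or recall.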

Results: VIBRANT and VirSorter2 achieved the highest F1 scores (0.93) in the RefSeq artificial contigs dataset, with several other tools also performing well. Kraken2 had the highest F1 score (0.86) in the mock community benchmark by a large margin (0.3 higher than DeepVirFinder in second place), mainly due to its high precision (0.96). Generally, k-mer-based tools performed better than reference similarity tools and gene-based methods. Several tools, most notably PPR-Meta, called a high number of false positives in the randomly shuffled sequences. When analysing the diversity of the genomes that each tool predicted from a virome set, most tools produced a viral genome set that had similar alpha- and beta-diversity patterns to the original population, with Seeker being a notable exception.

Conclusions: This study provides key metrics used to assess performance of phage detection tools, offers a framework for further comparison of additional viral discovery tools, and discusses optimal strategies for using these tools. We highlight that the choice of tool for identification of phages in metagenomic datasets, as well as its parameters, can bias the results and provide pointers for different use case scenarios. We have also made our benchmarking dataset available for download in order to facilitate future comparisons of phage identification tools.

Keywords: Bacteriophage; Benchmarking; Machine learning; Metagenome; Microbiome; Phage.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Overview of RefSeq benchmarking workflow. All bacterial and archaeal chromosomes and plasmids and phage genomes that were deposited in the RefSeq database between 1 January 2020 and 12 August 2021 inclusive were downloaded. The phage genomes were used to create a positive test set and the chromosomes and plasmids a negative set. The sequences were dereplicated against the training sets of each machine/deep learning tool that was benchmarked (highlighted in red), as well as any RefSeq sequences deposited prior to 2020. The negative set was downsampled to produce a positive:negative ratio of approximately 1:19 to replicate a typical gut microbiome. Prophages were identified and removed with Phigaro and PhageBoost. Any host sequences with more than 30% of open reading frames having hits to the Prokaryotic Virus Orthologous Groups database were then removed. All sequences were then uniformly fragmented into artificial contigs with lengths between 1 and 15 kbp. All identification tools were then run on the artificial contig sets.
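The fragmentation and downsampling steps in this workflow can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: contig lengths are drawn uniformly between 1 and 15 kbp, any trailing remainder shorter than 1 kbp is discarded, and the 1:19 ratio is applied by sequence count (the caption does not say whether the ratio is by count or by base pairs). Both function names are hypothetical.

```python
import random


def fragment_genome(seq, min_len=1_000, max_len=15_000, rng=None):
    """Cut a sequence into consecutive contigs with uniformly drawn lengths.

    Assumption: a trailing remainder shorter than min_len is dropped.
    """
    rng = rng or random.Random()
    contigs = []
    pos = 0
    while len(seq) - pos >= min_len:
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        contig = seq[pos:pos + length]          # final contig may be cut short
        contigs.append(contig)
        pos += len(contig)
    return contigs


def downsample_negatives(negatives, n_positives, ratio=19, rng=None):
    """Randomly keep about `ratio` negatives per positive (by count, an assumption)."""
    rng = rng or random.Random()
    k = min(len(negatives), n_positives * ratio)
    return rng.sample(negatives, k)


if __name__ == "__main__":
    rng = random.Random(42)
    toy_genome = "ACGT" * 10_000  # 40 kbp toy sequence, not real data
    contigs = fragment_genome(toy_genome, rng=rng)
    print([len(c) for c in contigs])
```

Passing a seeded `random.Random` instance keeps the fragmentation reproducible between runs.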
Fig. 2
Comparison of viral identification tools on artificial RefSeq contigs. Contigs were generated by randomly fragmenting complete bacterial/archaeal/phage genomes and plasmids deposited in the NCBI Reference Sequence Database (RefSeq) between 1 January 2018 and 2 July 2020, to a uniform distribution. Each tool was then separately run on the true positive (phage genome fragments) and negative (bacterial/archaeal chromosome and plasmid fragments) datasets. For tools whose score/probability thresholds or categories could be manually adjusted, values/categories were selected based on optimal F1 scores.
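Selecting a threshold "based on optimal F1 scores" can be illustrated with a simple sweep over candidate cut-offs. The sketch below assumes each tool emits one score per contig and that a contig is called viral when its score is at or above the threshold; the function name, the candidate grid, and the toy scores are all hypothetical.

```python
def best_f1_threshold(pos_scores, neg_scores, thresholds):
    """Return (threshold, F1) maximising F1 over the candidate thresholds.

    pos_scores: tool scores for true phage contigs
    neg_scores: tool scores for chromosome/plasmid contigs
    Ties are resolved in favour of the first threshold tried.
    """
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        tp = sum(s >= t for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= t for s in neg_scores)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # equivalent to 2PR / (P + R)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy scores for illustration only.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(best_f1_threshold(pos, neg, grid))
```

A threshold tuned this way on the benchmark set is optimistic for that set, so it is worth checking it against an independent dataset such as a mock community.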
Fig. 3
Comparison of viral identification tools on uneven mock community samples. Mock community reads were retrieved from a previous study [56] and assembled with metaSPAdes. Prophages were detected and removed with Phigaro and PhageBoost before running each identification tool using optimal thresholds based on previous benchmarks, except for viralVerify and VirSorter. F1 score, precision and recall metrics are displayed as separate panels. Each sample is plotted as a single point for each tool, with a boxplot indicating the interquartile ranges, extremes and mean of all three samples.
Fig. 4
Estimation of diversity metrics of tool-predicted virome populations. To assess the impact of each tool on population diversity, four simulated virome assemblies from Roux et al. [57] were downloaded. Each programme was then run to determine the subset of predicted viral contigs. Reads were mapped to these contig subsets, and the mapped reads were subsequently mapped to a pool of population contigs. All diversity metrics were computed with the R package “vegan”. “Default” in each plot indicates each sample’s original assembly. A Number of genomes observed from read mapping to predicted viral contig populations for each tool. B Comparison of estimated Shannon diversity indices from each tool’s virome subset. Estimations are based on read counts that were normalised by contig size and sequencing depth of the virome. C Comparison of Simpson diversity indices from each tool’s virome subset. D Nonmetric multidimensional scaling (NMDS) ordination plot of Bray-Curtis dissimilarity of virome subsets predicted by each viral identification tool. Ellipses indicate the 95% confidence interval for each sample cluster’s centroid. Samples are represented by the same symbol and ellipse line type; tools are denoted by colour.
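The study computed these metrics with the R package vegan. For readers who want to see the arithmetic, the plain-Python sketch below uses the textbook definitions: Shannon H = -Σ p ln p, Simpson 1 - Σ p², and Bray-Curtis Σ|a - b| / Σ(a + b). The normalisation helper is an assumption, since the caption does not give the exact formula; it scales counts to reads per kbp of contig per million reads.

```python
import math


def normalise_counts(read_counts, contig_lengths, total_reads):
    """Assumed normalisation: reads per kbp of contig per million sequenced reads."""
    return [
        c / (length / 1_000) / (total_reads / 1_000_000)
        for c, length in zip(read_counts, contig_lengths)
    ]


def shannon(abundances):
    """Shannon index H = -sum(p * ln p), using the natural logarithm."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


def simpson(abundances):
    """Simpson index in the 1 - sum(p^2) form."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return 1.0 - sum((a / total) ** 2 for a in abundances)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


# Toy abundance vectors, for illustration only.
original = [10.0, 10.0, 10.0, 10.0]
predicted = [10.0, 10.0, 0.0, 0.0]  # a tool that misses half the genomes
print(shannon(original), shannon(predicted), bray_curtis(original, predicted))
```

In the toy example the predicted subset has lower Shannon diversity than the original and a non-zero Bray-Curtis dissimilarity from it, which is the kind of bias the figure is designed to expose.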
Fig. 5
Comparison of tool runtimes on the positive RefSeq artificial contig set. Wall runtime of each tool on mock community samples was recorded on a Linux high-performance cluster with 16 vCPUs and 108 GB of RAM, without GPU acceleration. Tools were run with 16 threads where the thread count could be set as a parameter (all tools except MetaPhinder, PPR-Meta, and Seeker). The RefSeq-positive set contains approximately 53.4 million bp.

