PLoS Biol. 2023 Apr 21;21(4):e3002083.

doi: 10.1371/journal.pbio.3002083. eCollection 2023 Apr.

iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria

Simon Roux¹, Antonio Pedro Camargo¹, Felipe H Coutinho², Shareef M Dabdoub³, Bas E Dutilh⁴,⁵, Stephen Nayfach¹, Andrew Tritt⁶

Affiliations

¹ DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America.
² Instituto de Ciencias del Mar (ICM-CSIC), Barcelona, Spain.
³ Division of Biostatistics and Computational Biology, University of Iowa College of Dentistry, Iowa City, Iowa, United States of America.
⁴ Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University, Jena, Germany.
⁵ Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Utrecht, the Netherlands.
⁶ Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America.

PMID: 37083735
PMCID: PMC10155999
DOI: 10.1371/journal.pbio.3002083

Simon Roux et al. PLoS Biol. 2023.

Abstract

The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived sequences lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here, we describe iPHoP, a two-step framework that integrates multiple methods to reliably predict host taxonomy at the genus rank for a broad range of viruses infecting bacteria and archaea, while retaining a low false discovery rate. Based on a large dataset of metagenome-derived virus genomes from the IMG/VR database, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses.

Copyright: © 2023 Roux et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Comparison of different host prediction approaches on a single test dataset.**
(A) Total number of predictions and number of correct predictions (y-axis) obtained for each tool (x-axis) using a “best hit” approach and relaxed cutoffs (see Methods) on sequences from the test dataset (S2 Table). For each tool, the number of correct predictions is indicated by the colored bar, while the total number of predictions is indicated by the gray bar. Similar plots including the whole test dataset, virulent phages only, and temperate phages only are available in S2 Fig. (B) Precision-Recall curves for the different tools, using the same color code as in panels A and C. Two standard thresholds, 5% and 10% false discovery rates, are indicated by horizontal dashed lines. (C) Relationship between “novelty” of input virus, represented as AAI (average amino acid identity) percentage to the closest NCBI RefSeq reference on the x-axis, and the number of correct host predictions obtained with each tool. To evenly represent both “known” and “novel” input viruses, 300 sequences were randomly subsampled from each AAI percentage category (x-axis). (D) Schematic overview of iPHoP host prediction pipeline. Detailed explanations of the new steps 2 and 3 are available in Figs 2 and 3, and source data in S1 Data.
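
The precision-recall comparison in panel B, with its 5% and 10% false discovery rate thresholds, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes one best-hit prediction per virus, defines recall as correct predictions divided by all viruses in the test set, and takes FDR as 1 minus precision. The function names and the toy data are hypothetical.

```python
from typing import List, Tuple


def precision_recall_points(
    predictions: List[Tuple[float, bool]], n_viruses: int
) -> List[Tuple[float, float, float]]:
    """Return (score_cutoff, recall, precision) for each distinct score cutoff.

    predictions: one (score, is_correct) pair per virus with a best-hit prediction.
    n_viruses: total viruses in the test set, including those without a prediction
               (assumption: recall is measured against this total).
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    points = []
    correct = 0
    for i, (score, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Emit a point only after the last prediction sharing this score.
        if i == len(ranked) or ranked[i][0] != score:
            points.append((score, correct / n_viruses, correct / i))
    return points


def max_recall_at_fdr(
    points: List[Tuple[float, float, float]], max_fdr: float
) -> float:
    """Highest recall among cutoffs whose FDR (1 - precision) is within max_fdr."""
    eligible = [recall for _, recall, precision in points
                if 1.0 - precision <= max_fdr]
    return max(eligible, default=0.0)


# Toy data (hypothetical): 5 predictions for a test set of 10 viruses.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
curve = precision_recall_points(toy, n_viruses=10)
recall_at_10pct_fdr = max_recall_at_fdr(curve, 0.10)
```

Counting viruses without any prediction in the recall denominator means a tool that predicts rarely but accurately is still penalized, which matches the precision versus recall trade-off described in the abstract.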

**Fig 2. Overview of the single-tool classifiers used in iPHoP.**
(A) Schematic representation of the process used to score individual hits from host-based tools. Briefly, each hit was scored by a neural network or random forest classifier, which also considered other top hits for the same virus and the same tool. This process was applied to the 5 host-based tools selected (“Blast,” “CRISPR,” “WIsH,” “VHM,” “PHP”), except for the random forest classifiers (highlighted with a *), which were only used for “Blast” and “CRISPR.” When considering multiple hits, their similarity or difference in terms of host prediction was estimated from the GTDB phylogenies [34]. (B) Illustration of how multiple hits are represented in neural networks input matrices (top) or random forest classifier inputs (bottom). Two examples are provided, one “reliable” in which the hits with high scores are all consistent and at a small distance to the candidate host considered (left), and the other “unreliable” in which a few hits with medium-to-high scores are scattered across hosts with variable distance to the candidate host considered. (C) Estimated improvement in classification provided by the automated classifiers compared to “naive” raw scores. These estimations are based on smoothed ROC curves obtained from the test dataset (see S6 Fig) and calculated as the average decrease in false discovery rate for 17 true positive rates ranging from 10% to 90%. Random forest classifiers were only evaluated for Blast and CRISPR approaches. (D) Precision Recall curves for the 2 classifiers selected for each host-based tool (see S4 Table). Conv, “Convolutional Neural Network”; “RF”, “random forest classifier”; VHM, “VirHostMatcher.” Source data are available in S1 Data.
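
The improvement metric in panel C (average decrease in false discovery rate across 17 true positive rates from 10% to 90%) can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: the 17 rates are taken to be evenly spaced in 5% steps, each curve is supplied as (true positive rate, FDR) pairs sorted by true positive rate, and values between points are linearly interpolated. All names are hypothetical.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float]]  # (true positive rate, FDR), sorted by TPR


def interpolate(x: float, xs: List[float], ys: List[float]) -> float:
    """Linear interpolation of y at x; values outside the range are clamped."""
    if x <= xs[0]:
        return ys[0]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            x0, x1 = xs[i - 1], xs[i]
            y0, y1 = ys[i - 1], ys[i]
            if x1 == x0:
                return y1
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return ys[-1]


def average_fdr_decrease(raw_curve: Curve, classifier_curve: Curve) -> float:
    """Mean of (raw FDR - classifier FDR) over 17 TPRs from 10% to 90%.

    Assumption: the 17 rates are evenly spaced (10%, 15%, ..., 90%).
    """
    tprs = [(10 + 5 * k) / 100 for k in range(17)]
    raw_x, raw_y = zip(*raw_curve)
    clf_x, clf_y = zip(*classifier_curve)
    decreases = [
        interpolate(t, list(raw_x), list(raw_y))
        - interpolate(t, list(clf_x), list(clf_y))
        for t in tprs
    ]
    return sum(decreases) / len(decreases)


# Toy curves (hypothetical): FDR grows linearly with TPR for both scorings.
raw = [(0.0, 0.0), (1.0, 0.5)]
classifier = [(0.0, 0.0), (1.0, 0.3)]
improvement = average_fdr_decrease(raw, classifier)
```

A positive value means the classifier reaches the same true positive rates with a lower FDR than the raw scores.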

**Fig 3. Schematic and performance of iPHoP host genus predictions.**
(A) Schematic representation of the integration process. “Individual classifiers” refer to single-tools scores calculated for each virus–candidate host pair (see Fig 2). (B) Precision Recall curve for each of the 4 scores considered in iPHoP composite score, based on the test dataset. (C) Comparison of the percentage of input sequences from the test dataset for which a correct host genus prediction was obtained, when using cutoffs limiting FDR to 20% maximum. This percentage is given for all sequences in the test dataset (“All”), and for subsets of sequences defined based on their amino acid similarity to the closest reference phage genome in the NCBI RefSeq or RaFAH database. AAI, amino acid identity; FDR, false discovery rate; RF, random forest. Source data are available in S1 Data.

**Fig 4. Overview of iPHoP host prediction for high-quality IMG/VR v3 genomes.**
(A) Distribution of the best iPHoP score for high-quality genomes from the IMG/VR v3 database by ecosystem. For each IMG/VR vOTU, the best score from iPHoP was considered if ≥75, or the vOTU was considered as not having a predicted host. The proportion of sequences for which a host prediction was available in the original IMG/VR database is indicated with a dashed red line. (B) Distribution of the type of signal used to achieve host prediction with a score ≥90 in iPHoP. “Host-based” includes all 5 host-based tools, while “Phage-based” includes predictions obtained with RaFAH. “Both” includes consistent predictions obtained with RaFAH and at least 1 host-based tool. (C) Percentage of hits from isolated or uncultivated host genomes used in host-based predictions with iPHoP scores ≥90. These are based on the individual genome hits underlying iPHoP genus-level predictions. (D) Origin of the uncultivated host genomes used in host-based predictions with iPHoP scores ≥90. The original dataset and study ID for the query virus and the uncultivated host genome were obtained from the Gold database, and when both were available, these were compared to evaluate whether the uncultivated host genome originated from the same dataset, a different dataset from the same study, or another study from the query virus. Source data are available in S1 Data.
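
The rule in panel A (keep the best iPHoP score per vOTU if it is ≥75, otherwise treat the vOTU as having no predicted host) can be written as a short Python sketch. The input layout, function name, and example records are hypothetical and do not reflect the actual iPHoP output format.

```python
from typing import Dict, Iterable, Optional, Tuple


def best_host_per_votu(
    predictions: Iterable[Tuple[str, str, float]], min_score: float = 75.0
) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each vOTU to its best (host_genus, score), or None below the cutoff.

    predictions: (votu_id, host_genus, score) records; this layout is an
                 assumption made for illustration.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for votu, genus, score in predictions:
        # Keep only the highest-scoring prediction seen so far for this vOTU.
        if votu not in best or score > best[votu][1]:
            best[votu] = (genus, score)
    return {
        votu: (hit if hit[1] >= min_score else None)
        for votu, hit in best.items()
    }


# Hypothetical records for two vOTUs.
records = [
    ("vOTU_1", "Bacteroides", 92.1),
    ("vOTU_1", "Prevotella", 80.0),
    ("vOTU_2", "Escherichia", 60.5),
]
hosts = best_host_per_votu(records)
```

Raising `min_score` to 90 would reproduce the stricter cutoff used in panels B to D.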

**Fig 5. Taxonomic and environmental distribution of hosts predicted using iPHoP from the IMG/VR v3 genomes.**
(A) Archaeal (top left) and bacterial (bottom right) genome diversity from the GTDB database r202 [34]. The GTDB phylogenetic trees were collapsed at the phylum level. The status of virus association, i.e., isolated virus, predicted virus only at iPHoP score ≥95 or ≥90, or no prediction, was evaluated for each host genus, and the phyla shapes are colored according to the number of genera in each category within this phylum. (B) For each major biome type, the 10 host genera with the highest number of predicted IMG/VR high-quality virus genomes are included in the plot. Each host genus was also determined to be mainly detected in a biome type or detected across multiple biomes based on the distribution of MAGs assigned to this genus across ecosystems in the GEM catalog (see Methods). Source data are available in S1 Data.

Cited by

Bacteriophages from treatment-naïve type 2 diabetes individuals drive an inflammatory response in human co-cultures of dendritic cells and T cells.
Scheithauer TPM, Wortelboer K, Winkelmeijer M, Verdoes X, Aydin Ö, Acherman YIZ, de Brauw ML, Nieuwdorp M, Rampanelli E, de Jonge PA, Herrema H. Gut Microbes. 2024 Jan-Dec;16(1):2380747. doi: 10.1080/19490976.2024.2380747. Epub 2024 Jul 27. PMID: 39068518
Dispersal, habitat filtering, and eco-evolutionary dynamics as drivers of local and global wetland viral biogeography.
Ter Horst AM, Fudyma JD, Sones JL, Emerson JB. ISME J. 2023 Nov;17(11):2079-2089. doi: 10.1038/s41396-023-01516-8. Epub 2023 Sep 21. PMID: 37735616
zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters.
Salamzade R, Tran PQ, Martin C, Manson AL, Gilmore MS, Earl AM, Anantharaman K, Kalan LR. Nucleic Acids Res. 2025 Jan 24;53(3):gkaf045. doi: 10.1093/nar/gkaf045. PMID: 39907107
Isolation and characterization of a roseophage representing a novel genus in the N4-like Rhodovirinae subfamily distributed in estuarine waters.
Huang X, Yu C, Lu L. BMC Genomics. 2025 Mar 25;26(1):295. doi: 10.1186/s12864-025-11463-7. PMID: 40133813
Engrafting gut bacteriophages have potential to modulate microbial metabolism in fecal microbiota transplantation.
Ji S, Ahmad F, Peng B, Yang Y, Su M, Zhao X, Vatanen T. Microbiome. 2025 Jun 20;13(1):149. doi: 10.1186/s40168-025-02046-5. PMID: 40542451

References

1. Fernández L, Rodríguez A, García P. Phage or foe: An insight into the impact of viral predation on microbial communities. ISME J. 2018;12:1171–1179. doi: 10.1038/s41396-018-0049-5
2. Correa AMS, Howard-Varona C, Coy SR, Buchan A, Sullivan MB, Weitz JS. Revisiting the rules of life for viruses of microorganisms. Nat Rev Microbiol. 2021;0123456789:1–13. doi: 10.1038/s41579-021-00530-x
3. Abeles SR, Pride DT. Molecular bases and role of viruses in the human microbiome. J Mol Biol. 2014;426:3892–3906. doi: 10.1016/j.jmb.2014.07.002
4. Roux S, Adriaenssens EM, Dutilh BE, Koonin E V., Kropinski AM, Krupovic M, et al. Minimum information about an uncultivated virus genome (MIUVIG). Nat Biotechnol. 2019;37:29–37. doi: 10.1038/nbt.4306
5. Taş N, de Jong AE, Li Y, Trubl G, Xue Y, Dove NC. Metagenomic tools in microbial ecology research. Curr Opin Biotechnol. 2021;67:184–191. doi: 10.1016/j.copbio.2021.01.019
