PLoS Biol. 2023 Apr 21;21(4):e3002083.
doi: 10.1371/journal.pbio.3002083. eCollection 2023 Apr.

iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria

Simon Roux et al. PLoS Biol.

Abstract

The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived sequences lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here, we describe iPHoP, a two-step framework that integrates multiple methods to reliably predict host taxonomy at the genus rank for a broad range of viruses infecting bacteria and archaea, while retaining a low false discovery rate. Based on a large dataset of metagenome-derived virus genomes from the IMG/VR database, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Comparison of different host prediction approaches on a single test dataset.
(A) Total number of predictions and number of correct predictions (y-axis) obtained for each tool (x-axis) using a “best hit” approach and relaxed cutoffs (see Methods) on sequences from the test dataset (S2 Table). For each tool, the number of correct predictions is indicated by the colored bar, while the total number of predictions is indicated by the gray bar. Similar plots including the whole test dataset, virulent phages only, and temperate phages only are available in S2 Fig. (B) Precision-Recall curves for the different tools, using the same color code as in panels A and C. Two standard thresholds, 5% and 10% false discovery rates, are indicated by horizontal dashed lines. (C) Relationship between “novelty” of input virus, represented as AAI (average amino acid identity) percentage to the closest NCBI RefSeq reference on the x-axis, and the number of correct host predictions obtained with each tool. To evenly represent both “known” and “novel” input viruses, 300 sequences were randomly subsampled from each AAI percentage category (x-axis). (D) Schematic overview of the iPHoP host prediction pipeline. Detailed explanations of the new steps 2 and 3 are available in Figs 2 and 3, and source data in S1 Data.
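The counts in panel A and the curves in panel B relate to each other through a few simple ratios. The sketch below is illustrative only and is not the authors' evaluation code: the function name `best_hit_metrics` and the dictionary inputs are hypothetical, and it assumes that recall is the fraction of all test viruses with a correct prediction and that the false discovery rate equals 1 minus precision.

```python
def best_hit_metrics(predicted, truth):
    """Summarize best-hit host predictions for one tool.

    predicted: dict mapping virus ID -> predicted host genus (None if no prediction).
    truth:     dict mapping virus ID -> known host genus.
    Both structures are hypothetical stand-ins for a tool's output and a test set.
    """
    # Keep only viruses for which the tool returned a prediction.
    made = {v: g for v, g in predicted.items() if g is not None}
    n_pred = len(made)
    n_correct = sum(1 for v, g in made.items() if truth.get(v) == g)
    n_total = len(truth)

    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_total if n_total else 0.0   # assumed definition
    fdr = 1.0 - precision if n_pred else 0.0           # assumed definition
    return {
        "n_predictions": n_pred,
        "n_correct": n_correct,
        "precision": precision,
        "recall": recall,
        "fdr": fdr,
    }


# Toy example: 4 test viruses, 3 predictions, 2 of them correct.
truth = {"v1": "Escherichia", "v2": "Bacillus", "v3": "Vibrio", "v4": "Pseudomonas"}
predicted = {"v1": "Escherichia", "v2": "Bacillus", "v3": "Salmonella", "v4": None}
metrics = best_hit_metrics(predicted, truth)
```

Under these assumptions, a tool sits below the 10% false discovery rate line in panel B whenever `metrics["fdr"] <= 0.10`.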
Fig 2. Overview of the single-tool classifiers used in iPHoP.
(A) Schematic representation of the process used to score individual hits from host-based tools. Briefly, each hit was scored by a neural network or random forest classifier, which also considered other top hits for the same virus and the same tool. This process was applied to the 5 host-based tools selected (“Blast,” “CRISPR,” “WIsH,” “VHM,” “PHP”), except for the random forest classifiers (highlighted with a *), which were only used for “Blast” and “CRISPR.” When considering multiple hits, their similarity or difference in terms of host prediction was estimated from the GTDB phylogenies [34]. (B) Illustration of how multiple hits are represented in neural network input matrices (top) or random forest classifier inputs (bottom). Two examples are provided, one “reliable” in which the hits with high scores are all consistent and at a small distance to the candidate host considered (left), and the other “unreliable” in which a few hits with medium-to-high scores are scattered across hosts with variable distance to the candidate host considered. (C) Estimated improvement in classification provided by the automated classifiers compared to “naive” raw scores. These estimates are based on smoothed ROC curves obtained from the test dataset (see S6 Fig) and are calculated as the average decrease in false discovery rate for 17 true positive rates ranging from 10% to 90%. Random forest classifiers were only evaluated for the Blast and CRISPR approaches. (D) Precision-Recall curves for the 2 classifiers selected for each host-based tool (see S4 Table). Conv, convolutional neural network; RF, random forest classifier; VHM, VirHostMatcher. Source data are available in S1 Data.
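The improvement metric in panel C can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: it assumes each curve is available as a list of (true positive rate, false discovery rate) points sorted by true positive rate, that values between points are obtained by linear interpolation, and that the 17 true positive rates are evenly spaced in 5% steps (an inference from 17 values spanning 10% to 90%). The names `TPR_GRID`, `interpolate`, and `average_fdr_decrease` are illustrative.

```python
# 17 evenly spaced true positive rates: 0.10, 0.15, ..., 0.90 (assumed spacing).
TPR_GRID = [p / 100 for p in range(10, 91, 5)]


def interpolate(curve, x):
    """Linearly interpolate y at x on a curve of (x, y) points sorted by x."""
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                return y0
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError(f"x={x} lies outside the range covered by the curve")


def average_fdr_decrease(raw_curve, classifier_curve, tpr_grid=TPR_GRID):
    """Mean of (FDR with raw scores - FDR with classifier scores) over the TPR grid.

    A positive value means the classifier lowers the false discovery rate,
    on average, at matched true positive rates.
    """
    diffs = [
        interpolate(raw_curve, t) - interpolate(classifier_curve, t)
        for t in tpr_grid
    ]
    return sum(diffs) / len(diffs)


# Toy curves: FDR grows linearly with TPR, twice as fast for raw scores.
raw_curve = [(0.0, 0.0), (1.0, 0.5)]
classifier_curve = [(0.0, 0.0), (1.0, 0.25)]
improvement = average_fdr_decrease(raw_curve, classifier_curve)
```

Averaging over a fixed grid of true positive rates compares the two scoring schemes at matched sensitivity, so a single number summarizes the gap between the curves.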
Fig 3. Schematic and performance of iPHoP host genus predictions.
(A) Schematic representation of the integration process. “Individual classifiers” refer to single-tool scores calculated for each virus–candidate host pair (see Fig 2). (B) Precision-Recall curve for each of the 4 scores considered in the iPHoP composite score, based on the test dataset. (C) Comparison of the percentage of input sequences from the test dataset for which a correct host genus prediction was obtained, when using cutoffs limiting FDR to 20% maximum. This percentage is given for all sequences in the test dataset (“All”), and for subsets of sequences defined based on their amino acid similarity to the closest reference phage genome in the NCBI RefSeq or RaFAH database. AAI, amino acid identity; FDR, false discovery rate; RF, random forest. Source data are available in S1 Data.
Fig 4. Overview of iPHoP host prediction for high-quality IMG/VR v3 genomes.
(A) Distribution of the best iPHoP score for high-quality genomes from the IMG/VR v3 database by ecosystem. For each IMG/VR vOTU, the best score from iPHoP was considered if ≥75; otherwise, the vOTU was considered as not having a predicted host. The proportion of sequences for which a host prediction was available in the original IMG/VR database is indicated with a dashed red line. (B) Distribution of the type of signal used to achieve host prediction with a score ≥90 in iPHoP. “Host-based” includes all 5 host-based tools, while “Phage-based” includes predictions obtained with RaFAH. “Both” includes consistent predictions obtained with RaFAH and at least 1 host-based tool. (C) Percentage of hits from isolated or uncultivated host genomes used in host-based predictions with iPHoP scores ≥90. These are based on the individual genome hits underlying iPHoP genus-level predictions. (D) Origin of the uncultivated host genomes used in host-based predictions with iPHoP scores ≥90. The original dataset and study ID for the query virus and the uncultivated host genome were obtained from the GOLD database, and when both were available, these were compared to evaluate whether the uncultivated host genome originated from the same dataset as the query virus, a different dataset from the same study, or a different study. Source data are available in S1 Data.
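The per-vOTU rule described in panel A (keep the best score if it is at least 75, otherwise report no predicted host) can be sketched as below. This is a minimal illustration under stated assumptions: the input is a hypothetical list of (vOTU, host genus, score) tuples rather than actual iPHoP output, and the function name `best_host_per_votu` is invented for this example.

```python
def best_host_per_votu(predictions, min_score=75.0):
    """Select the top-scoring host genus per vOTU and apply a score cutoff.

    predictions: iterable of (votu_id, host_genus, score) tuples (hypothetical layout).
    Returns a dict mapping votu_id -> (host_genus, score), or None when the
    best score falls below min_score.
    """
    best = {}
    for votu, genus, score in predictions:
        # Keep only the highest-scoring prediction seen so far for this vOTU.
        if votu not in best or score > best[votu][1]:
            best[votu] = (genus, score)
    return {
        votu: (hit if hit[1] >= min_score else None)
        for votu, hit in best.items()
    }


# Toy example with made-up identifiers and scores.
rows = [
    ("vOTU_1", "Bacteroides", 92.1),
    ("vOTU_1", "Prevotella", 80.0),
    ("vOTU_2", "Escherichia", 60.0),
]
hosts = best_host_per_votu(rows)
```

Raising `min_score` to 90 would reproduce the stricter cutoff used in panels B to D.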
Fig 5. Taxonomic and environmental distribution of hosts predicted using iPHoP from the IMG/VR v3 genomes.
(A) Archaeal (top left) and bacterial (bottom right) genome diversity from the GTDB database r202 [34]. The GTDB phylogenetic trees were collapsed at the phylum level. The status of virus association, i.e., isolated virus, predicted virus only at iPHoP score ≥95 or ≥90, or no prediction, was evaluated for each host genus, and the phyla shapes are colored according to the number of genera in each category within this phylum. (B) For each major biome type, the 10 host genera with the highest number of predicted IMG/VR high-quality virus genomes are included in the plot. Each host genus was also determined to be mainly detected in a biome type or detected across multiple biomes based on the distribution of MAGs assigned to this genus across ecosystems in the GEM catalog (see Methods). Source data are available in S1 Data.

References

    1. Fernández L, Rodríguez A, García P. Phage or foe: An insight into the impact of viral predation on microbial communities. ISME J. 2018;12:1171–1179. doi: 10.1038/s41396-018-0049-5
    2. Correa AMS, Howard-Varona C, Coy SR, Buchan A, Sullivan MB, Weitz JS. Revisiting the rules of life for viruses of microorganisms. Nat Rev Microbiol. 2021;0123456789:1–13. doi: 10.1038/s41579-021-00530-x
    3. Abeles SR, Pride DT. Molecular bases and role of viruses in the human microbiome. J Mol Biol. 2014;426:3892–3906. doi: 10.1016/j.jmb.2014.07.002
    4. Roux S, Adriaenssens EM, Dutilh BE, Koonin EV, Kropinski AM, Krupovic M, et al. Minimum information about an uncultivated virus genome (MIUVIG). Nat Biotechnol. 2019;37:29–37. doi: 10.1038/nbt.4306
    5. Taş N, de Jong AE, Li Y, Trubl G, Xue Y, Dove NC. Metagenomic tools in microbial ecology research. Curr Opin Biotechnol. 2021;67:184–191. doi: 10.1016/j.copbio.2021.01.019
