Brief Bioinform. 2022 Sep 20;23(5):bbac182.
doi: 10.1093/bib/bbac182.

CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Jiayu Shang et al. Brief Bioinform.

Abstract

Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Experimental methods for host prediction cannot keep pace with the fast accumulation of sequenced phages. Thus, there is a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited number of known interactions and the sheer number of phages sequenced by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43% accuracy at the species level. In this work, we formulate host prediction as link prediction in a knowledge graph that integrates multiple protein and DNA-based sequence features. Our implementation, named CHERRY, can be applied to predict hosts for newly discovered viruses and to identify viruses infecting targeted bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with 11 popular host prediction methods. To the best of our knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all existing methods at the species level with an accuracy increase of 37%. In addition, CHERRY's performance on short contigs is more stable than that of other tools.

Keywords: deep learning; graph convolutional network; link prediction; phage host prediction.
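
The link-prediction formulation in the abstract can be illustrated with a minimal sketch. It assumes that each virus and prokaryote node already has an embedding vector and that a candidate edge is scored by a dot product followed by a sigmoid. That decoder, the function names and the toy embeddings are illustrative assumptions, not CHERRY's actual architecture or data. The sketch covers both applications: picking the best-scoring host for a virus, and listing viruses whose score against a target host passes a threshold.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def link_score(virus_vec, host_vec):
    """Score a candidate virus-prokaryote edge (assumed decoder: dot product + sigmoid)."""
    return sigmoid(sum(v * h for v, h in zip(virus_vec, host_vec)))


def predict_host(virus_vec, host_embeddings):
    """Application 1: return the best-scoring host for one virus, plus all scores."""
    scores = {name: link_score(virus_vec, vec) for name, vec in host_embeddings.items()}
    return max(scores, key=scores.get), scores


def viruses_for_host(host_vec, virus_embeddings, threshold):
    """Application 2: list viruses whose edge score to a target host reaches the threshold."""
    return [name for name, vec in virus_embeddings.items()
            if link_score(vec, host_vec) >= threshold]


# Hypothetical toy embeddings; real embeddings would come from a trained encoder.
host_embeddings = {
    "host_A": [1.0, 0.0, 0.5],
    "host_B": [-0.5, 1.0, 0.0],
    "host_C": [0.0, -1.0, 1.0],
}
virus_embeddings = {
    "virus_1": [0.9, -0.2, 0.4],
    "virus_2": [-1.0, 0.5, -0.5],
}

best_host, scores = predict_host(virus_embeddings["virus_1"], host_embeddings)
candidates = viruses_for_host(host_embeddings["host_A"], virus_embeddings, threshold=0.5)
print(best_host, candidates)
```

In this toy setup the same scoring function serves both directions, which is the practical appeal of treating host prediction as link prediction.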

PubMed Disclaimer

Figures

Figure 1
The key components of CHERRY. (A) The multimodal knowledge graph. Triangles represent prokaryotic nodes and circles represent virus nodes. Different colors represent different taxonomic labels of the prokaryotes. I–III illustrate graph convolution using neighbors of increasing order. (B) The graph convolutional encoder of CHERRY. (C) The decoder of CHERRY.
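
Panels I–III of Figure 1(A) describe graph convolution reaching neighbors of increasing order. The sketch below illustrates that idea only: it uses plain mean aggregation with no learned weights or nonlinearity, which is a simplification assumed here and not the encoder from the paper. The four-node path graph and node names are hypothetical. With one-hot input features, the nonzero entries of a node's vector show which nodes have influenced it after each step.

```python
def mean_aggregate(features, neighbors):
    """One propagation step: each node's new vector is the mean of its own
    vector and the vectors of its direct neighbors."""
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in neighbors[node]]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


# Hypothetical path graph: v1 - p1 - v2 - p2 (v = virus, p = prokaryote).
order = ["v1", "p1", "v2", "p2"]
neighbors = {"v1": ["p1"], "p1": ["v1", "v2"], "v2": ["p1", "p2"], "p2": ["v2"]}

# One-hot features make each node's contribution traceable.
features = {n: [1.0 if i == j else 0.0 for j in range(len(order))]
            for i, n in enumerate(order)}


def receptive_field(vec):
    """Nodes that contribute a nonzero weight to this vector."""
    return {order[j] for j, w in enumerate(vec) if w > 0}


h = features
fields = []
for _ in range(3):
    h = mean_aggregate(h, neighbors)
    fields.append(receptive_field(h["v1"]))

for step, field in enumerate(fields, start=1):
    print(step, sorted(field))
```

Each additional step lets node v1 see one hop further along the path, which mirrors the increasing neighbor order shown in the panels.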
Figure 2
The benchmark dataset for virus–host interactions. (A) The viruses and their hosts in the training and test set, respectively. (B) Similarity between viruses in the training and test sets.
Figure 3
Visualization of the multimodal graph. The colors of the nodes represent their labels. For prokaryotic nodes, the labels represent their species. For virus nodes, the labels represent their hosts’ species. Because there are many labels, only the eight labels with the most nodes are colored. All other nodes are gray.
Figure 4
Host prediction performance on the benchmark dataset. Y-axis: accuracy. (A) 10-fold cross-validation on the training set. X-axis: different types of graphs and BLAST-based host prediction. Without graph: training with only the decoder. Virus–virus: the knowledge graph contains only virus–virus edges. Virus–prokaryote: the knowledge graph contains only virus–prokaryote edges. Random sampling: the model is trained on the complete graph with a randomly sampled negative set. Complete graph: the model is trained on the complete graph with negative sampling. Error bars show the highest, lowest and average accuracy of the 10-fold cross-validation. (B) Comparison of host prediction accuracy on the test set from species to phylum. Tools that can output predictions at the species level (PHIST, PHIAF, vHULK, DeepHost, VHM-net and CHERRY) are grouped together and ordered by their species-level performance.
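
Figure 4(B) reports accuracy at ranks from species to phylum. One common way to compute such numbers, assumed here rather than taken from the paper, is to count a prediction as correct at a given rank when the predicted and true hosts share the same taxon at that rank. The lineage table and the (true, predicted) pairs below are hypothetical placeholders.

```python
RANKS = ("species", "genus", "family", "order", "class", "phylum")

# Hypothetical lineages, listed in the same order as RANKS.
LINEAGE = {
    "species_1": ("species_1", "genus_A", "family_F1", "order_O1", "class_C1", "phylum_P1"),
    "species_2": ("species_2", "genus_A", "family_F1", "order_O1", "class_C1", "phylum_P1"),
    "species_3": ("species_3", "genus_B", "family_F1", "order_O1", "class_C1", "phylum_P1"),
    "species_4": ("species_4", "genus_C", "family_F2", "order_O2", "class_C2", "phylum_P2"),
}


def accuracy_by_rank(pairs):
    """pairs: list of (true_species, predicted_species).
    Returns the fraction of pairs that agree at each taxonomic rank."""
    result = {}
    for i, rank in enumerate(RANKS):
        correct = sum(LINEAGE[true][i] == LINEAGE[pred][i] for true, pred in pairs)
        result[rank] = correct / len(pairs)
    return result


# Hypothetical (true host, predicted host) pairs for four test viruses.
pairs = [
    ("species_1", "species_1"),  # correct at every rank
    ("species_1", "species_2"),  # wrong species, same genus
    ("species_2", "species_3"),  # wrong genus, same family
    ("species_3", "species_4"),  # wrong at every rank
]

rank_accuracy = accuracy_by_rank(pairs)
for rank in RANKS:
    print(rank, rank_accuracy[rank])
```

Under this definition accuracy can only stay equal or rise when moving from species toward phylum, since agreement at a lower rank implies agreement at every higher rank.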
Figure 5
The impact of training-vs-test sequence similarity on host prediction at the genus level. X-axis: maximum Dashing similarity. Left Y-axis: accuracy (line segments). Right Y-axis: number of test viruses under each similarity cutoff (gray bars).
Figure 6
Host prediction accuracy for viruses that lack significant alignments against the prokaryotes. X-axis: taxonomic rankings. Y-axis: accuracy.
Figure 7
A case study for a sub-graph with heterogeneous labels. Triangles: prokaryotic nodes. Circles: virus nodes. White nodes: test viruses. Nodes with other colors: training samples. Different colors represent different species/labels. The open-end edges adjacent to the nodes indicate that these nodes have more connections.
Figure 8
Host prediction performance on contigs. X-axis: length of the input contigs. Y-axis: accuracy.
Figure 9
Host prediction results for different groups of viruses. X-axis: taxonomic rank. Y-axis: accuracy. ALL: the accuracy on the whole dataset, which is the same as Figure 4 (B). Other: accuracy for viruses that do not belong to Caudovirales.
Figure 10
The experimental results of top-k prediction. (A) Tendency of the prediction score. X-axis: the sorted index by k. Y-axis: average score of the top-k prediction. (B) The accuracy using top-k prediction.
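
The accuracy in panel (B) can be read as a standard top-k accuracy: a prediction counts as correct when the true host is among the k highest-scoring candidates. The sketch below assumes that reading. The score table, host names and ground-truth labels are hypothetical.

```python
def top_k_hosts(scores, k):
    """Return the k candidate hosts with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_accuracy(predictions, truths, k):
    """Fraction of viruses whose true host appears among their top-k candidates."""
    hits = sum(truths[virus] in top_k_hosts(scores, k)
               for virus, scores in predictions.items())
    return hits / len(predictions)


# Hypothetical per-virus scores for three candidate hosts.
predictions = {
    "virus_1": {"host_A": 0.95, "host_B": 0.60, "host_C": 0.10},
    "virus_2": {"host_A": 0.40, "host_B": 0.55, "host_C": 0.50},
    "virus_3": {"host_A": 0.30, "host_B": 0.20, "host_C": 0.25},
}
truths = {"virus_1": "host_A", "virus_2": "host_C", "virus_3": "host_B"}

accuracies = {k: top_k_accuracy(predictions, truths, k) for k in (1, 2, 3)}
print(accuracies)
```

Accuracy is non-decreasing in k because each larger candidate list contains the smaller one.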
Figure 11
The experimental results on the MetaHiC dataset. For each bin, we use the lowest rank of the assigned taxon as the host label for the phage contigs in the bin. Phage contigs from the same bin have the same label. (A) The number of phage contigs with host labels at different taxonomic ranks. (B) The host prediction accuracy (Y-axis) on the 6545 phage contigs. The comparison includes six tools that can predict hosts at the species level.
Figure 12
Host prediction on the glacier metagenomic data. The numbers outside parentheses represent the number of viruses. The numbers in parentheses represent the number of viruses with the same predicted hosts. For example, 12 viruses have predictions by both CHERRY and BLASTN, and 11 of them have the same predicted hosts.
Figure 13
Host prediction on the gut metagenomic data. The numbers outside parentheses represent the number of viruses. The numbers in parentheses represent the number of viruses with the same predicted hosts.
Figure 14
The precision-recall curve for predicting viruses infecting targeted prokaryotes. X-axis: recall. Y-axis: precision. The performance at three thresholds (0.95, 0.9 and 0.8) is marked with a cross on the curve.
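
The three marked points on the curve correspond to precision and recall computed at fixed score thresholds. The sketch below shows that calculation using the standard definitions, on hypothetical scored virus-host pairs. The data and the convention of reporting a precision of 1.0 when nothing passes the threshold are assumptions made for illustration.

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs: list of (score, is_true_interaction).
    A pair is predicted positive when its score is at or above the threshold."""
    tp = sum(1 for score, label in scored_pairs if score >= threshold and label)
    fp = sum(1 for score, label in scored_pairs if score >= threshold and not label)
    fn = sum(1 for score, label in scored_pairs if score < threshold and label)
    # Assumed convention: precision is 1.0 when no pair is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical prediction scores with ground-truth labels.
scored_pairs = [
    (0.99, True), (0.96, True), (0.93, True), (0.91, False),
    (0.85, True), (0.82, False), (0.60, True), (0.30, False),
]

points = {t: precision_recall(scored_pairs, t) for t in (0.95, 0.9, 0.8)}
for threshold, (precision, recall) in points.items():
    print(threshold, round(precision, 3), round(recall, 3))
```

Lowering the threshold in this toy data trades precision for recall, which is the trade-off the marked points make visible.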
