Brief Bioinform. 2022 Sep 20;23(5):bbac182.

doi: 10.1093/bib/bbac182.

CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Jiayu Shang¹, Yanni Sun¹

PMID: 35595715
PMCID: PMC9487644
DOI: 10.1093/bib/bbac182

Jiayu Shang et al. Brief Bioinform. 2022.

Affiliation

¹ Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China SAR.

Abstract

Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Experimental methods for host prediction cannot keep pace with the fast accumulation of sequenced phages. Thus, there is a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited number of known interactions and the sheer number of phages sequenced by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43% accuracy at the species level. In this work, we formulate host prediction as link prediction in a knowledge graph that integrates multiple protein and DNA-based sequence features. Our implementation, named CHERRY, can be applied to predict hosts for newly discovered viruses and to identify viruses infecting targeted bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with 11 popular host prediction methods. To the best of our knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37%. In addition, CHERRY's performance on short contigs is more stable than that of other tools.
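The link-prediction framing can be illustrated with a minimal sketch. This is not CHERRY's implementation: the toy graph, the random node features, the single graph-convolution layer and the dot-product decoder are all assumptions chosen for brevity, and the weights are untrained. It only shows the shape of the idea: an encoder mixes each node's features with those of its neighbors, and a decoder scores a candidate virus-prokaryote pair.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize A + I, as in a standard graph convolution."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(adj, features, weight):
    """One graph-convolution layer: ReLU(A_norm @ X @ W)."""
    return np.maximum(normalize_adjacency(adj) @ features @ weight, 0.0)

def decode(embeddings, virus_idx, host_idx):
    """Score a candidate virus-host link (hypothetical dot-product decoder)."""
    logit = embeddings[virus_idx] @ embeddings[host_idx]
    return 1.0 / (1.0 + np.exp(-logit))

# Toy graph (assumed): nodes 0-1 are prokaryotes, nodes 2-4 are viruses.
adj = np.zeros((5, 5))
for i, j in [(0, 2), (2, 3), (1, 4)]:
    adj[i, j] = adj[j, i] = 1.0

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))   # stand-in for sequence-derived features
weight = rng.normal(size=(4, 3))     # untrained weights, for illustration only

embeddings = encode(adj, features, weight)

# Rank both candidate hosts for the query virus (node 3) by link score.
scores = {host: decode(embeddings, 3, host) for host in (0, 1)}
predicted_host = max(scores, key=scores.get)
```

In a trained model the weights would be fitted on known virus-host pairs, so that true links score higher than sampled negative pairs.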

Keywords: deep learning; graph convolutional network; link prediction; phage host prediction.

Figures

**Figure 1**
The key components of CHERRY. **(A)** The multimodal knowledge graph. Triangles represent prokaryotic nodes and circles represent virus nodes. Different colors represent different taxonomic labels of the prokaryotes. I–III illustrate graph convolution using neighbors of increasing orders. **(B)** The graph convolutional encoder of CHERRY. **(C)** The decoder of CHERRY.

**Figure 2**
The benchmark dataset for virus–host interactions. (A) The viruses and their hosts in the training and test set, respectively. (B) Similarity between viruses in the training and test sets.

**Figure 3**
Visualization of the multimodal graph. The colors of the nodes represent their labels. For prokaryotic nodes, the labels represent their species. For virus nodes, the labels represent their hosts’ species. Because there are a large number of labels, this graph only colors the top eight labels with the largest number of nodes. All others are gray.

**Figure 4**
Host prediction performance on the benchmark dataset. Y-axis: accuracy. **(A)** 10-fold cross-validation on the training set. The X-axis represents different types of graphs and BLAST-based host prediction. *without graph*: training with only the decoder. *virus–virus*: the knowledge graph contains only *virus–virus* edges. *virus–prokaryote*: the knowledge graph contains only *virus–prokaryote* edges. *random sampling*: the model is trained on the complete graph with a randomly sampled negative set. *complete graph*: the model is trained on the complete graph with negative sampling. Error bars represent the highest, lowest and average accuracy of the 10-fold cross-validation. **(B)** Comparison of host prediction accuracy on the test set from species to phylum. Tools that can output predictions at the species level (PHIST, PHIAF, vHULK, DeepHost, VHM-net and CHERRY) are grouped together and ordered based on their species-level performance.

**Figure 5**
The impact of training-vs-test sequence similarity on host prediction at the genus level. X-axis: maximum dashing similarity. Left Y-axis: accuracy (line segments). Right Y-axis: number of test viruses under each similarity cutoff (gray bars).

**Figure 6**
Host prediction accuracy for viruses that lack significant alignments against the prokaryotes. X-axis: taxonomic rankings. Y-axis: accuracy.

**Figure 7**
A case study for a sub-graph with heterogeneous labels. Triangles: prokaryotic nodes. Circles: virus nodes. White nodes: test viruses. Nodes with other colors: training samples. Different colors represent different species/labels. The open-end edges adjacent to the nodes indicate that these nodes have more connections.

**Figure 8**
Host prediction performance on contigs. X-axis: length of the input contigs. Y-axis: accuracy.

**Figure 9**
Host prediction results for different groups of viruses. X-axis: taxonomic rank. Y-axis: accuracy. ALL: the accuracy on the whole dataset, which is the same as Figure 4 (B). Other: accuracy for viruses that do not belong to *Caudovirales*.

**Figure 10**
The experimental results of top-k prediction. **(A)**: Tendency of the prediction score. X-axis: the sorted index by k. Y-axis: average score of the top-k prediction. **(B)**: The accuracy using top-k prediction.
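As a rough illustration of how an accuracy over ranked predictions is commonly computed, the sketch below counts a virus as correct when its true host is among the highest-scoring candidates. The function name, the cutoff parameter `k` and the score matrix are hypothetical; the paper's exact evaluation protocol is not given in this caption.

```python
import numpy as np

def top_k_accuracy(scores, true_hosts, k):
    """Fraction of viruses whose true host is among the k best-scored candidates.

    scores: (n_viruses, n_hosts) array of prediction scores (made-up values here).
    true_hosts: length n_viruses array of true host column indices.
    """
    scores = np.asarray(scores, dtype=float)
    true_hosts = np.asarray(true_hosts)
    # Column indices of the k highest scores in each row.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_hosts[:, None]).any(axis=1)
    return float(hits.mean())

# Three viruses scored against three candidate hosts (illustrative numbers).
scores = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.30, 0.50],
    [0.10, 0.60, 0.30],
])
true_hosts = np.array([0, 1, 2])

acc_top1 = top_k_accuracy(scores, true_hosts, k=1)  # only the first virus is a hit
acc_top2 = top_k_accuracy(scores, true_hosts, k=2)  # all three true hosts are in the top two
```

Widening the cutoff can only keep or raise this accuracy, which is why it is usually reported alongside the scores of the ranked predictions.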

**Figure 11**
The experimental results on the MetaHiC dataset. For each bin, we use the lowest rank of the assigned taxon as the host label for the phage contigs in the bin. Phage contigs from the same bins have the same label. **(A)**: The number of phage contigs with host labels at different taxonomic ranks. **(B)**: The host prediction accuracy (Y-axis) on the 6545 phage contigs. The comparison includes six tools that can predict hosts at species level.
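The labeling rule in this caption (use the lowest assigned rank of a bin's taxon as the host label) can be sketched as follows. The rank list, the dictionary layout and the function name are assumptions for illustration; the actual MetaHiC processing is not described in this caption.

```python
# Taxonomic ranks from highest to lowest (assumed set for this sketch).
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

def lowest_assigned_rank(taxonomy):
    """Return (rank, name) for the most specific rank that has an assignment.

    taxonomy: dict mapping rank name to taxon name; missing or empty values
    mean the bin was not resolved at that rank. Returns None if nothing is set.
    """
    for rank in reversed(RANKS):
        name = taxonomy.get(rank)
        if name:
            return rank, name
    return None

# A bin resolved only down to genus: every phage contig in it gets this label.
bin_taxonomy = {
    "phylum": "Proteobacteria",
    "class": "Gammaproteobacteria",
    "order": "Enterobacterales",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia",
    "species": None,
}
host_label = lowest_assigned_rank(bin_taxonomy)
```

Under this rule a prediction for a contig can only be evaluated down to the rank at which its bin was labeled.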

**Figure 12**
Host prediction on the glacier metagenomic data. The numbers without parentheses represent the number of viruses. The numbers with parentheses represent the number of viruses with the same predicted hosts. For example, 12 viruses have predictions by both CHERRY and BLASTN and 11 of them have the same predicted hosts.

**Figure 13**
Host prediction on the gut metagenomic data. The numbers without parentheses represent the number of viruses. The numbers with parentheses represent the number of viruses with the same predicted hosts.

**Figure 14**
The precision-recall curve of predicting viruses infecting targeted prokaryotes. X-axis: recall. Y-axis: precision. The performance at three thresholds (0.95, 0.9 and 0.8) is marked with cross signs on the curve.
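To make the threshold markers concrete, the sketch below computes one precision-recall point per score threshold. The scores and labels are invented for illustration and are not results from the paper. Treating precision as 1.0 when nothing is predicted is a convention assumed here.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical scores for six candidate viruses; True means it infects the target.
scores = [0.99, 0.96, 0.92, 0.85, 0.81, 0.40]
labels = [True, True, False, True, False, True]

points = {t: precision_recall_at(scores, labels, t) for t in (0.95, 0.9, 0.8)}
```

Lowering the threshold admits more candidates, which tends to raise recall at the cost of precision. Each threshold corresponds to a single point on the curve.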

Cited by

Correlation between the gut microbiome and neurodegenerative diseases: a review of metagenomics evidence.
Liu X, Liu Y, Liu J, Zhang H, Shan C, Guo Y, Gong X, Cui M, Li X, Tang M. Neural Regen Res. 2024 Apr;19(4):833-845. doi: 10.4103/1673-5374.382223. PMID: 37843219. Review.

IPEV: identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning.
Yin H, Wu S, Tan J, Guo Q, Li M, Guo J, Wang Y, Jiang X, Zhu H. Gigascience. 2024 Jan 2;13:giae018. doi: 10.1093/gigascience/giae018. PMID: 38649300.

Comparative Genomics of Closely-Related Gordonia Cluster DR Bacteriophages.
Versoza CJ, Howell AA, Aftab T, Blanco M, Brar A, Chaffee E, Howell N, Leach W, Lobatos J, Luca M, Maddineni M, Mirji R, Mitra C, Strasser M, Munig S, Patel Z, So M, Sy M, Weiss S, Pfeifer SP. Viruses. 2022 Jul 27;14(8):1647. doi: 10.3390/v14081647. PMID: 36016269.

Identification and classification of the genomes of novel microviruses in poultry slaughterhouse.
Xie K, Lin B, Sun X, Zhu P, Liu C, Liu G, Cao X, Pan J, Qiu S, Yuan X, Liang M, Jiang J, Yuan L. Front Microbiol. 2024 May 2;15:1393153. doi: 10.3389/fmicb.2024.1393153. eCollection 2024. PMID: 38756731.

Protein Set Transformer: A protein-based genome language model to power high diversity viromics.
Martin C, Gitter A, Anantharaman K. bioRxiv [Preprint]. 2025 Jun 4:2024.07.26.605391. doi: 10.1101/2024.07.26.605391. PMID: 39131363. Preprint.
