Genome Biol. 2024 Jul 4;25(1):177.
doi: 10.1186/s13059-024-03320-9.

VirRep: a hybrid language representation learning framework for identifying viruses from human gut metagenomes

Yanqi Dong et al. Genome Biol. 2024.

Abstract

Identifying viruses from metagenomes is a common step to explore the virus composition in the human gut. Here, we introduce VirRep, a hybrid language representation learning framework, for identifying viruses from human gut metagenomes. VirRep combines a context-aware encoder and an evolution-aware encoder to improve sequence representation by incorporating k-mer patterns and sequence homologies. Benchmarking on both simulated and real datasets with varying viral proportions demonstrates that VirRep outperforms state-of-the-art methods. When applied to fecal metagenomes from a colorectal cancer cohort, VirRep identifies 39 high-quality viral species associated with the disease, many of which cannot be detected by existing methods.

Keywords: Human gut metagenomes; Language representation learning; Virus identification.
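
The abstract notes that VirRep's sequence representation incorporates k-mer patterns. As a rough illustration of what a k-mer view of a sequence looks like, the sketch below splits a DNA string into overlapping k-mers and counts them. The function names, the value of `k`, and the stride are assumptions chosen for illustration; they are not taken from the paper and do not reproduce VirRep's actual tokenization.

```python
from collections import Counter


def kmer_tokens(seq, k=7, stride=1):
    """Split a DNA sequence into overlapping k-mers.

    Hypothetical helper: k and stride are illustrative defaults,
    not VirRep's settings.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_profile(seq, k=7):
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmer_tokens(seq, k=k))


if __name__ == "__main__":
    print(kmer_tokens("ACGTAC", k=3))
    print(kmer_profile("ACGACG", k=3))
```

A language-model encoder would typically map such tokens to embeddings, whereas a homology-based component compares sequences by alignment; how VirRep combines the two is described in the full article.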

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic overview of the VirRep framework. a Workflow of VirRep for predicting viruses from metagenomes. b The detailed model architecture of VirRep. c The multi-step training strategy based on the pre-train-fine-tune paradigm to train VirRep
Fig. 2
Performance of VirRep and other methods on multiple human gut virome datasets. a–f The MCC values of VirRep and the eight popular methods on the GGCM-test dataset (a), IMGVR-gut dataset (b), DEVoC dataset (c), GPIC dataset (d), crAss-like phage dataset (e), and Lak-phage dataset (f) at various sequence length intervals. g The runtime of each method across the five sequence length intervals, where the average runtime is represented by the bar height and the error bars depict the 95% confidence intervals
Fig. 3
The ablation experiments of the two encoders, pre-training, and the first-stage fine-tuning. a Radar plots showing the MCC, precision, recall, and specificity achieved by the full implementation of VirRep, the semantic-encoder-based classifier, and the alignment-encoder-based predictor on the GGCM-test dataset. b The distribution of MCC values for VirRep with complete training through all stages (full training) compared to the version without pre-training, and the version with pre-training but without first-stage fine-tuning. Comparisons are shown across five sequence length intervals on the GGCM-test, IMGVR-gut, DEVoC, GPIC, crAss-like phage and Lak-phage datasets. Significance levels are denoted as ****: P < 0.0001, ***: P < 0.001, **: P < 0.01, *: P < 0.05, based on the paired t-test. Each point represents a test dataset
Fig. 4
Comparing VirRep’s performance with that of other methods and method combinations on simulated metagenomic samples with varying viral proportions. a Precision-recall curves for VirRep, geNomad, and the six alignment-free methods at viral proportions of 5, 10, 50, and 90%. Numbers show the AUPRC (area under the precision-recall curve) values. b Average F1 score, precision and recall for VirRep, geNomad, and the five method combinations composed of VirSorter2 and one alignment-free method at viral proportions of 5, 10, 50, and 90%. c Average F1 score, precision and recall for VirRep, geNomad, and method combinations composed of VIBRANT and one alignment-free method at viral proportions of 5, 10, 50, and 90%. Error bars show the 95% confidence intervals over 5 replicates
Fig. 5
Application of VirRep to 128 real human gut metagenomes from 74 colorectal cancer patients and 54 healthy controls. a The number of viral populations of each category (x-axis) obtained by each method (y-axis). The maximum value in each column is highlighted in bold red font. b The significance (log10 transformed q-values) of viral populations (VPs) is given by the bar height. Horizontal line shows FDR at the level of 0.05. VPs with P < 0.001 and q < 0.05 are colored in dark gray, while others are colored in light gray. Shown are the top 90 significant VPs. c The average accuracy of tenfold cross-validation (repeated 10 times) of the logistic regression models versus the size of the marker set for each method. d Genome maps for viruses VP1279 and VP2811. e The phylogenetic tree of viruses VP1279 and VP2811
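
The figure captions above report MCC, precision, recall, specificity, and F1 score. For readers unfamiliar with these metrics, the following is a minimal sketch of their standard definitions computed from confusion-matrix counts. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example; this is not the evaluation code used in the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from counts.

    Hypothetical helper; undefined ratios (zero denominator) are
    reported as 0.0 by convention in this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient (MCC)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Example counts are made up for illustration only.
    print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```

MCC uses all four cells of the confusion matrix, which makes it less sensitive than accuracy to class imbalance; this matters when the viral proportion of a sample varies.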
