Quant Biol. 2020 Mar;8(1):64-77.
doi: 10.1007/s40484-019-0187-4.

Identifying viruses from metagenomic data using deep learning

Jie Ren et al. Quant Biol. 2020 Mar.

Abstract

Background: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.

Methods: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.

Results: Trained on viral RefSeq sequences discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROCs of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy for identifying under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.
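
The AUROC values reported above measure how often a viral sequence receives a higher score than a non-viral one. As an illustration only (this is not the authors' evaluation code, and the function name `auroc` is hypothetical), a minimal pure-Python sketch of the pairwise definition could look like this, assuming labels are 1 for viral and 0 for non-viral:

```python
def auroc(scores, labels):
    """Pairwise AUROC: the fraction of (viral, non-viral) pairs in which
    the viral sequence gets the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # viral scores
    neg = [s for s, y in zip(scores, labels) if y == 0]  # non-viral scores
    if not pos or not neg:
        raise ValueError("need at least one viral and one non-viral sequence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Two viral and two non-viral contigs; one viral contig is out-scored.
print(auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The quadratic loop is chosen for clarity; a rank-based formulation gives the same value and scales better to large test sets.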

Conclusions: Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.

Keywords: deep learning; machine learning; metagenome; virus identification.

Conflict of interest statement

The authors Jie Ren, Kai Song, Chao Deng, Nathan A. Ahlgren, Jed A. Fuhrman, Yi Li, Xiaohui Xie, Ryan Poplin and Fengzhu Sun declare that they have no conflicts of interest. All procedures performed in studies were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Figures

Figure 1. The deep learning framework of DeepVirFinder.
Sequences from viral genomes and prokaryotic genomes are used for training the model. The neural network is composed of a convolutional layer, a max pooling layer, a dense layer with a ReLU activation function, and a final dense layer with a sigmoid function that generates a prediction score between 0 and 1. A higher score indicates that a sequence is more likely to come from a viral genome. For each sequence, both the forward strand and its reverse complement are fed into the same neural network, and the final prediction score is the average of the two corresponding prediction scores.
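
The data flow described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a toy illustration with random, untrained weights, not the DeepVirFinder implementation. The assumptions are: one-hot encoding of the four bases, max pooling taken over all positions, no bias terms, and no activation on the convolutional layer. All function names and layer sizes are hypothetical.

```python
import math
import random

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def one_hot(seq):
    # One row per base; unknown bases such as N become all-zero rows.
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]


def reverse_complement(seq):
    return "".join(COMPLEMENT.get(b, "N") for b in reversed(seq.upper()))


def make_toy_weights(n_filters=4, filter_len=6, n_hidden=3, seed=0):
    # Random weights for illustration only; a real model learns these.
    rng = random.Random(seed)
    return {
        "conv": [[[rng.uniform(-1, 1) for _ in range(4)]
                  for _ in range(filter_len)] for _ in range(n_filters)],
        "dense": [[rng.uniform(-1, 1) for _ in range(n_filters)]
                  for _ in range(n_hidden)],
        "out": [rng.uniform(-1, 1) for _ in range(n_hidden)],
    }


def forward(seq, w):
    """Convolution -> max pooling -> dense + ReLU -> dense + sigmoid."""
    x = one_hot(seq)
    k = len(w["conv"][0])  # filter length; seq must be at least this long
    pooled = []
    for filt in w["conv"]:
        acts = [sum(filt[j][c] * x[i + j][c] for j in range(k) for c in range(4))
                for i in range(len(x) - k + 1)]
        pooled.append(max(acts))  # max over all positions
    hidden = [max(0.0, sum(a * p for a, p in zip(row, pooled)))
              for row in w["dense"]]
    logit = sum(a * h for a, h in zip(w["out"], hidden))
    return 1.0 / (1.0 + math.exp(-logit))


def predict(seq, w):
    # Final score: average of the forward and reverse-complement scores.
    return 0.5 * (forward(seq, w) + forward(reverse_complement(seq), w))


weights = make_toy_weights()
print(predict("ACGTTGCAAGGCTA", weights))
```

Because the two strand scores are averaged, `predict` returns the same value for a sequence and for its reverse complement, which is useful when the strand of a contig is unknown.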
Figure 2. Comparison of DeepVirFinder with VirFinder and the effect of contig length and mutation rates on the performance of DeepVirFinder.
(A) AUROCs for VirFinder and DeepVirFinder when trained on sequences before May 2015 and tested on sequences after May 2015. See Supplementary Fig. S1 for the exact numbers and the standard errors. (B) AUROCs for different combinations of sequence lengths used for training and testing. (C) AUROCs for prediction when adding mutations at different rates.
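
Panel (C) involves adding mutations to test sequences at a given rate. The caption does not specify the mutation model, so the sketch below assumes the simplest one: independent substitutions, each position replaced with a different base chosen uniformly at random. The function name `mutate` is hypothetical.

```python
import random


def mutate(seq, rate, rng=None):
    """Substitute each base independently with probability `rate`.

    Assumption: substitutions only (no insertions or deletions), and the
    replacement base is drawn uniformly from the three other bases.
    """
    rng = rng or random.Random()
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


original = "ACGT" * 10
mutated = mutate(original, 0.1, random.Random(42))
n_changed = sum(a != b for a, b in zip(original, mutated))
print(mutated, n_changed)
```

Scoring the mutated sequences with a trained model and recomputing the AUROC at each rate would give a curve of the kind the panel describes.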
Figure 3. Comparison of AUROCs between the model trained using only viral RefSeq and the model trained using the enlarged dataset including millions of sequences from metavirome samples.
(A) The AUROCs for predicting 500 bp viral sequences from different host phyla. The under-represented virus groups include viruses infecting Crenarchaeota and Bacteroidetes. (B) The overall AUROCs of the two models at different sequence lengths.
Figure 4. Evaluation of the performance of DeepVirFinder on viral contigs of variable lengths in simulated metagenomic samples with various viral fractions.
(A) The distribution of contig lengths used for simulating metagenomic samples. (B) AUROC and (C) AUPRC for predicting viral sequences at various viral fractions (10%, 50% and 90%) for contigs of different lengths.
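
AUPRC is reported alongside AUROC here because, unlike AUROC, it changes with the fraction of viral sequences in a sample. A minimal sketch, assuming AUPRC is approximated by average precision (the mean of the precision values at each correctly ranked viral sequence); the function name is hypothetical, and tied scores keep their input order:

```python
def average_precision(scores, labels):
    """Average precision: a step-wise approximation of the area under the
    precision-recall curve. Labels are 1 (viral) or 0 (non-viral)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one viral sequence")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Same score distributions, different viral fractions.
balanced = average_precision([0.9, 0.7, 0.6, 0.2], [1, 0, 1, 0])
fewer_viral = average_precision([0.9, 0.7, 0.7, 0.6, 0.2, 0.2],
                                [1, 0, 0, 1, 0, 0])
print(balanced, fewer_viral)
```

In the second call each non-viral score is duplicated, which lowers the viral fraction from 50% to about 33%. The pairwise ranking of viral against non-viral scores is unchanged, but the average precision drops, since every false positive above a viral sequence is now counted twice.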
