Front Genet. 2018 Aug 7;9:304.
doi: 10.3389/fgene.2018.00304. eCollection 2018.

MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins


Deyvid Amgarten et al. Front Genet. 2018.

Abstract

Here we present MARVEL, a tool for prediction of double-stranded DNA bacteriophage sequences in metagenomic bins. MARVEL uses a random forest machine learning approach. We trained the program on a dataset of 1,247 phage and 1,029 bacterial genomes, and tested it on a dataset of 335 bacterial and 177 phage genomes. We show that three simple genomic features extracted from contig sequences were sufficient to achieve good performance in separating bacterial from phage sequences: gene density, strand shifts, and fraction of significant hits to a viral protein database. We compared the performance of MARVEL to that of VirSorter and VirFinder, two popular programs for predicting viral sequences. Our results show that all three programs have comparable specificity, but MARVEL achieves much better recall (sensitivity). This means that MARVEL should be able to identify many more phage sequences in metagenomic bins than has previously been possible. In a simple test with real data containing mostly bacterial sequences, MARVEL classified 58 out of 209 bins as phage genomes; other evidence suggests that 57 of these 58 bins are novel phage sequences. MARVEL is freely available at https://github.com/LaboratorioBioinformatica/MARVEL.

Keywords: machine learning; microbiome; phage; random forest; virus.
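
The three features named in the abstract can be illustrated with a minimal sketch. This is not MARVEL's code: the `Gene` record, the `contig_features` function, and the exact normalizations are assumptions made for illustration. Gene density is taken here as genes per kilobase, strand shifts as the number of strand changes between neighbouring genes divided by the total number of genes, and the viral hit fraction as the share of genes flagged as having a significant hit. Gene calls and database hits are assumed to be precomputed by upstream tools.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gene:
    """Hypothetical record for one predicted gene on a contig."""
    start: int        # start coordinate on the contig
    end: int          # end coordinate on the contig
    strand: str       # "+" or "-"
    viral_hit: bool   # True if the gene has a significant hit to a viral protein database


def contig_features(genes: List[Gene], contig_length: int) -> Dict[str, float]:
    """Compute three per-contig features (normalizations are assumptions)."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    n = len(genes)
    if n == 0:
        # No predicted genes: every feature is defined as zero in this sketch.
        return {"gene_density": 0.0, "strand_shifts": 0.0, "viral_hit_fraction": 0.0}

    # Order genes by position so that neighbours in the list are neighbours on the contig.
    ordered = sorted(genes, key=lambda g: g.start)
    # Count how often the coding strand changes between consecutive genes.
    shifts = sum(1 for a, b in zip(ordered, ordered[1:]) if a.strand != b.strand)

    return {
        "gene_density": n / (contig_length / 1000),            # genes per kb (assumed unit)
        "strand_shifts": shifts / n,                           # strand changes per gene
        "viral_hit_fraction": sum(g.viral_hit for g in genes) / n,
    }
```

A feature vector of this kind, computed per contig or per bin, is what a random forest classifier would take as input; the classifier itself is omitted here.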


Figures

FIGURE 1
Scatter plot of bacterial and phage genomes using two of the three features as axes: strand shifts by total number of genes and density of genes. Green and red dots represent bacterial and phage genomes, respectively.
FIGURE 2
MARVEL’s performance in simulated bins obtained from the testing set of RefSeq genomes. Recall, specificity, accuracy and F1 score are shown for bins composed of different contig lengths.
FIGURE 3
Performance comparison of MARVEL, VirSorter, and VirFinder. Means were compared using the Wilcoxon signed-rank test. Error bars show the standard deviation of 30 replicates. An asterisk denotes a statistically significant difference.
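
The four measures reported in Figure 2 (recall, specificity, accuracy, and F1 score) follow from a confusion matrix in the standard way. The sketch below assumes phage is the positive class; the function name `binary_metrics` is hypothetical and not taken from the MARVEL code.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification measures, with phage as the positive class.

    tp: phage bins predicted as phage      fp: bacterial bins predicted as phage
    tn: bacterial bins predicted bacterial fn: phage bins predicted as bacterial
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    # Each ratio falls back to 0.0 when its denominator is zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0           # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total
    # F1 written in its count form, equivalent to 2PR / (P + R).
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"recall": recall, "specificity": specificity, "accuracy": accuracy, "f1": f1}
```

Recall depends only on how the true phage bins are classified, and specificity only on the bacterial bins, so two tools can match on specificity while differing widely on recall, which is the pattern the abstract describes.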
