PeerJ. 2017 Sep 21:5:e3817.
doi: 10.7717/peerj.3817. eCollection 2017.

Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity

Simon Roux et al. PeerJ. 2017.

Abstract

Background: Viral metagenomics (viromics) is increasingly used to obtain uncultivated viral genomes, evaluate community diversity, and assess ecological hypotheses. While viromic experimental methods are relatively mature and widely accepted by the research community, robust bioinformatics standards remain to be established. Here we used in silico mock viral communities to evaluate the viromic sequence-to-ecological-inference pipeline, including (i) read pre-processing and metagenome assembly, (ii) thresholds applied to estimate viral relative abundances based on read mapping to assembled contigs, and (iii) normalization methods applied to the matrix of viral relative abundances for alpha and beta diversity estimates.

Results: Assemblers designed for metagenomes, specifically metaSPAdes, MEGAHIT, and IDBA-UD, were the most effective at assembling viromes. Read pre-processing, such as partitioning, had virtually no impact on assembly output, but may be useful when hardware is limited. Viral populations with 2–5× coverage typically assembled well, whereas lower coverage led to fragmented assembly. Strain heterogeneity within populations hampered assembly, especially when strains were closely related (average nucleotide identity, or ANI, ≥97%) and when the most abundant strain represented <50% of the population. Viral community composition assessments based on read recruitment were generally accurate when the following thresholds for detection were applied: (i) ≥10 kb contig lengths to define populations, (ii) coverage defined from reads mapping at ≥90% identity, and (iii) ≥75% of contig length with ≥1× coverage. Finally, although data are limited to the most abundant viruses in a community, alpha and beta diversity patterns were robustly estimated (±10%) when comparing samples of similar sequencing depth, but were more divergent (up to 80%) when sequencing depth was uneven across the dataset. In the latter cases, the use of normalization methods specifically developed for metagenomes provided the best estimates.

Conclusions: These simulations provide benchmarks for selecting analysis cut-offs and establish that an optimized sample-to-ecological-inference viromics pipeline is robust for making ecological inferences from natural viral communities. Continued development to better access RNA, rare, and/or diverse viral populations, together with improved reference viral genome availability, will alleviate many of viromics' remaining limitations.

Keywords: Assembly; Benchmarks; Metagenome; Viral ecology; Virome; Virus.
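
The three detection thresholds reported in the Results can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Alignment` record, the function names, and the per-position coverage array are assumptions introduced here, and a real workflow would read alignments from a mapper's output instead of building them by hand.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract.
MIN_CONTIG_LEN = 10_000    # (i) contigs of >=10 kb define populations
MIN_READ_IDENTITY = 90.0   # (ii) only reads mapping at >=90% identity count
MIN_BREADTH = 0.75         # (iii) >=75% of contig length covered >=1x


@dataclass
class Alignment:
    """Hypothetical record for one read mapped to a contig."""
    start: int       # 0-based, inclusive
    end: int         # 0-based, exclusive
    identity: float  # percent identity of the read to the contig


def covered_fraction(contig_len, alignments, min_identity=MIN_READ_IDENTITY):
    """Fraction of contig positions covered by at least one qualifying read."""
    covered = [False] * contig_len
    for aln in alignments:
        if aln.identity < min_identity:
            continue  # read is too divergent to count toward coverage
        for pos in range(max(aln.start, 0), min(aln.end, contig_len)):
            covered[pos] = True
    return sum(covered) / contig_len


def is_detected(contig_len, alignments):
    """Apply the length threshold, then the breadth-of-coverage threshold."""
    if contig_len < MIN_CONTIG_LEN:
        return False
    return covered_fraction(contig_len, alignments) >= MIN_BREADTH
```

For example, a 10 kb contig with qualifying reads over its first 8,000 positions has a covered fraction of 0.8 and counts as detected, while the same contig covered over only 7,000 positions does not.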


Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. Influence of assembly software and read curation on genome recovery.
All plots display the input coverage on the x-axis, and either the cumulated genome recovery across all contigs (A & C) or the highest genome recovery by a single contig (B & D) on the y-axis. (A & B) display a comparison of assemblers applied to quality-controlled (QC) reads. (C & D) present a comparison of read pre-processing methods, all assembled with metaSPAdes. Comparable plots for reads assembled with the other assemblers are available in Fig. S5.
Figure 2. Types and frequency of errors observed in genome assembly from viral metagenomes.
(A) Percentage of chimeric contigs (i.e., contigs originating from two distinct genomes) across all assembled sequences, by assembler (x-axis) and read curation method (colors). (B) Percentage of chimeric contigs among large (≥10 kb) contigs, by assembler (x-axis) and read curation method (colors). (C) Percentage of false-positive circular contigs, i.e., contigs identified as circular (matching 5′ and 3′ ends) but representing 95% or less of the original genome, by assembler (x-axis) and read curation method (color). (D) Impact of strain heterogeneity (i.e., presence of multiple strains from the same population) on the assembly efficiency. These tests were computed on one mock community (Sample_1), for which each reference genome was replaced with a set of related strains with varying divergence and relative abundances. The y-axis represents the ratio between the largest contig assembled for a genome when strain heterogeneity is introduced and the same parameter without strain heterogeneity (i.e., previous assemblies of the same Sample_1). Populations are grouped based on the two main parameters explaining assembly inefficiency: proportion of the most abundant strain in the population (C, D) and divergence of strains in the population (A, B). Data presented here include assemblies from QC reads with IDBA-UD, MEGAHIT, and metaSPAdes, while the full set of parameters and approaches tested are presented in Fig. S6.
Figure 3. Impact of read mapping thresholds on accuracy of viral population detection.
Two parameters were investigated when parsing the mapping of individual virome reads to the pool of population contigs: (i) the percentage of a contig that must be covered by reads from a sample for the contig to be considered detected (x-axis), and (ii) the percentage identity of reads mapping to the contig (color scale). Two pools of population contigs were tested: all non-redundant contigs of ≥500 bp (A–C), and all non-redundant contigs of ≥10 kb (D–F). Three metrics were calculated to evaluate the impact of read mapping thresholds. The detection sensitivity is estimated as the percentage of “expected” genomes (i.e., genomes covered ≥1× in the sample) that were detected through mapping to population contigs (A and D). The false-discovery rate corresponds to the percentage of contigs that were detected in a sample through mapping to population contigs but were not associated with any genome from the initial sample (i.e., these genomes did not provide any reads to the simulated virome, so these contigs should not be detected; B and E). Finally, the average number of distinct population contigs detected is calculated for each individual genome initially covered ≥1×, and corresponds to the number of times a single genome is “counted” (i.e., multiple contigs suggest multiple populations, even though there is really just one population; C and F).
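
The three metrics defined in this caption can be sketched as follows. This is an illustrative reading of the definitions, not the authors' code: the function name and its inputs (a set of expected genomes, a set of detected contigs, and a contig-to-genome mapping known from the simulation) are assumptions, as is the choice to average the contig count over detected genomes only.

```python
def detection_metrics(expected_genomes, detected_contigs, contig_to_genome):
    """Return (sensitivity %, false-discovery rate %, mean contigs per genome).

    expected_genomes: genome ids covered >=1x in the simulated sample
    detected_contigs: contig ids that passed the mapping thresholds
    contig_to_genome: contig id -> id of the genome it was assembled from
    """
    expected = set(expected_genomes)
    contigs_per_genome = {}
    false_contigs = 0
    for contig in detected_contigs:
        genome = contig_to_genome[contig]
        if genome in expected:
            contigs_per_genome[genome] = contigs_per_genome.get(genome, 0) + 1
        else:
            # The source genome contributed no reads, so this is a false hit.
            false_contigs += 1

    n_detected = len(detected_contigs)
    sensitivity = 100 * len(contigs_per_genome) / len(expected) if expected else 0.0
    fdr = 100 * false_contigs / n_detected if n_detected else 0.0
    # Assumption: average over genomes with at least one detected contig.
    redundancy = (
        sum(contigs_per_genome.values()) / len(contigs_per_genome)
        if contigs_per_genome
        else 0.0
    )
    return sensitivity, fdr, redundancy
```

A redundancy value above 1 indicates that some genomes are being counted as several populations, which is the effect panels C and F are meant to expose.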
Figure 4. Estimation of alpha and beta diversity from virome-derived viral populations.
To evaluate the impact of varying sequencing depth, six viromes (highlighted in bold in A–C) were sub-sampled at 10% (long dash) or 1% (short dash) of the original read number (“Initial” corresponds to the assemblies presented in Figs. 1–3, for which all viromes had the same initial number of reads). (A) Number of genomes observed from the read mapping to viral populations. The actual number of genomes in the initial simulated community is indicated with black dots, while estimates based on viromes are colored in red. (B) Comparison of Shannon diversity index from the true community composition (black dots) and estimated from the viromes (colored dots). The different estimations are based on three normalization methods: counts divided by the total number of reads sequenced in the virome and the contig size (“Normalized”), counts after rarefying all viromes to the smallest dataset and normalized by contig size (“Rarefied”), and counts normalized via DESeq (“DESeq”). (C) Comparison of Simpson diversity index from the true community composition and estimated from the viromes (color codes are the same as in B). (D) Distribution of differences in Bray–Curtis dissimilarities between samples calculated from true community composition and the same dissimilarities estimated from the virome analysis. The different normalization methods (x-axis) are as follows: counts divided by genome size (“Counts”), counts rarefied to the smallest dataset and normalized by contig size (“Rarefied”), counts divided by the total number of reads sequenced in the library and the contig size (“Normalized”), counts normalized by metagenomeSeq (“MGSeq”), EdgeR (“RPKM”), and DESeq (“DESeq”). (E) Distribution of differences in Bray–Curtis dissimilarities between samples calculated from true community composition and the same dissimilarities estimated from virome analysis, including six samples sequenced at 10%. Methods are as in (D). (F) Distribution of differences in Bray–Curtis dissimilarities between samples calculated from true community composition and from virome analysis, including six samples sequenced at 1%. Methods are as in (D).
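
The “Normalized” abundance and the diversity measures named in this caption can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the function names are introduced here, Shannon uses the natural logarithm, and Simpson is computed in its Gini–Simpson form (1 minus the sum of squared proportions); the caption does not specify which variants the authors used.

```python
import math


def normalize(counts, contig_lengths, total_reads):
    """'Normalized' abundance: read count / total reads in virome / contig size."""
    return {c: counts[c] / total_reads / contig_lengths[c] for c in counts}


def proportions(abundances):
    """Convert abundances to proportions, dropping zero entries."""
    total = sum(abundances.values())
    if total == 0:
        return {}
    return {k: v / total for k, v in abundances.items() if v > 0}


def shannon(abundances):
    """Shannon index H' = -sum(p * ln p). Assumption: natural log."""
    return -sum(p * math.log(p) for p in proportions(abundances).values())


def simpson(abundances):
    """Gini-Simpson index 1 - sum(p^2). Assumption: this variant."""
    return 1 - sum(p * p for p in proportions(abundances).values())


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    keys = set(a) | set(b)
    numerator = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    denominator = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return numerator / denominator if denominator else 0.0
```

Comparing these values between the true community profile and the virome-derived profile gives the kind of differences plotted in panels B–F; the rarefaction, metagenomeSeq, EdgeR, and DESeq normalizations are not reproduced here.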
