mSystems. 2018 Apr 10;3(3):e00039-18.
doi: 10.1128/mSystems.00039-18. eCollection 2018 May-Jun.

Nonpareil 3: Fast Estimation of Metagenomic Coverage and Sequence Diversity

Luis M Rodriguez-R et al. mSystems.

Abstract

Estimations of microbial community diversity based on metagenomic data sets are affected, often to an unknown degree, by biases derived from insufficient coverage and reference database-dependent estimations of diversity. For instance, the completeness of reference databases cannot be generally estimated since it depends on the extant diversity sampled to date, which, with the exception of a few habitats such as the human gut, remains severely undersampled. Further, estimation of the degree of coverage of a microbial community by a metagenomic data set is prohibitively time-consuming for large data sets, and coverage values may not be directly comparable between data sets obtained with different sequencing technologies. Here, we extend Nonpareil, a database-independent tool for the estimation of coverage in metagenomic data sets, to a high-performance computing implementation that scales up to hundreds of cores and includes, in addition, a k-mer-based estimation as sensitive as the original alignment-based version but about three hundred times as fast. Further, we propose a metric of sequence diversity (Nd) derived directly from Nonpareil curves that correlates well with alpha diversity assessed by traditional metrics. We use this metric in different experiments demonstrating the correlation with the Shannon index estimated on 16S rRNA gene profiles and show that Nd additionally reveals seasonal patterns in marine samples that are not captured by the Shannon index and more precise rankings of the magnitude of diversity of microbial communities in different habitats. Therefore, the new version of Nonpareil, called Nonpareil 3, advances the toolbox for metagenomic analyses of microbiomes.
IMPORTANCE Estimation of the coverage provided by a metagenomic data set, i.e., what fraction of the microbial community was sampled by DNA sequencing, represents an essential first step of every culture-independent genomic study that aims to robustly assess the sequence diversity present in a sample. However, estimation of coverage remains elusive because of several technical limitations associated with high computational requirements and limiting statistical approaches to quantify diversity. Here we describe Nonpareil 3, a new bioinformatics algorithm that circumvents several of these limitations and thus can facilitate culture-independent studies in clinical or environmental settings, independent of the sequencing platform employed. In addition, we present a new metric of sequence diversity based on rarefied coverage and demonstrate its use in communities from diverse ecosystems.

Keywords: bioinformatics; coverage; metagenomics; microbial ecology.

Figures

FIG 1
Speedup per number of processors in Nonpareil. Nonpareil estimations of the LL_1007B data set (3) were performed with alignment kernel and default parameters in multiple processors of a node (light blue), a single processor of multiple nodes (pink), and four processors of multiple nodes (dark green). The base time (one processor of one node) was 82.98 h.
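
The speedup plotted in FIG 1 is, under the usual definition, the single-processor run time divided by the parallel run time. The sketch below assumes that definition and adds parallel efficiency (speedup divided by processor count) for illustration. Only the 82.98 h base time comes from the caption; the other timings and the function names are hypothetical.

```python
BASE_HOURS = 82.98  # one processor of one node, as stated in the FIG 1 caption


def speedup(base_hours: float, parallel_hours: float) -> float:
    """Return how many times faster the parallel run is than the baseline."""
    if base_hours <= 0 or parallel_hours <= 0:
        raise ValueError("run times must be positive")
    return base_hours / parallel_hours


def efficiency(base_hours: float, parallel_hours: float, processors: int) -> float:
    """Return speedup per processor (1.0 would be ideal linear scaling)."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup(base_hours, parallel_hours) / processors


if __name__ == "__main__":
    # Hypothetical wall-clock times (hours) keyed by processor count.
    # These are illustrative values, not measurements from the paper.
    hypothetical_runs = {1: 82.98, 8: 11.50, 64: 1.90}
    for procs, hours in hypothetical_runs.items():
        s = speedup(BASE_HOURS, hours)
        e = efficiency(BASE_HOURS, hours, procs)
        print(f"{procs:>3} processors: speedup {s:6.2f}x, efficiency {e:5.2f}")
```

Replacing the hypothetical timings with measured ones gives the kind of curve the caption describes; an efficiency well below 1.0 at high processor counts would indicate diminishing returns.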
FIG 2
Comparison of Nonpareil Nd sequence diversity and 16S rRNA gene OTU Shannon H′ taxonomic diversity indices on 90 metagenomes. Each data point represents the estimates on Nd (x axis) and H′ (y axis). The y-axis value of each point indicates the Bayesian analysis-corrected Shannon index, and the line extending from the low part of each data point represents the exact observed (maximum-likelihood) Shannon index. The color of each point indicates the type of biome of each data set, the shape indicates the sequencing platform, and the size indicates the estimated coverage of the 16S rRNA gene profile (Turing-Good estimate). For each biome, the IQR of both estimates is represented as semitransparent rectangles. The least-squares linear correlation model is represented in gray, including the central estimate (solid line), the 95% confidence interval (dashed-line band), and the 80 and 95% prediction intervals (dotted-line bands). Labeled data sets fell outside the 80% prediction interval. The inset shows the residuals from the linear model against the Turing-Good estimate of 16S rRNA gene coverage.
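
Two of the quantities in the FIG 2 caption can be computed directly from an OTU count table. The sketch below shows the observed (maximum-likelihood) Shannon index H′ and a coverage estimate of the form 1 − n1/N, where n1 is the number of singleton OTUs and N the total read count. It assumes that this standard form is what the caption calls the Turing-Good estimate. The Bayesian correction of H′ and the Nd index itself are not reproduced here, and the function names and example counts are hypothetical.

```python
import math
from typing import Iterable, List


def _positive_counts(counts: Iterable[int]) -> List[int]:
    """Keep only OTUs that were observed at least once."""
    kept = [c for c in counts if c > 0]
    if not kept:
        raise ValueError("at least one OTU must have a positive count")
    return kept


def shannon_index(counts: Iterable[int]) -> float:
    """Observed Shannon index H' = -sum(p_i * ln p_i) over OTU proportions."""
    kept = _positive_counts(counts)
    total = sum(kept)
    return -sum((c / total) * math.log(c / total) for c in kept)


def singleton_coverage(counts: Iterable[int]) -> float:
    """Coverage estimate 1 - n1/N (n1 = singleton OTUs, N = total reads)."""
    kept = _positive_counts(counts)
    total = sum(kept)
    singletons = sum(1 for c in kept if c == 1)
    return 1.0 - singletons / total


if __name__ == "__main__":
    # Hypothetical OTU counts for one 16S rRNA gene profile.
    otu_counts = [120, 45, 30, 8, 3, 1, 1, 1]
    print(f"H' = {shannon_index(otu_counts):.3f}")
    print(f"coverage = {singleton_coverage(otu_counts):.3f}")
```

A profile with many singleton OTUs gets a low coverage value, which is the situation in which the caption's inset compares the model residuals against 16S rRNA gene coverage.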

References

    1. Rodriguez-R LM, Konstantinidis KT. 2015. Estimating coverage in metagenomic data sets and why it matters. ISME J 9:1053–1061. doi:10.1038/ismej.2014.207.
    2. Rodriguez-R LM, Castro JC, Kyrpides NC, Cole JR, Tiedje JM, Konstantinidis KT. 2018. How much do rRNA gene surveys underestimate extant bacterial diversity? Appl Environ Microbiol 84:e00014-18. doi:10.1128/AEM.00014-18.
    3. Rodriguez-R LM, Konstantinidis KT. 2014. Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics 30:629–635. doi:10.1093/bioinformatics/btt584.
    4. Tamames J, de la Peña S, de Lorenzo V. 2012. COVER: a priori estimation of coverage for metagenomic sequencing. Environ Microbiol Rep 4:335–341. doi:10.1111/j.1758-2229.2012.00338.x.
    5. Větrovský T, Baldrian P. 2013. The variability of the 16S rRNA gene in bacterial genomes and its consequences for bacterial community analyses. PLoS One 8:e57923. doi:10.1371/journal.pone.0057923.