Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns

Affiliations

¹ Department of Pediatrics, University of California San Diego, La Jolla, California, USA.
² Department of Pediatrics, University of California San Diego, La Jolla, California, USA; Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA.
³ Department of Applied Mathematics, and Interdisciplinary Quantitative Biology Graduate Program, University of Colorado Boulder, Boulder, Colorado, USA.
⁴ Department of Pediatrics, University of California San Diego, La Jolla, California, USA; Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA; Center for Microbiome Innovation, University of California San Diego, San Diego, California, USA.

PMID: 28289731
PMCID: PMC5340863
DOI: 10.1128/mSystems.00191-16

Amnon Amir et al. mSystems. 2017 Mar 7;2(2):e00191-16. doi: 10.1128/mSystems.00191-16. eCollection 2017 Mar-Apr.

Abstract

High-throughput sequencing of 16S ribosomal RNA gene amplicons has facilitated understanding of complex microbial communities, but the inherent noise in PCR and DNA sequencing limits differentiation of closely related bacteria. Although many scientific questions can be addressed with broad taxonomic profiles, clinical, food safety, and some ecological applications require higher specificity. Here we introduce a novel sub-operational-taxonomic-unit (sOTU) approach, Deblur, that uses error profiles to obtain putative error-free sequences from Illumina MiSeq and HiSeq sequencing platforms. Deblur substantially reduces computational demands relative to similar sOTU methods and does so with similar or better sensitivity and specificity. Using simulations, mock mixtures, and real data sets, we detected closely related bacterial sequences with single nucleotide differences while removing false positives and maintaining stability in detection, suggesting that Deblur is limited only by read length and diversity within the amplicon sequences. Because Deblur operates on a per-sample level, it scales to modern data sets and meta-analyses. To highlight Deblur's ability to integrate data sets, we include an interactive exploration of its application to multiple distinct sequencing rounds of the American Gut Project. Deblur is open source under the Berkeley Software Distribution (BSD) license, easily installable, and downloadable from https://github.com/biocore/deblur.

IMPORTANCE Deblur provides a rapid and sensitive means to assess ecological patterns driven by differentiation of closely related taxa. This algorithm provides a solution to the problem of identifying real ecological differences between taxa whose amplicons differ by a single base pair, is applicable in an automated fashion to large-scale sequencing data sets, and can integrate sequencing runs collected over time.

Keywords: DNA sequencing; microbiome.

Figures

**FIG 1**
A principal-coordinate analysis plot of UniFrac distances from *de novo* OTUs as visualized by Emperor. A subset of American Gut Project samples spanning sequencing centers and rounds were selected. UCLUST (3) was run independently per round via QIIME. The resulting OTU tables were merged, normalizing sequencing identifiers (IDs) such that if the same sequence was observed in multiple rounds it would receive the same ID. Observations with fewer than 10 counts were dropped. The data were rarefied to 5,000 sequences per sample. The plot shown is based on unweighted UniFrac distances, and the samples are colored by the sequencing center. An interactive visualization can be viewed at https://nbviewer.jupyter.org/github/knightlab-analyses/deblur-manuscript/blob/master/embedded_figure_1.ipynb; the coloring used in the static image can be done by selecting “run_center” as the scatter field. CU, University of Colorado Boulder; ANL, Argonne National Laboratory; UCSD, University of California San Diego.
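The two table-preparation steps named in this caption (dropping observations with fewer than 10 counts, then rarefying to 5,000 sequences per sample) can be illustrated with a minimal sketch. This is not the authors' pipeline, which ran through QIIME; the function names and the nested-dictionary table layout are hypothetical, and it assumes that "fewer than 10 counts" refers to an observation's total across all samples and that samples below the rarefaction depth are discarded.

```python
import random
from collections import Counter


def drop_rare_observations(table, min_count=10):
    """Drop observations whose total count over all samples is below min_count.

    `table` maps sample ID -> {observation ID -> count} (hypothetical layout).
    Assumption: the threshold applies to the total across samples.
    """
    totals = Counter()
    for counts in table.values():
        totals.update(counts)  # Counter.update adds counts from a mapping
    keep = {obs for obs, n in totals.items() if n >= min_count}
    return {
        sample: {obs: c for obs, c in counts.items() if obs in keep}
        for sample, counts in table.items()
    }


def rarefy(table, depth=5000, seed=0):
    """Subsample each sample to exactly `depth` sequences without replacement.

    Assumption: samples with fewer than `depth` sequences are discarded.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample, counts in table.items():
        # Expand counts into one entry per sequence, then draw `depth` of them.
        pool = [obs for obs, c in counts.items() for _ in range(c)]
        if len(pool) < depth:
            continue
        rarefied[sample] = dict(Counter(rng.sample(pool, depth)))
    return rarefied
```

Rarefying after filtering means every retained sample contributes the same number of sequences to the unweighted UniFrac comparison, so differences in sequencing depth do not drive the ordination.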

**FIG 2**
Benchmarks of OTU picking tools on artificial communities. (A) A simulation was performed on the basis of samples from a real fecal community (11) using the 52 most abundant bacterial species identified in this study. Reads were then simulated using an ART Illumina (12) read simulator. OTU picking was performed on these simulated reads using UNOISE2, DADA2, and Deblur. The relative abundances predicted by each of these tools and the ground truth (GT) are shown in the heat map. The dendrogram was built using hierarchical clustering based on the Hamming distance between the sequences, with numbers indicating sequence similarity (log scale). (B) Simulated communities with various levels of sequence-sequence similarity. Unweighted UniFrac distances of the predicted OTUs from UNOISE2, DADA2, and Deblur were compared to those of the original composition of the simulated communities. The x axis denotes the similarity radius for each community. The shaded area denotes the standard error of the mean distance estimation (based on 10 random repeats per community). (C) Similar to panel B but with the ratio of observed OTUs (predicted by UNOISE2, DADA2, and Deblur) to actual OTUs in each simulation indicated. (D) Performance of Deblur, UNOISE2, and DADA2 on the even1 community from mock-3 (14). GT data denote the expected ground truth relative frequency for each sOTU as informed by the design of the mock community. Dendrograms and colors are the same as described for panel A.
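The dendrograms in panels A and D are described as built from the Hamming distance between sequences. As a minimal sketch of that distance only (the helper names are hypothetical, and the hierarchical clustering step itself is not shown), the pairwise mismatch counts for equal-length reads could be computed as follows.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(seqs):
    """Return {(i, j): distance} for every pair of indices i < j in seqs."""
    return {
        (i, j): hamming(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
```

A distance of 1 between two sequences corresponds to the single-nucleotide differences that the abstract says Deblur can resolve.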

**FIG 3**
Benchmarks of OTU picking tools on natural communities. (A) Stability analysis on experimental technical repeats. Data indicate fractions of overlapping sOTUs from two technical replicates in all OTUs as a function of the minimal frequency threshold present in one of the repeats. (B and C) Application of Deblur in the howler monkey data set. (B) Fraction of sequences matching entries in the NCBI nr/nt database (as of 1 December 2016) with 0, 1, or 2 mismatches (red, green, or blue, respectively) from sOTUs unique to Deblur or to DADA2 or present in both (left to right). (C) Heat maps showing sOTUs (rows) in common with Deblur and DADA2, as well as those unique to Deblur and DADA2 (bottom, middle, and top rows, respectively). Samples (columns) are sorted by species and habitat. A total of 200 sOTUs per group (i.e., common, unique to Deblur, or unique to DADA2) were randomly selected for visualization purposes. (D) Single-threaded runtime comparison of Deblur, DADA2, and UNOISE2 against one of the stability MiSeq runs at increasing numbers of samples.

**FIG 4**
A principal-coordinate analysis plot of UniFrac distances from Deblur as visualized by Emperor. A subset of American Gut Project samples spanning sequencing centers and rounds were selected. Each sample was processed separately by Deblur. Observations with fewer than 10 counts were dropped. The data were rarefied to 5,000 sequences per sample. The plot shown is based on unweighted UniFrac distances and is colored according to the round of sequencing in the American Gut Project (AG). An interactive visualization can be viewed at https://nbviewer.jupyter.org/github/knightlab-analyses/deblur-manuscript/blob/master/embedded_figure_4.ipynb; the coloring used in the static image can be made by selecting the “center_project_name” as the scatter field.

**FIG 5**
A principal-coordinate analysis plot of UniFrac distances from UNOISE2 as visualized by Emperor. A subset of American Gut Project samples spanning sequencing centers and rounds were selected. UNOISE2 was run independently per round. The resulting sOTU tables were merged, normalizing sequencing IDs such that if the same sequence were observed in multiple rounds it would receive the same ID. Observations with fewer than 10 counts were dropped. The data were rarefied to 5,000 sequences per sample. The plot shown is based on unweighted UniFrac distances and is colored according to the round of sequencing in the American Gut Project. An interactive visualization can be viewed at https://nbviewer.jupyter.org/github/knightlab-analyses/deblur-manuscript/blob/master/embedded_figure_5.ipynb; the coloring used in the static image can be made by selecting the “center_project_name” as the scatter field. The static shot is oriented to show PC1 versus PC2, and the separation is more pronounced if orienting the projection to look at PC2 versus PC3.

References

1. Glenn TC. 2011. Field guide to next-generation DNA sequencers. Mol Ecol Resour 11:759–769. doi: 10.1111/j.1755-0998.2011.03024.x.
2. Schloss PD, Handelsman J. 2005. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol 71:1501–1506. doi: 10.1128/AEM.71.3.1501-1506.2005.
3. Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461. doi: 10.1093/bioinformatics/btq461.
4. Rideout JR, He Y, Navas-Molina JA, Walters WA, Ursell LK, Gibbons SM, Chase J, McDonald D, Gonzalez A, Robbins-Pianka A, Clemente JC, Gilbert JA, Huse SM, Zhou HW, Knight R, Caporaso JG. 2014. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ 2:e545. doi: 10.7717/peerj.545.
5. Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. 2011. Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12:38. doi: 10.1186/1471-2105-12-38.
