Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Aug;39(14):e95.
doi: 10.1093/nar/gkr349. Epub 2011 May 19.

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time

Affiliations

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time

Yunpeng Cai et al. Nucleic Acids Res. 2011 Aug.

Abstract

Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
PBP tree. A PBP tree partitions an input space into a set of non-overlapped hyperspherical regions at various distance levels indicated by circles with different colors (a), and organizes input sequences in a tree-like hierarchical structure (b). Each point in (a) represents a sequence, and the color of a node in (b) corresponds to the color of a circle in (a). For ease of presentation, the leaf nodes are omitted, and a root node is created that includes all descendent nodes. The partition, though not necessarily reflecting the true structure of a data set as shown in (a), can significantly accelerate a clustering process by removing most unnecessary sequence alignment operations and distance computation.
Figure 2.
Figure 2.
(a) NMI scores of four methods evaluated at 10 distance levels. (b) Box-plot of the maximum NMI scores of four methods. The species assignments of input sequences were used as ground truth.
Figure 3.
Figure 3.
(a) NMI scores of four methods evaluated at 14 distance levels. (b) Box-plot of the maximum NMI scores of four methods. The genus assignments of input sequences were used as ground truth.
Figure 4.
Figure 4.
Scalability of ESPRIT-Tree, CDHIT and UCLUST performed on a human gut microbiota data set with a varying number of sequences ranging from 1K to 1.1M. The empirical complexity and confidence interval (CI) are also reported.

Similar articles

Cited by

References

    1. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, et al. The NIH human Microbiome Project. Genome Res. 2009;19:2317–2323. - PMC - PubMed
    1. Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA. Diversity of the human intestinal microbial flora. Science. 2005;308:1635–1638. - PMC - PubMed
    1. Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 2008;4:e1000255. - PMC - PubMed
    1. Fabrice A, Didier R. Exploring microbial diversity using 16S rRNA high-throughput methods. J. Comput. Sci. Syst. Biol. 2009;2:74–92.
    1. Dethlefsen L, Huse S, Sogin ML, Relman DA. The pervasive effects of antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLOS Biol. 2008;6:e280. - PMC - PubMed

Publication types

Substances