Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2017 Apr 24;13(4):e1005518.
doi: 10.1371/journal.pcbi.1005518. eCollection 2017 Apr.

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

Affiliations

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

Yunpeng Cai et al. PLoS Comput Biol. .

Abstract

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence dataset. Our experiment demonstrated that with the power of parallel computing it is now compu- tationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/∼yijunsun/lab/ESPRIT-Forest.html.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. (a) A toy example of a PBP tree and (b) its corresponding partitioning of a dataset and a space.
The colors indicate different levels of nodes and their corresponding hyper-spheres. The leaf nodes are omitted in the tree. When searching for the nearest neighbor of a point (large red dot), only a small number of sibling hyper-spheres (filled circles) need to be explored.
Fig 2
Fig 2. A toy example illustrating single-point (left) and multiple-point (right) hierarchical clustering by parallelizing uncorrelated operations.
Each filled box represents a sequence and each circle represents a cluster-merging step. The numbers in the circle denote the order of merging operations. The merging orders may change when switching from single-point to multi-point clustering.
Fig 3
Fig 3. Execution time of ESPRIT-Forest performed on a human gut microbiome dataset using a varying number of CPU cores ranging from 1 to 128.
The clustering termination criterion was set to 85% sequence similarity. For comparison, the execution time of ESPRIT-Tree is also reported.
Fig 4
Fig 4. Comparison of clustering quality of ESPRIT-Forest, ESPRIT-Tree and UPARSE performed on benchmark datasets using the species annotation as ground truth.
(a) NMI scores calculated on human gut V2 dataset. (b) NMI scores calculated on human gut V6 dataset. (c) NMI scores calculated on ELDERMET dataset. (d) NMI scores calculated on HMP Saliva dataset.
Fig 5
Fig 5. Comparison of clustering quality of ESPRIT-Tree (red) and ESPRIT-Forest (blue) on various distance cut-offs on the human gut V2 dataset.
We see that the results of both algorithms agrees but with small variations caused by randomness in clustering.

Similar articles

Cited by

References

    1. Sboner A, Mu XJ, Greenbaum D, Auerbach RK, Gerstein MB. The real cost of sequencing: higher than you think! Genome Biology. 2011;12(8):125 10.1186/gb-2011-12-8-125 - DOI - PMC - PubMed
    1. Beerenwinkel N, Zagordi O. Ultra-deep sequencing for the analysis of viral populations. Current Opinion in Virology. 2011;1(5):413–418. 10.1016/j.coviro.2011.07.008 - DOI - PubMed
    1. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, et al. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proceedings of the National Academy of Sciences. 2006;103(32):12115–12120. 10.1073/pnas.0605127103 - DOI - PMC - PubMed
    1. O’Brien HE, Parrent JL, Jackson JA, Moncalvo JM, Vilgalys R. Fungal community analysis by large-scale sequencing of environmental samples. Applied and Environmental Microbiology. 2005;71(9):5544–5550. 10.1128/AEM.71.9.5544-5550.2005 - DOI - PMC - PubMed
    1. López-García P, Rodríguez-Valera F, Pedrós-Alió C, Moreira D. Unexpected diversity of small eukaryotes in deep-sea Antarctic plankton. Nature. 2001;409(6820):603–607. 10.1038/35054537 - DOI - PubMed

Publication types

Substances