ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

PLoS Comput Biol. 2017 Apr 24;13(4):e1005518. doi: 10.1371/journal.pcbi.1005518. eCollection 2017 Apr.

Yunpeng Cai¹, Wei Zheng², Jin Yao³, Yujie Yang¹, Volker Mai⁴, Qi Mao³, Yijun Sun²,³,⁵

Affiliations

¹ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.
² Department of Computer Science and Engineering, The State University of New York at Buffalo, Buffalo, New York, United States of America.
³ Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America.
⁴ Department of Epidemiology, University of Florida, Gainesville, Florida, United States of America.
⁵ Department of Biostatistics, The State University of New York at Buffalo, Buffalo, New York, United States of America.

PMID: 28437450
PMCID: PMC5421816
DOI: 10.1371/journal.pcbi.1005518

Yunpeng Cai et al. PLoS Comput Biol. 2017.


Abstract

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper, we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. (a) A toy example of a PBP tree and (b) its corresponding partitioning of a dataset and a space.**
The colors indicate different levels of nodes and their corresponding hyper-spheres. The leaf nodes are omitted in the tree. When searching for the nearest neighbor of a point (large red dot), only a small number of sibling hyper-spheres (filled circles) need to be explored.
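
The pruning idea in this caption can be illustrated with a minimal sketch. It is not the authors' PBP tree: it assumes a single level of hyper-spheres, uses Hamming distance on equal-length strings as a stand-in metric, and the names `Ball`, `build_balls`, and `nearest` are hypothetical. A sphere is skipped whenever the triangle-inequality lower bound `d(query, center) - radius` is no better than the best match found so far.

```python
from dataclasses import dataclass, field


def hamming(a: str, b: str) -> int:
    """Toy metric on equal-length strings (assumed stand-in for a sequence distance)."""
    return sum(x != y for x, y in zip(a, b))


@dataclass
class Ball:
    """One hyper-sphere: a center, its member sequences, and the covering radius."""
    center: str
    members: list = field(default_factory=list)
    radius: int = 0


def build_balls(seqs, centers):
    """Assign each sequence to its closest center and track each ball's radius."""
    balls = [Ball(c) for c in centers]
    for s in seqs:
        ball = min(balls, key=lambda b: hamming(s, b.center))
        ball.members.append(s)
        ball.radius = max(ball.radius, hamming(s, ball.center))
    return balls


def nearest(query, balls):
    """Return (best sequence, its distance, number of balls actually scanned)."""
    best, best_d, visited = None, float("inf"), 0
    # Lower bound on the distance from the query to any member of each ball.
    bounds = sorted(
        (max(0, hamming(query, b.center) - b.radius), i) for i, b in enumerate(balls)
    )
    for lower_bound, i in bounds:
        if lower_bound >= best_d:
            break  # no remaining ball can contain a closer sequence
        visited += 1
        for s in balls[i].members:
            d = hamming(query, s)
            if d < best_d:
                best, best_d = s, d
    return best, best_d, visited


seqs = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "GGCC"]
balls = build_balls(seqs, centers=["AAAA", "GGGG"])
result = nearest("AAAC", balls)
```

In this toy run only the sphere around `AAAA` is scanned; the sphere around `GGGG` is pruned because its lower bound already exceeds the best distance found. A real tree would apply the same test recursively at every level.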

**Fig 2. A toy example illustrating single-point (left) and multiple-point (right) hierarchical clustering by parallelizing uncorrelated operations.**
Each filled box represents a sequence and each circle represents a cluster-merging step. The numbers in the circle denote the order of merging operations. The merging orders may change when switching from single-point to multi-point clustering.

**Fig 3. Execution time of ESPRIT-Forest performed on a human gut microbiome dataset using a varying number of CPU cores ranging from 1 to 128.**
The clustering termination criterion was set to 85% sequence similarity. For comparison, the execution time of ESPRIT-Tree is also reported.

**Fig 4. Comparison of clustering quality of ESPRIT-Forest, ESPRIT-Tree and UPARSE performed on benchmark datasets using the species annotation as ground truth.**
(a) NMI scores calculated on human gut V2 dataset. (b) NMI scores calculated on human gut V6 dataset. (c) NMI scores calculated on ELDERMET dataset. (d) NMI scores calculated on HMP Saliva dataset.
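
For readers unfamiliar with the NMI score used in these panels, the following sketch computes normalized mutual information between two labelings. It assumes the common normalization 2·I(U;V) / (H(U) + H(V)); the paper may normalize differently, and the function names are illustrative only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (nij / n) * log(n * nij / (count_a[i] * count_b[j]))
        for (i, j), nij in joint.items()
    )


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the two entropies (assumed variant)."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        return 1.0  # both labelings put everything in one cluster
    return 2 * mutual_information(a, b) / (ha + hb)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical up to label names
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # labelings share no information
```

The score depends only on how items are grouped, not on the label names, so a clustering that matches the species annotation exactly scores 1 regardless of how clusters are numbered.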

**Fig 5. Comparison of clustering quality of ESPRIT-Tree (red) and ESPRIT-Forest (blue) on various distance cut-offs on the human gut V2 dataset.**
The results of the two algorithms agree, with small variations caused by randomness in clustering.

Cited by

MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.
Asgari E, Garakani K, McHardy AC, Mofrad MRK. Bioinformatics. 2018 Jul 1;34(13):i32-i42. doi: 10.1093/bioinformatics/bty296. PMID: 29950008.
DMSC: A Dynamic Multi-Seeds Method for Clustering 16S rRNA Sequences Into OTUs.
Wei ZG, Zhang SW. Front Microbiol. 2019 Mar 12;10:428. doi: 10.3389/fmicb.2019.00428. eCollection 2019. PMID: 30915052.
Accurately clustering biological sequences in linear time by relatedness sorting.
Wright E. Nat Commun. 2024 Apr 8;15(1):3047. doi: 10.1038/s41467-024-47371-9. PMID: 38589369.
Alignment-free comparison of metagenomics sequences via approximate string matching.
Chen J, Yang L, Li L, Goodison S, Sun Y. Bioinform Adv. 2022 Oct 21;2(1):vbac077. doi: 10.1093/bioadv/vbac077. eCollection 2022. PMID: 36388153.
A parallel computational framework for ultra-large-scale sequence clustering analysis.
Zheng W, Mao Q, Genco RJ, Wactawski-Wende J, Buck M, Cai Y, Sun Y. Bioinformatics. 2019 Feb 1;35(3):380-388. doi: 10.1093/bioinformatics/bty617. PMID: 30010718.

Grants and funding

R01 AI125982/AI/NIAID NIH HHS/United States
