. 2008 Jul 1;24(13):i41-9.

doi: 10.1093/bioinformatics/btn174.

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Yaniv Loewenstein¹, Elon Portugaly, Menachem Fromer, Michal Linial

Affiliations

PMID: 18586742
PMCID: PMC2718652
DOI: 10.1093/bioinformatics/btn174

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Yaniv Loewenstein et al. Bioinformatics. 2008.

. 2008 Jul 1;24(13):i41-9.

doi: 10.1093/bioinformatics/btn174.

Authors

Yaniv Loewenstein¹, Elon Portugaly, Menachem Fromer, Michal Linial

Affiliation

¹ School of Computer Science and Engineering, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel.

PMID: 18586742
PMCID: PMC2718652
DOI: 10.1093/bioinformatics/btn174

Abstract

Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets.

Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation.

Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families.

Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.

PubMed Disclaimer

Figures

**Fig. 1.**
Illustration of non-metric constraints imposed by BLAST similarities (edges) for three multidomain Swiss-Prot sequences. False transitivity is possible due to CSKP_HUMAN. The missing triangle edge is due to no biological relatedness and no sequence similarity. It violates the triangle inequality since e_jl ≰e_il+e_ij. Triangle-offending missing edges prevent UPGMA from falsely clustering unrelated proteins together, by increasing average cluster dissimilarities.

**Fig. 2.**
Sparse UPGMA. E is typically implemented as a minimum heap data structure, extended with logarithmic time removal of non-minimum elements. The output tree and cluster heights are given by e_ijs. The algorithm is correct for non-sparse inputs as well.

**Fig. 3.**
Multi-round MC-UPGMA scheme.

**Fig. 4.**
Multi-round MC-UPGMA clustering—the clusterer. This is a modified version of the code for *Sparse-UPGMA* (Fig. 2).

**Fig. 5.**
Illustration of *Multi-Round MC-UPGMA*. FINDANCESTOR uses a suitable data structure for disjoint sets to allow efficient parent look-up. LOOKAHEAD peeks at unloaded edges on disk, and loads only components of currently maintained edges. Afterwards, uncertain edges can be inferred as definitively missing. See text for definition of and ⊕ notations.

formula image — **Fig. 5.**
Illustration of *Multi-Round MC-UPGMA*. FINDANCESTOR uses a suitable data structure for disjoint sets to allow efficient parent look-up. LOOKAHEAD peeks at unloaded edges on disk, and loads only components of currently maintained edges. Afterwards, uncertain edges can be inferred as definitively missing. See text for definition of and ⊕ notations.

**Fig. 6.**
Sensitivity of the best cluster for single-linkage (x-axis) versus MC-UPGMA (y-axis) for large protein families—InterPro domains and families with at least 100 UniRef90 representatives. Diagonal line represents identity (x =y). Points above diagonal correspond to higher sensitivity for MC-UPGMA and vice versa. Average specificities are comparable across this set (MC-UPGMA= 0.90±0.15, single-linkage= 0.88±0.18).

**Fig. 7.**
Illustration of thick edges (grey) connecting clusters (blue), and thin input edges (red and black). A thick edge calculation for a pair of clusters C_i and C_j is demonstrated based on three possible components: known (≤λ) thin edges (black), unknown thin edges (due to the memory constraint) that do exist (red, dashed lines), and possible edges connecting C_i and C_j members but do not exist in the sparse input graph (green). Corresponding components in the calculation are color coded accordingly. It is impossible to distinguish non-existing edges from unknown edges without reading the whole input graph. The cluster graph is broken into two connected components in , as long as d_7,9 is missing.

See this image and copyright information in PMC

Cited by

Clustering Acoustic Segments Using Multi-Stage Agglomerative Hierarchical Clustering.
Lerato L, Niesler T. Lerato L, et al. PLoS One. 2015 Oct 30;10(10):e0141756. doi: 10.1371/journal.pone.0141756. eCollection 2015. PLoS One. 2015. PMID: 26517376 Free PMC article.
Ultra-fast sequence clustering from similarity networks with SiLiX.
Miele V, Penel S, Duret L. Miele V, et al. BMC Bioinformatics. 2011 Apr 22;12:116. doi: 10.1186/1471-2105-12-116. BMC Bioinformatics. 2011. PMID: 21513511 Free PMC article.
Radiomics: A Primer on Processing Workflow and Analysis.
Avery E, Sanelli PC, Aboian M, Payabvash S. Avery E, et al. Semin Ultrasound CT MR. 2022 Apr;43(2):142-146. doi: 10.1053/j.sult.2022.02.003. Epub 2022 Feb 12. Semin Ultrasound CT MR. 2022. PMID: 35339254 Free PMC article. Review.
Ultrafast clustering algorithms for metagenomic sequence analysis.
Li W, Fu L, Niu B, Wu S, Wooley J. Li W, et al. Brief Bioinform. 2012 Nov;13(6):656-68. doi: 10.1093/bib/bbs035. Epub 2012 Jul 6. Brief Bioinform. 2012. PMID: 22772836 Free PMC article.
A novel hierarchical clustering algorithm for gene sequences.
Wei D, Jiang Q, Wei Y, Wang S. Wei D, et al. BMC Bioinformatics. 2012 Jul 23;13:174. doi: 10.1186/1471-2105-13-174. BMC Bioinformatics. 2012. PMID: 22823405 Free PMC article.

See all "Cited by" articles

References

1. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. - PMC - PubMed
1. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene ontology consortium. Nat. Genet. 2000;25:25–29. - PMC - PubMed
1. D'haeseleer P. How does gene expression clustering work? Nat. Biotechnol. 2005;23:1499–1501. - PubMed
1. Durbin R, et al. Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1999.
1. Finn R, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34:D247–D251. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Affiliation

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Authors

Affiliation

Abstract

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

Substances

LinkOut - more resources

Full Text Sources