Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2008 Jul 1;24(13):i41-9.
doi: 10.1093/bioinformatics/btn174.

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Affiliations

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Yaniv Loewenstein et al. Bioinformatics. .

Abstract

Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets.

Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation.

Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families.

Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Illustration of non-metric constraints imposed by BLAST similarities (edges) for three multidomain Swiss-Prot sequences. False transitivity is possible due to CSKP_HUMAN. The missing triangle edge is due to no biological relatedness and no sequence similarity. It violates the triangle inequality since ejleil+eij. Triangle-offending missing edges prevent UPGMA from falsely clustering unrelated proteins together, by increasing average cluster dissimilarities.
Fig. 2.
Fig. 2.
Sparse UPGMA. E is typically implemented as a minimum heap data structure, extended with logarithmic time removal of non-minimum elements. The output tree and cluster heights are given by eijs. The algorithm is correct for non-sparse inputs as well.
Fig. 3.
Fig. 3.
Multi-round MC-UPGMA scheme.
Fig. 4.
Fig. 4.
Multi-round MC-UPGMA clustering—the clusterer. This is a modified version of the code for Sparse-UPGMA (Fig. 2).
Fig. 5.
Fig. 5.
Illustration of Multi-Round MC-UPGMA. FINDANCESTOR uses a suitable data structure for disjoint sets to allow efficient parent look-up. LOOKAHEAD peeks at unloaded edges on disk, and loads only components of currently maintained edges. Afterwards, uncertain edges can be inferred as definitively missing. See text for definition of formula image and ⊕ notations.
Fig. 6.
Fig. 6.
Sensitivity of the best cluster for single-linkage (x-axis) versus MC-UPGMA (y-axis) for large protein families—InterPro domains and families with at least 100 UniRef90 representatives. Diagonal line represents identity (x =y). Points above diagonal correspond to higher sensitivity for MC-UPGMA and vice versa. Average specificities are comparable across this set (MC-UPGMA= 0.90±0.15, single-linkage= 0.88±0.18).
Fig. 7.
Fig. 7.
Illustration of thick edges (grey) connecting clusters (blue), and thin input edges (red and black). A thick edge calculation for a pair of clusters Ci and Cj is demonstrated based on three possible components: known (≤λ) thin edges (black), unknown thin edges (due to the memory constraint) that do exist (red, dashed lines), and possible edges connecting Ci and Cj members but do not exist in the sparse input graph (green). Corresponding components in the calculation are color coded accordingly. It is impossible to distinguish non-existing edges from unknown edges without reading the whole input graph. The cluster graph is broken into two connected components in formula image, as long as d7,9 is missing.

Similar articles

Cited by

References

    1. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. - PMC - PubMed
    1. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene ontology consortium. Nat. Genet. 2000;25:25–29. - PMC - PubMed
    1. D'haeseleer P. How does gene expression clustering work? Nat. Biotechnol. 2005;23:1499–1501. - PubMed
    1. Durbin R, et al. Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1999.
    1. Finn R, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34:D247–D251. - PMC - PubMed

Publication types