Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2006 Jan 25;34(2):647-58.
doi: 10.1093/nar/gkj448. Print 2006.

Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes

Affiliations

Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes

Ikuo Uchiyama. Nucleic Acids Res. .

Abstract

Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Hierarchical clustering as graph contraction. The best similarity edge is indicated by the thick line, and the missing edges to which a fixed score is assigned are indicated by the broken lines.
Figure 2
Figure 2
Overview of the domain splitting procedure. (A) A similarity graph that has 7 clusters (vertices) containing 12 sequences, which are constituted from six domains, A–F, each of which is 100 residues long. The numbers in brackets on each edge indicate the coordinates of the aligned segments. The best similarity edge, sim(s1,s2), that is selected for merging is colored red. At the bottom is the schematic illustration of the alignment between s1 and s2. (B) A similarity graph, after the two clusters have been merged and split. The newly created nodes are colored red. (C) The resulting clustering tree. The process of merging ABCD and BCDE, at the center of the figure, corresponds to the process shown in (A) and (B). In this case, CD is not actually split, because D is always adjacent to C, and neither is EF.
Figure 3
Figure 3
Detail of the domain-splitting procedure. Here, s1 and s2 are the sequences to be merged, and s3 is a sequence similar to either s1 or s2. The aligned segments on each sequence are indicated by rectangles which have the same pattern for corresponding segments. (A) Condition I: the domain boundaries will be extended. (B) Condition II: the sequence (in this case, s1) will be split. (C) Updating the alignment information. In this example, s1 and s2 (upper half) are merged and split into sM and sR1 (lower half). The thick lines indicate the extended regions. See text for details.
Figure 4
Figure 4
Post-processing of the clustering algorithm. (A) The phylogenetic-tree cutting procedure. The procedure recursively checks whether child clusters of each root node share intra-species paralogs. The three-letter code on each leaf indicates the species name, and the number at each internal node indicates the number of species shared by its child clusters (|Ph(A) ∩ Ph(B)|). (B) Rejoining adjacent clusters. Two clusters, A and B, are joined when most of the segments (cutoff ratio radj1) belonging to each cluster are adjacent to each other [the segment set adj(A,B)], or almost all of the segments (cutoff ratio radj2) belonging to the smaller clusters are adjacent to the other cluster.
Figure 5
Figure 5
Orthologous groups of RNA polymerase beta (RpoB) and beta' (RpoC) subunits. (A) Hierarchical clustering trees created by the DomClust program. Each domain is drawn in a different color. An abbreviated species name (taken from the COG database; see Supplementary Table S2) is shown on each leaf, which is colored according to the kingdom: salmon, bacteria; khaki, archaea; sky-blue, eukaryotes. (B) Schematic illustration of the gene structures of RpoB and RpoC in selected genomes.
Figure 6
Figure 6
Comparison of the various clustering methods with the COG recovery test, where evaluation was done according to the numbers of exact, phylo-, 80%- and 60%-MGPs against the 2360 WDOG reference set.
Figure 7
Figure 7
Stability of the clustering methods measured through a comparison of the clusters created from the COG02 and COG03 datasets. Five automatic classification methods and the original COG database (COG) were examined. (A) The ratios of the number of clusters created from the COG02 datasets classified into each category. (B) The ratios of the number of entries belonging to the clusters classified into each category. Note that ‘domain-merged’ and ‘domain-divided’ are counted only in DomClust and COG.
Figure 8
Figure 8
Orthologous groups containing the E.coli ModE protein, an example of DomClust groups whose domain organizations were changed between the COG02 and COG03 datasets. These clusters were generated using the cutoff score of 80. (A) Domain organization of the ModE orthologs generated from the COG02 dataset. (B) Domain organization of the ModE orthologs generated from the COG03 dataset. The genes added to the COG03 database but not to the COG02 database are indicated by italic letters. (C) Hierarchical clustering tree for Group A and Group B. The root node of each group is encircled. The intraspecies paralogs appearing in both subtrees of the root node of Group B are indicated by bold green text. Here, Hin (Haemophilus influenzae) and Pmu (Pasteurella multocida) are counted only once, since they are closely related organisms. Consequently, the effective number of overlapping organisms is 2, just half of the total number of organisms in each subtree. This does not satisfy the phylogenetic-tree cutting criterion with the cutoff value of 0.5. (D) Hierarchical clustering tree for Group C and Group D. The horizontal bars indicate the points of tree cutting, which generated Group C and Group D. The intraspecies paralogs appearing in both subtrees of the original root node of Group D are indicated by bold green text. In this case, the effective number of overlapping organisms is 4, which exceeds the cutoff of the tree cutting.

References

    1. Liu J., Rost B. Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol. 2003;7:5–11. - PubMed
    1. Ouzounis C.A., Coulson R.M., Enright A.J., Kunin V., Pereira-Leal J.B. Classification schemes for protein structure and function. Nature Rev. Genet. 2003;4:508–519. - PubMed
    1. Sonnhammer E.L., Eddy S.R., Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997;28:405–420. - PubMed
    1. Schultz J., Milpetz F., Bork P., Ponting C.P. SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl Acad. Sci. USA. 1998;95:5857–5864. - PMC - PubMed
    1. Sasson O., Vaaknin A., Fleischer H., Portugaly E., Bilu Y., Linial N., Linial M. ProtoNet: hierarchical classification of the protein space. Nucleic Acids Res. 2003;31:348–352. - PMC - PubMed

Publication types