Nucleic Acids Res. 2007;35(7):2125-40.
doi: 10.1093/nar/gkl1114. Epub 2007 Mar 11.

Hierarchical classification of functionally equivalent genes in prokaryotes

Hongwei Wu et al. Nucleic Acids Res. 2007.

Abstract

Functional classification of genes is a fundamental problem in many biological studies. Most existing classification schemes are based on the concepts of homology and orthology, which were originally introduced to study gene evolution but might not be the most appropriate for gene function prediction, particularly at a high resolution level. We have recently developed a scheme for hierarchical classification of genes (HCGs) in prokaryotes. In the HCG scheme, the functional equivalence relationships among genes are first assessed through a careful application of both sequence similarity and genomic neighborhood information; genes are then classified into a hierarchical structure of clusters, where genes in each cluster are functionally equivalent at some resolution level, and the resolution increases as the clusters become smaller further down the hierarchy. The HCG scheme is validated through comparisons with the taxonomy of the prokaryotic genomes, Clusters of Orthologous Groups (COGs) of genes and the Pfam system. We have applied the HCG scheme to 224 complete prokaryotic genomes and constructed an HCG database consisting of a forest of 5339 multi-level and 15 770 single-level trees of gene clusters covering approximately 93% of the genes of these 224 genomes. The validation results indicate that the HCG scheme not only captures the key features of the existing classification schemes but also provides a much richer organization of genes, which can be used for functional prediction of genes at higher resolution and to help reveal the evolutionary traces of the genes.

Figures

Figure 1.
A graphical representation of 291 genes and their functional equivalence relationships (as measured by their BLASTP e-values). Each node represents a gene, and each edge indicates that the reciprocal BLASTP e-values between the two genes are ≤1.0. The layout of the nodes and edges is generated by using the Pajek Software (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), where the Euclidean distance between two genes and the darkness of their connecting edge both reflect their BLASTP e-value. That is, the smaller their BLASTP e-value is, the closer their two nodes are located, and the darker their connecting edge is. Most of these genes encode the two-component system regulatory proteins of either the sporulation or the chemotaxis family. See Tables S-1.1–S-1.5 in the Supplementary Data for descriptions of those genes that do not have accompanying IDs. Based on their COG, GO, Pfam and NCBI annotations, these genes fall into five different groups: cheB, cheY, spo0A and spo0F (each marked by its own symbol in the figure), and genes without further specifications (▪). Each dotted ellipse contains genes that form a cluster via the guilt-by-association rule when a certain percentage of insignificant (bottom) edges are removed, where an edge is less significant if it is associated with a higher BLASTP e-value. See Figure S-1 in the Supplementary Data for additional information on these genes and their functional equivalence relationships.
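The edge-removal idea in this caption can be illustrated with a minimal sketch. It assumes that "clusters" are simply the connected components left after the least significant edges are dropped; the function name, the `drop_fraction` parameter and the toy e-values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def cluster_after_pruning(edges, drop_fraction):
    """Group genes by guilt-by-association after pruning weak edges.

    edges: list of (gene_a, gene_b, e_value) tuples (hypothetical input format).
    drop_fraction: share of the least significant edges (highest e-values) to remove.
    Returns a list of sets, one per connected component.
    """
    ranked = sorted(edges, key=lambda e: e[2])  # most significant (smallest e-value) first
    n_drop = int(len(ranked) * drop_fraction)
    kept = ranked[: len(ranked) - n_drop]

    # Keep every gene as a node, even if all of its edges were pruned.
    nodes = {g for a, b, _ in edges for g in (a, b)}
    adjacency = defaultdict(set)
    for a, b, _ in kept:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(nodes):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(component)
    return clusters


# Toy data: the c-d edge is the weakest link between two tight groups.
toy_edges = [
    ("a", "b", 1e-50),
    ("b", "c", 1e-40),
    ("c", "d", 0.5),
    ("d", "e", 1e-30),
]
```

Calling `cluster_after_pruning(toy_edges, 0.25)` removes the single weakest edge and separates the toy genes into two groups, which mirrors how the dotted ellipses emerge as more bottom edges are removed.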
Figure 2.
A flowchart of the procedure for establishing the HCG.
Figure 3.
The MST-based hierarchical clustering algorithm: the first step is to determine the sequential representation of a graph by constructing an MST using Prim's algorithm, and the second step is to search for the valleys in the sequential representation.
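The two steps named in this caption can be sketched as follows. This is a simplified illustration under stated assumptions: edge weights are treated as distances (smaller means more similar), the graph is connected, and the valley search is replaced by a plain cutoff on the attaching edge weight. The function names and the `cutoff` parameter are hypothetical, and the authors' actual valley criterion is not given in this excerpt.

```python
import heapq


def prim_sequence(graph, start):
    """Return [(node, weight)] in the order Prim's algorithm adds nodes.

    graph: dict of dicts, graph[u][v] = distance (assumed symmetric, connected).
    weight is the MST edge that attached the node (0.0 for the start node).
    """
    visited = {start}
    sequence = [(start, 0.0)]
    heap = [(w, v) for v, w in graph[start].items()]
    heapq.heapify(heap)
    while heap:
        w, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        sequence.append((v, w))
        for nxt, w2 in graph[v].items():
            if nxt not in visited:
                heapq.heappush(heap, (w2, nxt))
    return sequence


def split_valleys(sequence, cutoff):
    """Split the sequence wherever the attaching weight exceeds cutoff.

    Runs of low weights ("valleys") between high-weight peaks become clusters.
    This threshold rule is a stand-in for a real valley search.
    """
    clusters = []
    current = []
    for node, weight in sequence:
        if current and weight > cutoff:
            clusters.append(current)
            current = []
        current.append(node)
    if current:
        clusters.append(current)
    return clusters


# Toy graph: {a, b, c} and {d, e} are tight groups joined by one long edge.
toy_graph = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 2.0, "b": 1.0, "d": 5.0},
    "d": {"c": 5.0, "e": 1.0},
    "e": {"d": 1.0},
}
```

Because Prim's algorithm exhausts a dense region before crossing a long edge, nodes of one cluster appear contiguously in the sequence, which is why a one-dimensional scan is enough to recover them.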
Figure 4.
(A) Distribution of the number of clusters per tree, where the parameters for the power-law function are A = 16 379 and k = 2.51, and the correlation coefficient between the power-law function and the real distribution curve is greater than 0.995; and (B) distribution of the depth of a cluster tree, where the parameters for the power-law function are A = 17 467 and k = 2.62, and the correlation coefficient between the power-law function and the real distribution curve is greater than 0.969.
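As a rough illustration of how parameters such as A and k might be obtained, the sketch below assumes the power-law form y = A·x^(−k) and fits it by least squares in log-log space, then reports the Pearson correlation between the fitted and observed values. Both the functional form and the fitting procedure are assumptions; the caption does not state how the authors fitted the curve, and the data here are synthetic.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = A * x**(-k) by linear regression on (log x, log y). Returns (A, k)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y)) / sum(
        (a - mean_x) ** 2 for a in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Synthetic counts generated from the caption's panel (A) parameters.
xs = list(range(1, 11))
ys = [16379 * x ** -2.51 for x in xs]
A, k = fit_power_law(xs, ys)
fitted = [A * x ** -k for x in xs]
r = pearson(ys, fitted)
```

On real histogram data the log-log fit weights small counts heavily, so a different estimator may give somewhat different values of A and k.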
Figure 5.
A hierarchical structure formed by HCG-10 and its descendant clusters, where most of the genes belonging to HCG-10 are annotated as DNA-binding regulatory genes. Each rectangular or circular node corresponds to a cluster, whereas a triangular node represents a group of genes that cannot be further clustered. The shape of a node reflects whether the cluster contains multiple genes from the same genome: a rectangle means yes and a circle means no. The color of a node reflects the taxonomic lineages of the genomes covered by the cluster: a solid color means that all the covered genomes belong to the taxonomic lineage for which the color stands, and a color with a white interior means that most (but not all) of the covered genomes belong to that lineage. The annotations accompanying the clusters are summarized from the NCBI annotations of the included genes.
Figure 6.
Genes of HCG-10.7 and their functional equivalence relationships (as measured by their f (·, ·) values), represented by nodes and edges, respectively. The layout of the nodes and edges is generated using the Pajek Software, where both the Euclidean distance between two genes and the darkness of their connecting edge reflect their f (·, ·) value. That is, the larger their f (·, ·) value is, the closer their two nodes are located, and the darker their connecting edge is. The red-colored genes, most of which are annotated as basR/pmrA, and the blue-colored genes, most of which are annotated as ygiX/qseB, are grouped into two different child clusters of HCG-10.7, whereas the green-colored genes are those that cannot be further grouped.
Figure 7.
Tree structure formed by cluster HCG-424 and its descendant clusters, where all 292 genes belonging to HCG-424 are annotated as ribonucleotide reductase genes. Each rectangular or circular node corresponds to a cluster, whereas a triangular node represents a group of genes that cannot be further clustered. The shape of a node reflects whether the cluster contains multiple genes from the same genome: a rectangle means yes and a circle means no. The color of a node reflects the taxonomic lineages of the genomes covered by the cluster: a solid color means that all the covered genomes belong to the taxonomic lineage for which the color stands, and a color with a white interior means that most (but not all) of the covered genomes belong to that lineage. The annotations accompanying the clusters are summarized from the NCBI annotations of the included genes.
Figure 8.
The distribution of d_taxonomy(g1, g2) for all the genes covered by the HCG prediction, where SK stands for super-kingdom, and "beyond" means that two genes do not even belong to the same super-kingdom.
Figure 9.
The distribution of d_taxonomy(g1, g2) at the root, middle and leaf levels of the HCG, relative to the background distribution of d_taxonomy(g1, g2), where SK stands for super-kingdom, and "beyond" means that two genes do not even belong to the same super-kingdom. Each bin represents the ratio between the percentage of the gene pairs at a particular HCG level and the percentage of the background gene pairs that have the same taxonomic distance level.
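The per-bin ratio described in this caption can be written out as a small sketch. It assumes each gene pair has already been assigned a taxonomic distance label; the function name and the labels other than "SK" and "beyond" (here "genus" and "phylum") are hypothetical placeholders, and the data are made up for illustration.

```python
from collections import Counter


def enrichment_ratios(level_distances, background_distances):
    """For each taxonomic distance bin, divide the share of gene pairs at one
    HCG level by the share of background gene pairs in the same bin.

    Bins absent from the background are skipped to avoid division by zero.
    """
    level_counts = Counter(level_distances)
    background_counts = Counter(background_distances)
    n_level = len(level_distances)
    n_background = len(background_distances)
    return {
        label: (level_counts[label] / n_level) / (count / n_background)
        for label, count in background_counts.items()
    }


# Made-up labels for gene pairs at one HCG level and in the background.
leaf_pairs = ["genus", "genus", "phylum", "SK"]
background_pairs = ["genus", "phylum", "phylum", "SK", "SK", "SK", "beyond", "beyond"]
ratios = enrichment_ratios(leaf_pairs, background_pairs)
```

A ratio above 1 indicates that gene pairs at that HCG level are over-represented at the given taxonomic distance relative to the background, and a ratio below 1 indicates under-representation.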
