Nucleic Acids Res. 2007;35(7):2125-40.

doi: 10.1093/nar/gkl1114. Epub 2007 Mar 11.

Hierarchical classification of functionally equivalent genes in prokaryotes

Hongwei Wu¹, Fenglou Mao, Victor Olman, Ying Xu

Affiliation

¹ Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.

PMID: 17353185
PMCID: PMC1874638
DOI: 10.1093/nar/gkl1114

Hongwei Wu et al. Nucleic Acids Res. 2007.

Abstract

Functional classification of genes represents a fundamental problem in many biological studies. Most of the existing classification schemes are based on the concepts of homology and orthology, which were originally introduced to study gene evolution but might not be the most appropriate for gene function prediction, particularly at a high resolution level. We have recently developed a scheme for hierarchical classification of genes (HCG) in prokaryotes. In the HCG scheme, the functional equivalence relationships among genes are first assessed through a careful application of both sequence similarity and genomic neighborhood information; genes are then classified into a hierarchical structure of clusters, where genes in each cluster are functionally equivalent at some resolution level, and the resolution increases as the clusters become smaller further down the hierarchy. The HCG scheme is validated through comparisons with the taxonomy of the prokaryotic genomes, Clusters of Orthologous Groups (COGs) of genes and the Pfam system. We have applied the HCG scheme to 224 complete prokaryotic genomes, and constructed an HCG database consisting of a forest of 5339 multi-level and 15 770 single-level trees of gene clusters covering approximately 93% of the genes of these 224 genomes. The validation results indicate that the HCG scheme not only captures the key features of the existing classification schemes but also provides a much richer organization of genes, which can be used for functional prediction of genes at higher resolution and to help reveal the evolutionary trace of the genes.

Figures

**Figure 1.**
A graphical representation of 291 genes and their functional equivalence relationships (as measured by their BLASTP e-values). Each node represents a gene, and each edge indicates that the reciprocal BLASTP e-values between the two genes are ≤1.0. The layout of the nodes and edges is generated by using the Pajek Software (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), where the Euclidean distance between two genes and the darkness of their connecting edge both roughly reflect their BLASTP e-value. That is, the smaller their BLASTP e-value is, the closer their two nodes are located, and the darker their connecting edge is. Most of these genes encode the two-component system regulatory proteins of either the *sporulation* or the *chemotaxis* family. See Tables S-1.1–S-1.5 in the Supplementary Data for descriptions of those genes that do not have accompanying IDs. Based on their COG, GO, Pfam and NCBI annotations, these genes fall into five different groups: *cheB*, *cheY*, *spo*0A, *spo*0F, and genes without further specifications (▪). Each dotted ellipse contains genes that form a cluster via the *guilty-by-association* rule when a certain percentage of insignificant (bottom) edges are removed, where an edge is less significant if it is associated with a higher BLASTP e-value. See Figure S-1 in the Supplementary Data for additional information on these genes and their functional equivalence relationships.
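
The edge-removal step described in this caption can be illustrated with a minimal sketch. It assumes that clusters are simply the connected components left after dropping a chosen fraction of the highest e-value edges; the function name `clusters_after_pruning`, the gene labels and the e-values are all hypothetical toy data, not taken from the paper.

```python
from collections import defaultdict

def clusters_after_pruning(edges, drop_fraction):
    """Drop the given fraction of edges with the highest e-values, then
    return the connected components of the remaining graph.

    edges: list of (gene_a, gene_b, e_value) tuples (toy input format).
    """
    ranked = sorted(edges, key=lambda e: e[2])        # most significant first
    n_drop = int(len(ranked) * drop_fraction)         # number of "bottom" edges
    kept = ranked[: len(ranked) - n_drop]

    nodes = {g for a, b, _ in edges for g in (a, b)}  # keep isolated genes too
    adj = defaultdict(set)
    for a, b, _ in kept:
        adj[a].add(b)
        adj[b].add(a)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                                  # depth-first traversal
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Hypothetical toy graph: two tight groups joined by two weak edges.
toy_edges = [
    ("cheY_1", "cheY_2", 1e-40),
    ("cheY_2", "cheY_3", 1e-35),
    ("spo0A_1", "spo0A_2", 1e-50),
    ("cheY_3", "spo0A_1", 1e-3),
    ("cheY_1", "spo0A_2", 0.5),
]
unpruned = clusters_after_pruning(toy_edges, 0.0)
pruned = clusters_after_pruning(toy_edges, 0.4)
```

With no pruning the toy graph is a single component; removing the two least significant edges separates the two groups.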

**Figure 2.**
A flowchart of the procedure for establishing the HCG.

**Figure 3.**
The MST-based hierarchical clustering algorithm: the first step is to determine the sequential representation of a graph by constructing an MST using Prim's algorithm, and the second step is to search for the valleys in the sequential representation.
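
The two steps named in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: edge weights are treated as distances (smaller means more closely related), the sequential representation is taken to be the order in which Prim's algorithm adds nodes together with the weight of the edge that attached each node, and the valley search is replaced by a simple fixed threshold. The names `prim_sequence` and `split_into_valleys` and the toy data are hypothetical.

```python
import heapq

def prim_sequence(dist, start=0):
    """Run Prim's algorithm on a full distance matrix and return
    (order, weights): the order in which nodes join the tree and the
    weight of the edge that attached each node (0.0 for the start node)."""
    n = len(dist)
    in_tree = [False] * n
    order, weights = [], []
    heap = [(0.0, start)]
    while heap:
        w, v = heapq.heappop(heap)
        if in_tree[v]:
            continue                      # stale heap entry
        in_tree[v] = True
        order.append(v)
        weights.append(w)
        for u in range(n):
            if not in_tree[u]:
                heapq.heappush(heap, (dist[v][u], u))
    return order, weights

def split_into_valleys(order, weights, threshold):
    """Cut the sequence wherever the attaching edge exceeds the threshold;
    each run of low weights (a 'valley') becomes one cluster."""
    clusters, current = [], []
    for node, w in zip(order, weights):
        if current and w > threshold:
            clusters.append(current)
            current = []
        current.append(node)
    if current:
        clusters.append(current)
    return clusters

# Hypothetical toy data: five points on a line forming two obvious groups.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in positions] for a in positions]
order, weights = prim_sequence(dist)
clusters = split_into_valleys(order, weights, threshold=5.0)
```

In the toy example the single large weight in the sequence marks the boundary between the two groups. Applying the same split recursively inside each cluster with a lower threshold would give a hierarchy, though how the paper chooses its cut points is not specified in this caption.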

**Figure 4.**
(A) Distribution of the number of clusters per tree, where the parameters for the power-law function are A = 16 379 and k = 2.51, and the correlation coefficient between the power-law function and the real distribution curve is greater than 0.995; and (B) distribution of the depth of a cluster tree, where the parameters for the power-law function are A = 17 467 and k = 2.62, and the correlation coefficient between the power-law function and the real distribution curve is greater than 0.969.
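
The caption gives the fitted parameters but not the functional form. The sketch below assumes the common form y = A · x^(−k) and shows how such parameters could be estimated by least squares in log-log space. The data points are synthetic, generated from the assumed form with the panel (A) values, and the name `fit_power_law` is hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(A) - k * log(x).
    Returns (A, k). Requires strictly positive xs and ys."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic points drawn exactly from the assumed form (not real data).
A_true, k_true = 16379.0, 2.51
xs = list(range(1, 11))
ys = [A_true * x ** (-k_true) for x in xs]
A_hat, k_hat = fit_power_law(xs, ys)
```

Because the synthetic points lie exactly on the curve, the fit recovers the input parameters up to floating-point error; real count distributions would be noisier.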

**Figure 5.**
A hierarchical structure formed by HCG-10 and its descendant clusters, where most of the genes belonging to HCG-10 are annotated as *DNA-binding regulatory genes*. Each rectangular or circular node corresponds to a cluster, whereas a triangular node represents a group of genes that cannot be further clustered. The shape of a node reflects whether the cluster contains multiple genes from the same genome, with a rectangle standing for *yes* and a circle standing for *no*. The color of a node reflects the taxonomic lineages of the genomes covered by the cluster: a solid color indicates that all the covered genomes belong to the taxonomic lineage for which the color stands, and a color with a white interior indicates that most (but not all) of the covered genomes belong to that lineage. The annotations accompanying the clusters are summarized from the NCBI annotations of the included genes.

**Figure 6.**
Genes of HCG-10.7 and their functional equivalence relationships (as measured by their f (·, ·) values), as represented by nodes and edges, respectively. The layout of the nodes and edges is generated using the Pajek Software, where both the Euclidean distance between two genes and the darkness of their connecting edge are roughly proportional to their f (·, ·) value. That is, the larger their f (·, ·) value is, the closer their two nodes are located, and the darker their connecting edge is. The red-colored genes, most of which are annotated as *basR*/*pmrA*, and the blue-colored genes, most of which are annotated as *ygiX*/*qseB*, are grouped into two different child clusters of HCG-10.7; whereas, the green-colored genes are those that cannot be further grouped.

**Figure 7.**
Tree structure formed by cluster HCG-424 and its descendant clusters, where all the 292 genes belonging to HCG-424 are annotated as *ribonucleotide reductase genes*. Each rectangular or circular node corresponds to a cluster, whereas a triangular node represents a group of genes that cannot be further clustered. The shape of a node reflects whether the cluster contains multiple genes from the same genome, with a rectangle standing for *yes* and a circle standing for *no*. The color of a node reflects the taxonomic lineages of the genomes covered by the cluster: a solid color indicates that all the covered genomes belong to the taxonomic lineage for which the color stands, and a color with a white interior indicates that most (but not all) of the covered genomes belong to that lineage. The annotations accompanying the clusters are summarized from the NCBI annotations of the included genes.

**Figure 8.**
The distribution of d_taxonomy(g₁, g₂) for all the genes being covered by the HCG prediction, where SK stands for super-kingdom, and *beyond* means that two genes do not even belong to the same super-kingdom.

**Figure 9.**
The distribution of d_taxonomy(g₁, g₂) at the root, middle and leaf levels of the HCG, relative to the background distribution of d_taxonomy(g₁, g₂), where SK stands for super-kingdom, and *beyond* means that two genes do not even belong to the same super-kingdom. Each bin represents the ratio between the percentage of the gene pairs at a particular HCG level and the percentage of the background gene pairs that have the same taxonomic distance level.
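
The per-bin ratio described in this caption can be written out as a short sketch. It assumes each bin is the share of gene pairs at one HCG level with a given taxonomic distance, divided by the share of background pairs with the same distance. The function name `enrichment_ratios`, the bin labels and the counts are hypothetical toy values.

```python
def enrichment_ratios(level_counts, background_counts):
    """For each taxonomic-distance bin, return
    (fraction of pairs at this HCG level) / (fraction of background pairs).
    Bins with no background pairs are skipped to avoid division by zero."""
    level_total = sum(level_counts.values())
    background_total = sum(background_counts.values())
    ratios = {}
    for bin_name, bg in background_counts.items():
        if bg == 0:
            continue
        level_share = level_counts.get(bin_name, 0) / level_total
        background_share = bg / background_total
        ratios[bin_name] = level_share / background_share
    return ratios

# Hypothetical counts of gene pairs per taxonomic-distance bin.
leaf_level = {"same SK": 80, "beyond": 20}
background = {"same SK": 50, "beyond": 50}
ratios = enrichment_ratios(leaf_level, background)
```

A ratio above 1 means that the distance bin is over-represented at that HCG level relative to the background, and a ratio below 1 means it is under-represented.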

Cited by

HGD: an integrated homologous gene database across multiple species.
Duan G, Wu G, Chen X, Tian D, Li Z, Sun Y, Du Z, Hao L, Song S, Gao Y, Xiao J, Zhang Z, Bao Y, Tang B, Zhao W. Nucleic Acids Res. 2023 Jan 6;51(D1):D994-D1002. doi: 10.1093/nar/gkac970. PMID: 36318261. Free PMC article.
