BMC Bioinformatics. 2003 Sep 11;4:41.
doi: 10.1186/1471-2105-4-41. Epub 2003 Sep 11.

The COG database: an updated version includes eukaryotes

Roman L Tatusov et al. BMC Bioinformatics. 2003.

Abstract

Background: The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.

Results: We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes, and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant (Arabidopsis thaliana), two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or approximately 54% of the 110,655 analyzed eukaryotic gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included in the KOGs; addition of new eukaryotic genomes is expected to result in a substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core that is represented in all analyzed species and makes up approximately 20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (approximately 1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.

Conclusion: The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
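
The coverage percentages quoted in the Results can be checked directly from the protein counts given there. The short Python sketch below does only that arithmetic. The variable and function names are illustrative assumptions, and the core sizes are rough estimates derived from the stated "approximately 20%" and "approximately 1%", not counts reported in the abstract.

```python
# Counts quoted in the Results section of the abstract.
COG_PROTEINS = 138_458        # proteins assigned to COGs
COG_TOTAL_PROTEINS = 185_505  # predicted proteins in the 66 unicellular genomes
COG_CLUSTERS = 4873

KOG_PROTEINS = 59_838         # proteins assigned to KOGs
KOG_TOTAL_PROTEINS = 110_655  # analyzed eukaryotic gene products
KOG_CLUSTERS = 4852


def coverage(in_clusters: int, total: int) -> float:
    """Fraction of predicted proteins that belong to some cluster."""
    return in_clusters / total


cog_coverage = coverage(COG_PROTEINS, COG_TOTAL_PROTEINS)
kog_coverage = coverage(KOG_PROTEINS, KOG_TOTAL_PROTEINS)

# Rough cluster counts implied by the approximate core fractions
# (assumption: the quoted percentages are applied to the full cluster sets).
approx_kog_core = round(0.20 * KOG_CLUSTERS)
approx_cog_core = round(0.01 * COG_CLUSTERS)

print(f"COG coverage: {cog_coverage:.1%}")
print(f"KOG coverage: {kog_coverage:.1%}")
print(f"Approximate conserved KOG core: {approx_kog_core} clusters")
print(f"Approximate ubiquitous COG core: {approx_cog_core} clusters")
```

The computed fractions agree with the rounded values in the text (about 75% for COGs and about 54% for KOGs).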

Figures

Figure 1
Phyletic patterns of COGs. All, represented in all unicellular organisms included in the COG system; All archaea, All bacteria, All eukaryotes, represented in each species from the respective domain of life (and possibly in some species from other domains); All bacteria except the smallest, represented in all bacteria except, possibly, parasites with small genomes (mycoplasma, chlamydia, rickettsia, and spirochetes).
Figure 2
Phyletic patterns of KOGs. All, include representatives from each of the 7 analyzed species; All-Ec, include representatives from each of 6 species other than Encephalitozoon cuniculi; All animals, include representatives from three animal genomes only; All fungi, include representatives from two fungal genomes only.
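
The pattern classes named in the Figure 2 caption can be illustrated by treating a phyletic pattern as the set of species represented in a KOG. The following is a minimal Python sketch under that assumption. The three-letter species codes and the function name `classify_pattern` are hypothetical choices made for this example, not identifiers from the COG/KOG resource.

```python
# Hypothetical three-letter codes for the 7 analyzed eukaryotic species.
ANIMALS = frozenset({"Cel", "Dme", "Hsa"})  # C. elegans, D. melanogaster, H. sapiens
PLANT = frozenset({"Ath"})                  # A. thaliana
FUNGI = frozenset({"Sce", "Spo"})           # S. cerevisiae, S. pombe
MICROSPORIDIAN = frozenset({"Ecu"})         # E. cuniculi
ALL_SPECIES = ANIMALS | PLANT | FUNGI | MICROSPORIDIAN


def classify_pattern(present):
    """Map the set of species represented in a KOG to a Figure 2 class."""
    species = frozenset(present)
    unknown = species - ALL_SPECIES
    if unknown:
        raise ValueError(f"unknown species codes: {sorted(unknown)}")
    if species == ALL_SPECIES:
        return "All"
    if species == ALL_SPECIES - MICROSPORIDIAN:
        return "All-Ec"
    if species == ANIMALS:
        return "All animals"
    if species == FUNGI:
        return "All fungi"
    return "Other"


print(classify_pattern(["Cel", "Dme", "Hsa", "Ath", "Sce", "Spo"]))  # All-Ec
print(classify_pattern(["Sce", "Spo"]))                              # All fungi
```

Patterns that match none of the four named classes fall into a catch-all "Other" label, which is an addition of this sketch.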
Figure 3
An example of a complex eukaryotic KOG: globins and related hemoproteins. The systematic protein names of the KOG members are listed under each species. To the left of the KOG proper is the similarity dendrogram produced from the BLAST scores between the KOG members. This is a crude clustering, which should not be construed as a phylogenetic tree.
Figure 4
Functional classification of prokaryotic (COGs) and eukaryotic (KOGs) clusters of orthologs. Designations of functional categories: A, RNA processing and modification (not used for prokaryotic COGs); B, chromatin structure and dynamics; C, energy production and conversion; D, cell cycle control and mitosis; E, amino acid metabolism and transport; F, nucleotide metabolism and transport; G, carbohydrate metabolism and transport; H, coenzyme metabolism; I, lipid metabolism; J, translation; K, transcription; L, replication and repair; M, cell wall/membrane/envelope biogenesis; N, cell motility; O, post-translational modification, protein turnover, chaperone functions; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport and catabolism; T, signal transduction; U, intracellular trafficking and secretion; Y, nuclear structure (not applicable to prokaryotic COGs); Z, cytoskeleton (not applicable to prokaryotic COGs); R, general functional prediction only (typically, prediction of biochemical activity); S, function unknown. The numbers were obtained after subtracting the COGs that consisted entirely of proteins from unicellular eukaryotes from the COG collection.
Figure 5
Examples of phyletic pattern search. (A) COGs represented in Encephalitozoon cuniculi but missing in the two yeasts. (B) COGs represented in Yersinia pestis but not in other Proteobacteria or eukaryotes. The sets of species included in COGs are color-coded as follows (from left to right): yellow, archaea; purple, eukaryotes; green, miscellaneous bacteria, including hyperthermophiles, cyanobacteria, Fusobacterium, and Deinococcus; dark yellow, actinobacteria; turquoise, low-GC Gram-positive bacteria (except for mycoplasmas); light blue, Gamma-proteobacteria; dark blue, Beta- and Epsilon-proteobacteria; dark gray, Alpha-proteobacteria; green, chlamydia and spirochetes; dark green, mycoplasmas. The functional categories, designated as in Fig. 4, are also color-coded.
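
A phyletic pattern search like query (A) in the Figure 5 caption amounts to filtering clusters by which species must be present and which must be absent. The Python sketch below shows that idea on made-up data. The cluster identifiers, species codes, and the function `search_patterns` are all hypothetical and do not reflect the actual search tool or its data.

```python
def search_patterns(clusters, present=(), absent=()):
    """Return IDs of clusters that contain every species in `present`
    and none of the species in `absent`."""
    required = set(present)
    forbidden = set(absent)
    return sorted(
        cluster_id
        for cluster_id, species in clusters.items()
        if required <= set(species) and forbidden.isdisjoint(species)
    )


# Toy data: cluster ID -> species represented (illustrative only).
toy_clusters = {
    "cluster_1": {"Ecu", "Hsa", "Dme"},
    "cluster_2": {"Ecu", "Sce", "Spo", "Hsa"},
    "cluster_3": {"Sce", "Spo"},
    "cluster_4": {"Ecu", "Spo"},
}

# Analogue of query (A): present in E. cuniculi, missing in both yeasts.
hits = search_patterns(toy_clusters, present=["Ecu"], absent=["Sce", "Spo"])
print(hits)  # ['cluster_1']
```

Note that the `absent` filter rejects a cluster if any of the listed species is represented, so `cluster_4`, which contains one of the two yeasts, is excluded.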
