Nucleic Acids Res. 2006 Jul 11;34(11):3309-16.
doi: 10.1093/nar/gkl433. Print 2006.

Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits

Christophe Dessimoz et al. Nucleic Acids Res. 2006.

Abstract

Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for the detection of non-orthologs that arise by mistake in current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates, rather than computationally expensive and error-prone tree-building methods. The accuracy of the algorithm is evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Our results show that a very significant fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analysis sensitive to correct orthology assignments will greatly benefit from these findings.

Figures

Figure 1
A simple evolutionary scenario under which the COG algorithm groups paralogous sequences.
Figure 2
Suitable case of a witness. A duplication occurred before all speciations and Z is a witness of the non-orthology between the sequences x1 and y2.
Figure 3
Unsuitable cases of witnesses. To the left, duplication occurred only in Z, and therefore z3 and z4 are in-paralogs with respect to (X, Y) and cannot act as witness of non-orthology. To the right, X speciated before the duplication event. Hence, x1 is orthologous to all three other proteins and cannot act as witness of non-orthology.
Figure 4
Unrooted phylogenetic consensus tree constructed from a Bayesian analysis of a subgroup from COG0508. Posterior probabilities are indicated to the right of the nodes and clan-supporting bootstrap values are indicated below the probability value. Predicted clans are indicated by the vertical bars on the right side. The leaf labels correspond to the following COG identifiers: Agrobacterium tumefaciens (2: AGl2719, 3: AGc4775, 4: AGc2641), Brucella melitensis (2: BMEII0746, 3: BMEI0141, 4: BMEI0856), Buchnera sp. (1: BU206, 3: BU303), E.coli K12 (COG identifier corresponds to the gene name: aceF, sucB), E.coli H7 (1: ECs0119, 3: ECs0752), Haemophilus influenzae (1: HI1232, 3: HI1661), Neisseria meningitidis (1: NMB1342, 3: NMB0956), Pasteurella multocida (1: PM0894, 3: PM0278), Pseudomonas aeruginosa (1: PA5016, 2: PA2249, 3: PA1586), Rhizobium loti (2: mll4471, 3: mll4300, 4a: mlr0385, 4b: mll3627), Rhizobium meliloti (2: SMc03203, 3a: SMc02483, 3b: SMb20019, 4: SMc01032), Rickettsia conorii (3: RC0226, 4: RC0764), Rickettsia prowazekii (3: RP179, 4: RP530), Vibrio cholerae (1: VC2413, 3: VC2086), Y.pestis (1: YPO3418, 3: YPO1114).
Figure 5
Unrooted phylogenetic consensus tree for COG0513, constructed from a Bayesian analysis. Posterior probabilities are drawn to the right of the nodes and clan-supporting bootstrap values are below the relevant nodes. The vertical bars to the right indicate the predicted clans. The leaf labels correspond to the COG identifiers: A.tumefaciens (2: AGl1362, 5: AGc4238, 6: AGc3366), B.melitensis (2: BMEI1824, 5: BMEI0934, 6: BMEI1035), E.coli K12 (COG identifier corresponds to the gene name: dbpA, deaD, rhlB, rhlE, srmB), H.influenzae (1: HI0422, 3: HI0231, 4: HI0892), P.multocida (1: PM1840, 3: PM1112, 4: PM1921), P.aeruginosa (2: PA0455, 3: PA2840, 4: PA3861, 5: PA0428), R.loti (2: mlr4393, 5: mlr0349, 6: mll0224), R.meliloti (2: SMc01090, 5: SMb20880, 6: SMc00522), V.cholerae (1: VC0660, 2: VC2564, 4: VC0305, 5: VCA0204), Y.pestis (1: YPO2708, 2: YPO1776, 3: YPO3488, 4: YPO3869).
Figure 6
Phylogenetic consensus tree rooted by outgroups for COG1113, constructed from a Bayesian analysis of a data subgroup from COG1113. Posterior probabilities of the Bayesian analysis are drawn to the right of the nodes and clan-supporting bootstrap values below relevant nodes. Predicted clans are indicated by vertical bars to the right. The leaf labels correspond to the COG identifiers: A.tumefaciens C58 (6: AGl2082), Bacillus halodurans (out: BH2171), B.melitensis (5: BMEII0038), E.coli K12 (COG identifier corresponds to the gene name: ansP, aroP, cycA, gabP, pheP, proY, yifK), E.coli H7 EDL933 (1: ZpheP, 2: ZaroP, 3: ZyifK, 4: ZproY, 5: ZcycA, 6: ZansP, 7: ZgabP), E.coli H7 (1: ECs0614, 2: ECs0116, 3: ECs4729, 4: ECs0452, 5: ECs5186, 6: ECs2057, 7: ECs3524), P.aeruginosa (2a: PA3000, 2b: PA0866, 4a: PA5097, 4b: PA0789, 7: PA0129, out: PA2079), Salmonella typhimurium LT2 (1: STM0568, 2: STM0150, 3: STM3930, 4: STM0400, 5: STM4398, 6: STM1584, 7: STM2793), Y.pestis (2a: YPO3421, 2b: YPO1743, 3: YPO3854, 4a: YPO3201, 4b: YPO4015, 5: YPO1859, 6: YPO1937).
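
The witness idea in Figures 2 and 3 can be illustrated with a small distance-based check. This is only a minimal sketch under stated assumptions, not the authors' published test: the function name `is_witness`, the helper `dist_between`, the `margin` parameter and the toy distances are all hypothetical, and the abstract does not give the actual decision rule or thresholds. The sketch assumes that, in the Figure 2 scenario, x1 is clearly closer to z3 than to z4 while y2 is clearly closer to z4 than to z3.

```python
from typing import Dict, Tuple

# Pairwise distance table keyed by an unordered pair of sequence labels.
Distances = Dict[Tuple[str, str], float]


def dist_between(dist: Distances, a: str, b: str) -> float:
    """Look up a symmetric pairwise distance regardless of key order."""
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]


def is_witness(dist: Distances, x1: str, y2: str, z3: str, z4: str,
               margin: float = 0.0) -> bool:
    """Return True if (z3, z4) from a third genome Z look like a witness
    of non-orthology between x1 and y2.

    Assumption (hypothetical rule, not taken from the paper): x1 must be
    closer to z3 than to z4 by more than `margin`, and y2 must be closer
    to z4 than to z3 by more than `margin`.
    """
    x_prefers_z3 = dist_between(dist, x1, z4) - dist_between(dist, x1, z3) > margin
    y_prefers_z4 = dist_between(dist, y2, z3) - dist_between(dist, y2, z4) > margin
    return x_prefers_z3 and y_prefers_z4


# Toy distances for a Figure 2-like tree ((x1, z3), (y2, z4)):
# the duplication precedes all speciations.
suitable: Distances = {
    ("x1", "z3"): 2.0, ("y2", "z4"): 2.0,
    ("x1", "z4"): 6.0, ("y2", "z3"): 6.0,
    ("x1", "y2"): 6.0, ("z3", "z4"): 6.0,
}

# Toy distances for the left case of Figure 3: duplication only in Z,
# so z3 and z4 are in-paralogs and equidistant from x1 and from y2.
unsuitable: Distances = {
    ("z3", "z4"): 2.0,
    ("x1", "z3"): 6.0, ("x1", "z4"): 6.0,
    ("y2", "z3"): 6.0, ("y2", "z4"): 6.0,
    ("x1", "y2"): 4.0,
}

suitable_result = is_witness(suitable, "x1", "y2", "z3", "z4", margin=1.0)
unsuitable_result = is_witness(unsuitable, "x1", "y2", "z3", "z4", margin=1.0)
```

With these toy values the Figure 2-like configuration passes the check and the in-paralog configuration does not. A real analysis would presumably replace the fixed `margin` with a test that accounts for the uncertainty of each distance estimate.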
