Brief Bioinform. 2011 Sep;12(5):379-91.
doi: 10.1093/bib/bbr030. Epub 2011 Jun 19.

Computational methods for Gene Orthology inference

David M Kristensen et al. Brief Bioinform. 2011 Sep.

Abstract

Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple 'tree-like' mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
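The heuristic idea of taking "the closest homologous pairs" as probable orthologs is often implemented as a search for bidirectional best hits (the BBHs of Figure 3). The following is a minimal Python sketch under stated assumptions, not the method of any particular tool: the score table, the gene identifiers (H1, M1, ...) and the function names are all hypothetical, and the scores stand in for whatever sequence-similarity measure is used.

```python
def best_hits(scores):
    """For each (query gene, target species), keep the top-scoring subject gene.

    `scores` maps (query, subject) -> similarity score, where each gene is a
    (species, gene_id) tuple. Ties keep the first hit encountered (an
    arbitrary choice made for this sketch).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query[0] == subject[0]:
            continue  # ignore hits within the same species
        key = (query, subject[0])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: hit for key, (hit, _) in best.items()}


def bidirectional_best_hits(scores):
    """Return gene pairs that are each other's best hit across two species."""
    best = best_hits(scores)
    pairs = set()
    for (query, _target_species), subject in best.items():
        # The pair qualifies only if the best hit of `subject` back in the
        # query's species is the query itself.
        if best.get((subject, query[0])) == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Hypothetical scores for two human and two mouse genes.
example_scores = {
    (("human", "H1"), ("mouse", "M1")): 200,
    (("human", "H1"), ("mouse", "M2")): 120,
    (("mouse", "M1"), ("human", "H1")): 195,
    (("mouse", "M1"), ("human", "H2")): 88,
    (("mouse", "M2"), ("human", "H1")): 118,
    (("human", "H2"), ("mouse", "M1")): 90,
}
example_pairs = bidirectional_best_hits(example_scores)
print(sorted(sorted(gene_id for _, gene_id in pair) for pair in example_pairs))
```

In this toy input only H1 and M1 choose each other; M2 and H2 each point at a gene whose own best hit lies elsewhere, so they are left unpaired.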


Figures

Figure 1:
Orthology, co-orthology and paralogy relationships in the evolution of four genes that arose from a single common ancestor.
Figure 2:
The reconciliation of the species tree (a) with an instance of a gene tree (b–d) allows for inference as to when evolutionary events such as speciation (T-branch), gene duplication (star-branch), or gene loss (X) occurred. (b) Gene tree with recent duplication, and evolutionary relationships shown for the genes in the shaded area. Because all three genes diverged from a single common ancestor, they would form a single orthologous group. (c) Gene tree with duplication preceding speciation event and evolutionary relationships shown for the genes in the shaded area. These four genes form two separate orthologous groups, corresponding to the two ancestral genes leading to each distinct gene lineage (Human1 and Mouse1, and Human2 and Mouse2). (d) Gene tree with duplication prior to speciation, followed by differential gene loss of Fly1 & Mouse2, where again all of the descendants of each of the two ancestral genes form an orthologous group.
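The reconciliation described in this caption can be sketched with the standard lowest-common-ancestor (LCA) mapping: each gene-tree node is mapped to the smallest species-tree clade containing all species below it, and a node is labeled a duplication when it maps to the same clade as one of its children, otherwise a speciation. The Python sketch below is illustrative only. The nested-tuple tree encoding, the function names and the exact topologies are assumptions, since the figure image itself is not reproduced here, and gene loss is not modeled.

```python
from itertools import combinations, product


def clades(tree):
    """Return (species under this subtree, list of every clade inside it)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    members, found = frozenset(), []
    for child in tree:
        child_members, child_found = clades(child)
        members |= child_members
        found += child_found
    return members, found + [members]


def lca(species, all_clades):
    """Smallest species-tree clade that contains every species in `species`."""
    return min((c for c in all_clades if species <= c), key=len)


def infer_orthologs(species_tree, gene_tree, species_of):
    """Label gene-tree nodes and collect pairs that coalesce at a speciation."""
    _, all_clades = clades(species_tree)
    events, orthologs = [], set()

    def visit(node):
        if isinstance(node, str):  # leaf: a single gene
            return [node], frozenset([species_of[node]])
        children = [visit(child) for child in node]
        genes = [g for child_genes, _ in children for g in child_genes]
        mapped = lca(frozenset(species_of[g] for g in genes), all_clades)
        # Duplication: the node maps to the same clade as one of its children.
        is_dup = any(child_map == mapped for _, child_map in children)
        events.append(("duplication" if is_dup else "speciation", tuple(genes)))
        if not is_dup:
            for (left, _), (right, _) in combinations(children, 2):
                orthologs.update(frozenset(p) for p in product(left, right))
        return genes, mapped

    visit(gene_tree)
    return events, orthologs


# Assumed species tree ((Human, Mouse), Fly) and a gene tree in the spirit of
# panel (c): a duplication that precedes the human-mouse speciation.
species_tree = (("Human", "Mouse"), "Fly")
species_of = {"Human1": "Human", "Human2": "Human",
              "Mouse1": "Mouse", "Mouse2": "Mouse"}
events_c, orthologs_c = infer_orthologs(
    species_tree, (("Human1", "Mouse1"), ("Human2", "Mouse2")), species_of)
print(events_c)
print(sorted(sorted(pair) for pair in orthologs_c))
```

With this input the root is labeled a duplication and the two inner nodes speciations, so only Human1/Mouse1 and Human2/Mouse2 are reported as orthologs, matching the two orthologous groups described for panel (c).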
Figure 3:
Grouping of genes in different species that are each other's BBHs into sets of orthologs and co-orthologs. (a) Graph representation of the evolutionary scenario shown in Figure 1, with genes represented as vertices and BBHs as edges. (b) A larger, completely connected orthologous group of six genes from five species. (c) An even larger group that contains some members that are not orthologs, in this case due to domain recombination, where the top members have one domain and the bottom members have another, non-homologous domain, but they were merged into the same group due to the middle members containing both domains (thus bridging the otherwise disconnected components in the BBH graph). Alternative scenarios of improper merging can involve differential gene loss or large, complex mixtures of in- and out-paralogs, but all three cases are due to the pair-wise procedure used to add members to groups, without considering the long-range relationships to the other members in the group.
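The bridging effect in panel (c) can be reproduced with a naive grouping that simply takes connected components of the BBH graph. This is a deliberately simple Python sketch, not the procedure of any specific orthology database: the gene names (A1, AB1, B1, ...) and both function names are hypothetical, with AB1 standing for a two-domain gene that has BBH edges to single-domain genes of both kinds.

```python
from itertools import combinations


def connected_groups(edges):
    """Group genes by connected component of an undirected BBH graph."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over everything reachable from `start`
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups, adjacency


def missing_links(group, adjacency):
    """Pairs inside a group that are NOT directly joined by a BBH edge."""
    return [(a, b) for a, b in combinations(sorted(group), 2)
            if b not in adjacency[a]]


# Hypothetical BBH edges: A* genes carry one domain, B* genes another,
# and AB1 carries both, so it links to each side.
bbh_edges = [("A1", "A2"), ("A1", "AB1"), ("A2", "AB1"),
             ("AB1", "B1"), ("AB1", "B2"), ("B1", "B2")]
groups, adjacency = connected_groups(bbh_edges)
print(len(groups), missing_links(groups[0], adjacency))
```

All five genes land in one group even though no A gene is a BBH of any B gene. A non-empty list of missing links is one simple signal that a group is not completely connected in the sense of panel (b) and may deserve closer inspection.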
