BMC Bioinformatics. 2008 Dec 4:9:518.
doi: 10.1186/1471-2105-9-518.

Algorithm of OMA for large-scale orthology inference

Alexander C J Roth et al. BMC Bioinformatics.

Erratum in

  • BMC Bioinformatics. 2009;10. doi: 10.1186/1471-2105-10-220

Abstract

Background: OMA is a project that aims to identify orthologs within publicly available, complete genomes. With 657 genomes analyzed to date, OMA is one of the largest projects of its kind.

Results: The OMA algorithm improves upon the standard bidirectional best-hit approach in several respects: it uses evolutionary distances instead of scores, considers distance inference uncertainty, includes many-to-many orthologous relations, and accounts for differential gene losses. Herein, we describe the orthology inference algorithm in detail and provide the rationale for parameter selection through multiple tests.

Conclusion: OMA introduces several novel ideas for improving orthology inference and provides a unique dataset of large-scale orthology assignments.

Figures

Figure 1
Algorithm flow chart. Boxes represent the steps of the algorithm, and arrows are the input and output data for each step.
Figure 2
Tests to determine the optimum length criterion value. The fraction of candidate pairs that pass the triangle inequality test and the fraction that have the same number of domains both increase with stricter (higher) length tolerance. In contrast, the number of predicted orthologous relationships decreases with stricter length tolerance. To consider only alignments that cover at least a fraction of the sequence length, a length criterion value of ℓ = 0.61 is used in this study.
Figure 3
Different methods to find potential orthologs. The mutual best alignment can be determined by similarity score or by evolutionary distance (columns), and with or without the use of a tolerance to include multiple orthologous relationships (rows).
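The distance-based variant with a tolerance can be illustrated with a minimal sketch. The function name `mutual_best_pairs`, the `dist` dictionary layout, and the simple additive tolerance `tol` are assumptions made for illustration only; the caption does not specify how the tolerance is computed, and the abstract indicates that OMA additionally accounts for distance inference uncertainty, which this sketch does not model.

```python
def mutual_best_pairs(dist, tol=0.0):
    """Return gene pairs that are mutually closest, within a tolerance.

    dist maps (x, y) -> evolutionary distance between gene x of one genome
    and gene y of another. tol is a hypothetical additive tolerance: a pair
    is kept if its distance is within tol of the best distance seen for x
    and also within tol of the best distance seen for y.
    """
    best_for_x = {}
    best_for_y = {}
    for (x, y), d in dist.items():
        best_for_x[x] = min(d, best_for_x.get(x, float("inf")))
        best_for_y[y] = min(d, best_for_y.get(y, float("inf")))
    return sorted(
        (x, y)
        for (x, y), d in dist.items()
        if d <= best_for_x[x] + tol and d <= best_for_y[y] + tol
    )


# Toy distances (made up): x1 is close to both y1 and y2.
toy = {
    ("x1", "y1"): 0.10,
    ("x1", "y2"): 0.12,
    ("x2", "y1"): 0.50,
    ("x2", "y2"): 0.55,
}
strict = mutual_best_pairs(toy)             # no tolerance: one-to-one only
relaxed = mutual_best_pairs(toy, tol=0.05)  # tolerance admits a second pair
```

With zero tolerance only the single closest pair survives; a positive tolerance lets one gene keep several near-equidistant partners, which is how one-to-many and many-to-many relations can enter the candidate set.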
Figure 4
Test to distinguish in-paralogs and out-paralogs. (A) What is the relationship of sequences y1 and y2 with regard to x? Identify the branch on which to place the root by finding an out-group sequence z. (B) If the root is on the branch leading to x, then y1 and y2 are out-paralogs. (C) If duplication takes place after speciation, y1 and y2 are in-paralogs. (D) To test if y1 and y2 are in-paralogs, we confirm that the distance d of the internal branch is greater than zero.
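The internal-branch test in panel D can be sketched with the classical four-point estimate of a quartet's internal branch length. This is an illustrative stand-in rather than the estimator used by OMA: the function name, the distance layout, and the grouping of x with the out-group z opposite y1 and y2 are assumptions, and no distance uncertainty is considered.

```python
def internal_branch_length(d, x, z, y1, y2):
    """Four-point estimate of the internal branch of the quartet ((x, z), (y1, y2)).

    d maps frozenset({a, b}) -> pairwise evolutionary distance.
    For an additive tree with this topology, each cross-pair distance
    contains the internal branch once and each within-pair distance
    contains it zero times, so:
        internal = mean(cross distances) - (d(x, z) + d(y1, y2)) / 2
    """
    def dist(a, b):
        return d[frozenset((a, b))]

    cross = dist(x, y1) + dist(x, y2) + dist(z, y1) + dist(z, y2)
    within = dist(x, z) + dist(y1, y2)
    return cross / 4.0 - within / 2.0


# Toy additive distances (made up) from a tree with leaf branches
# x=0.10, z=0.20, y1=0.15, y2=0.05 and an internal branch of 0.30.
toy = {
    frozenset(("x", "z")): 0.30,
    frozenset(("y1", "y2")): 0.20,
    frozenset(("x", "y1")): 0.55,
    frozenset(("x", "y2")): 0.45,
    frozenset(("z", "y1")): 0.65,
    frozenset(("z", "y2")): 0.55,
}
branch = internal_branch_length(toy, "x", "z", "y1", "y2")
supports_in_paralogy = branch > 0.0
```

On real data the estimate is noisy, so a practical test would compare the branch length against its estimated error instead of against exactly zero.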
Figure 5
Value for stable pair tolerance parameter. The fraction of stable pairs that pass the out-paralog test has a local optimum at the SP-tolerance 1.81, for five different length criteria. Increasing the tolerance value results in a larger fraction of stable pairs suspected as out-paralogs.
Figure 6
Assignment of potential paralogs. (A) In an evolutionary scenario, an ancestral gene is duplicated, followed by two speciation events and then asymmetrical loss of genes x2 and y1. The paralogous genes x1 and y2 could be mistaken for orthologs, but both duplicates are retained in genome Z, which can act as a witness of non-orthology. (B) Schematic for verifying a stable pair between x1 and y2 using genome Z. If (x1, z1) and (y2, z2) form stable pairs and are the closest relatives, then x1 and y2 are paralogs and the pair is not verified. (C) The only possible quartet formed when (x1, z1) and (y2, z2) are the closest related sequences.
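The witness check in panel B can be sketched as a search over gene pairs in a third genome. Everything named here (`has_witness_of_non_orthology`, the `stable` set, the `dist` layout) is hypothetical, and "closest relatives" is approximated by plain distance comparisons; the source does not give the exact criterion, and a real implementation would presumably account for distance uncertainty.

```python
def has_witness_of_non_orthology(x1, y2, genes_in_z, stable, dist):
    """Return True if genome Z holds two genes that witness paralogy of (x1, y2).

    stable is a set of frozenset({a, b}) stable pairs; dist maps
    frozenset({a, b}) -> evolutionary distance. A witness is a pair of
    distinct genes z1, z2 in Z such that (x1, z1) and (y2, z2) are stable
    pairs, x1 is closer to z1 than to z2, and y2 is closer to z2 than to z1.
    """
    def d(a, b):
        return dist[frozenset((a, b))]

    for z1 in genes_in_z:
        for z2 in genes_in_z:
            if z1 == z2:
                continue
            if frozenset((x1, z1)) not in stable:
                continue
            if frozenset((y2, z2)) not in stable:
                continue
            if d(x1, z1) < d(x1, z2) and d(y2, z2) < d(y2, z1):
                return True
    return False


# Toy data (made up): each of x1 and y2 has its own close partner in Z.
toy_dist = {
    frozenset(("x1", "z1")): 0.2,
    frozenset(("y2", "z2")): 0.2,
    frozenset(("x1", "z2")): 0.8,
    frozenset(("y2", "z1")): 0.8,
    frozenset(("x1", "y2")): 0.8,
    frozenset(("z1", "z2")): 0.8,
}
toy_stable = {
    frozenset(("x1", "y2")),
    frozenset(("x1", "z1")),
    frozenset(("y2", "z2")),
}
broken = has_witness_of_non_orthology("x1", "y2", ["z1", "z2"], toy_stable, toy_dist)
```

If genome Z had also lost one of the duplicates, no witness would exist and the stable pair between x1 and y2 would pass this check.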
Figure 7
Value for verified pair tolerance parameter. The fraction of verified pairs that pass the out-paralog test is shown. The top curve is produced with the optimal values of the previous parameters. The lower curves, produced at other parameter settings, also have local optima at a similar tolerance value (1.53).
Figure 8
(A) An example graph containing one 4-clique, four 3-cliques, and eight 2-cliques. The highest scoring partition of the graph is {w1, x1, z1}, {y2, z2}. (B) A possible evolutionary scenario corresponding to the graph.
Figure 9
Evolutionary relations and corresponding classes of pairs. Pairs are classified hierarchically according to evolutionary relations. We seek class boundaries that capture the underlying evolutionary relations. Verified pairs are designed to cover all orthologs, and group pairs are a subset of the closest orthologs. Broken pairs are cases where paralogy is explicitly classified.
Figure 10
Number of pairs reported after each step. Each step of the algorithm reduces the number of pairs, and the largest reduction is observed with the formation of stable pairs.
Figure 11
Distribution of group size. The average group size is drawn for several versions of orthologous matrices. For large sets of genomes (e.g. All and Bacteria) very few groups are full (i.e. have one member from each genome).
