BMC Bioinformatics. 2008 Dec 4:9:518.

doi: 10.1186/1471-2105-9-518.

Algorithm of OMA for large-scale orthology inference

Alexander C J Roth¹, Gaston H Gonnet, Christophe Dessimoz

PMID: 19055798
PMCID: PMC2639434
DOI: 10.1186/1471-2105-9-518

Alexander C J Roth et al. BMC Bioinformatics. 2008.

Affiliation

¹ ETH Zurich, and Swiss Institute of Bioinformatics, Zurich, Switzerland. alexande@inf.ethz.ch

Erratum in

BMC Bioinformatics. 2009;10. doi: 10.1186/1471-2105-10-220

Abstract

Background: OMA is a project that aims to identify orthologs within publicly available, complete genomes. With 657 genomes analyzed to date, OMA is one of the largest projects of its kind.

Results: The OMA algorithm improves upon the standard bidirectional best-hit approach in several respects: it uses evolutionary distances instead of scores, considers distance inference uncertainty, includes many-to-many orthologous relations, and accounts for differential gene losses. Herein, we describe in detail the algorithm for inference of orthology and provide the rationale for parameter selection through multiple tests.

Conclusion: OMA introduces several novel improvements to orthology inference and provides a unique dataset of large-scale orthology assignments.

Figures

**Figure 1**
**Algorithm flow chart**. Boxes represent the steps of the algorithm, and arrows are the input and output data for each step.

**Figure 2**
**Tests to determine the optimum length criterion value**. The fraction of candidate pairs that pass the triangle inequality test and the fraction that have the same number of domains both increase with stricter (higher) length tolerance. In contrast, the number of predicted orthologous relationships decreases with stricter length tolerance. To consider only alignments that cover at least a fraction of the sequence length, a length criterion value of ℓ = 0.61 is used in this study.
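As a rough illustration of such a length filter, the sketch below assumes that the criterion compares the shorter aligned region against ℓ times the shorter full sequence. The caption does not state the exact formula, so this form, the function name, and the argument names are all assumptions made for illustration.

```python
LENGTH_TOLERANCE = 0.61  # value of ℓ quoted in the Figure 2 caption


def passes_length_criterion(aligned_len_1, aligned_len_2,
                            seq_len_1, seq_len_2,
                            tolerance=LENGTH_TOLERANCE):
    """Hypothetical length filter for a candidate pair.

    Assumed form (not taken from the source): the shorter aligned region
    must cover at least `tolerance` times the shorter full sequence.
    """
    shorter_sequence = min(seq_len_1, seq_len_2)
    if shorter_sequence <= 0:
        raise ValueError("sequence lengths must be positive")
    return min(aligned_len_1, aligned_len_2) >= tolerance * shorter_sequence
```

Under this assumed form, raising `tolerance` can only remove candidate pairs, which matches the caption's observation that stricter values yield fewer predicted relationships.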

**Figure 3**
**Different methods to find potential orthologs**. The mutual best alignment can be determined by similarity score or by evolutionary distance (columns), and with or without the use of a tolerance to include multiple orthologous relationships (rows).
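The distance-based variant with a tolerance can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the tolerance is a plain additive margin in distance units, whereas the abstract indicates that the actual algorithm accounts for distance inference uncertainty. The function name and input layout are hypothetical.

```python
def mutual_best_pairs(distances, tolerance=0.0):
    """Return pairs (x, y) that are mutually closest within a tolerance.

    `distances` maps (x, y) gene pairs between two genomes to an
    evolutionary distance. With tolerance=0.0 this reduces to a plain
    mutual-best (bidirectional best-hit) rule on distances; a positive
    tolerance lets one gene keep several partners (many-to-many).
    The additive tolerance is an assumption made for this sketch.
    """
    best_for_x = {}
    best_for_y = {}
    for (x, y), d in distances.items():
        best_for_x[x] = min(d, best_for_x.get(x, float("inf")))
        best_for_y[y] = min(d, best_for_y.get(y, float("inf")))
    return sorted(
        (x, y)
        for (x, y), d in distances.items()
        if d <= best_for_x[x] + tolerance and d <= best_for_y[y] + tolerance
    )


# Toy distances (made up for illustration).
toy = {("x1", "y1"): 10, ("x1", "y2"): 11, ("x2", "y1"): 30, ("x2", "y2"): 12}
strict_pairs = mutual_best_pairs(toy)            # one-to-one only
relaxed_pairs = mutual_best_pairs(toy, 1.5)      # allows extra partners
```

With the toy data, the strict call keeps only `("x1", "y1")`, while the relaxed call also keeps `("x1", "y2")` and `("x2", "y2")`.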

**Figure 4**
**Test to distinguish in-paralogs and out-paralogs**. A What is the relationship of sequences y₁ and y₂ with regard to x? Identify which branch to place the root by finding an out-group sequence z. B If the root is on the branch leading to x, then y₁ and y₂ are out-paralogs. C If duplication takes place after speciation, y₁ and y₂ are in-paralogs. D To test if y₁ and y₂ are in-paralogs, we confirm that the distance d of the internal branch is greater than zero.
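One standard way to estimate an internal branch length from pairwise distances is the four-point relation for a quartet. The sketch below applies it under the assumption that the quartet groups y₁ with y₂ on one side and x with z on the other; the caption does not say which estimator the authors use, so both the estimator and the function names are illustrative assumptions.

```python
def internal_branch_length(d_y1y2, d_xz, d_y1x, d_y2z, d_y1z, d_y2x):
    """Four-point estimate of the internal branch for ((y1, y2), (x, z)).

    On an additive tree with this topology, both cross sums
    d(y1,x) + d(y2,z) and d(y1,z) + d(y2,x) exceed the within-side sum
    d(y1,y2) + d(x,z) by twice the internal branch length. The two cross
    sums are averaged here to reduce the effect of noise.
    """
    cross = (d_y1x + d_y2z + d_y1z + d_y2x) / 2.0
    within = d_y1y2 + d_xz
    return (cross - within) / 2.0


def looks_like_inparalogs(d_y1y2, d_xz, d_y1x, d_y2z, d_y1z, d_y2x):
    """Hypothetical check: the internal branch estimate is above zero."""
    return internal_branch_length(d_y1y2, d_xz, d_y1x, d_y2z, d_y1z, d_y2x) > 0.0


# Additive toy tree: leaf branches y1=1, y2=2, x=3, z=4, internal branch=5.
estimate = internal_branch_length(
    d_y1y2=3, d_xz=7, d_y1x=9, d_y2z=11, d_y1z=10, d_y2x=10
)
```

On the toy tree the estimate recovers the internal branch of length 5. With real, noisy distances a threshold that accounts for estimation variance would likely be needed instead of a bare comparison with zero.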

**Figure 5**
**Value for stable pair tolerance parameter**. The fraction of stable pairs that pass the out-paralog test has a local optimum at the SP-tolerance 1.81, for five different length criteria. Increasing the tolerance value results in a larger fraction of stable pairs suspected as out-paralogs.

**Figure 6**
**Assignment of potential paralogs**. A In an evolutionary scenario, an ancestral gene is duplicated, followed by two speciation events, followed by asymmetrical gene loss of genes x₂ and y₁. The paralogous genes x₁ and y₂ could be mistaken for orthologs, but the duplicates are retained in genome Z, which can act as a witness of non-orthology. B Schematic for verifying a stable pair between x₁ and y₂ using genome Z. If (x₁, z₁) and (y₂, z₂) form stable pairs and are the closest relatives, then x₁ and y₂ are paralogs and the pair is not verified. C The only possible quartet formed when (x₁, z₁) and (y₂, z₂) are the closest related sequences is shown.
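The witness check in panel B can be sketched as a search over gene pairs in the third genome. This is a simplified illustration: "closest relatives" is interpreted here as a bare comparison of distances, without the statistical tolerance the full algorithm presumably applies, and the function name and data layout are hypothetical.

```python
def find_witness(x1, y2, z_genes, stable_pairs, dist):
    """Look for (z1, z2) in a third genome witnessing non-orthology.

    Simplified rule assumed for this sketch: z1 and z2 are distinct,
    (x1, z1) and (y2, z2) are stable pairs, and each is closer than the
    corresponding cross pairing (x1, z2) or (y2, z1).
    Returns the witness pair, or None if no witness is found.
    """
    for z1 in z_genes:
        for z2 in z_genes:
            if z1 == z2:
                continue
            if (x1, z1) not in stable_pairs or (y2, z2) not in stable_pairs:
                continue
            if dist[(x1, z1)] < dist[(x1, z2)] and dist[(y2, z2)] < dist[(y2, z1)]:
                return (z1, z2)
    return None


# Toy data (made up): z1 is close to x1, z2 is close to y2.
toy_dist = {
    ("x1", "z1"): 5, ("x1", "z2"): 20,
    ("y2", "z2"): 6, ("y2", "z1"): 21,
}
toy_stable = {("x1", "z1"), ("y2", "z2")}
witness = find_witness("x1", "y2", ["z1", "z2"], toy_stable, toy_dist)
```

If a witness is returned, the candidate pair (x₁, y₂) would be treated as paralogous rather than verified; if every third genome returns `None`, the pair is kept.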

**Figure 7**
**Value for verified pair tolerance parameter**. The fraction of verified pairs that pass the out-paralog test is drawn. The top curve is produced with the optimal values of the previous parameters, and the lower curves are produced at other parameter settings. The lower curves also have local optima, at values similar to that of the best curve (1.53).

**Figure 8**
A An example graph containing one 4-clique, four 3-cliques, and eight 2-cliques is provided. The highest scoring partition of the graph is {w₁, x₁, z₁}, {y₂, z₂}. B A possible evolutionary scenario corresponding to the graph.
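To make the clique counts concrete, the brute-force sketch below counts k-cliques in a small graph. The toy graph is hypothetical: it was chosen only because it has the same clique counts as those quoted in the caption, and it is not claimed to be the graph drawn in Figure 8. Exhaustive enumeration is shown for clarity and would not scale to genome-sized graphs.

```python
from itertools import combinations


def count_cliques(nodes, edges):
    """Count cliques of each size k >= 2 by exhaustive enumeration."""
    adjacent = {frozenset(edge) for edge in edges}
    counts = {}
    for k in range(2, len(nodes) + 1):
        n_cliques = sum(
            1
            for group in combinations(nodes, k)
            # A group is a clique if every pair inside it is connected.
            if all(frozenset(pair) in adjacent for pair in combinations(group, 2))
        )
        if n_cliques:
            counts[k] = n_cliques
    return counts


# Hypothetical toy graph: a 4-clique on a, b, c, d plus two pendant edges.
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("a", "e"), ("b", "f"),
]
clique_counts = count_cliques(toy_nodes, toy_edges)
```

For this toy graph the function reports eight 2-cliques (the edges), four 3-cliques, and one 4-clique.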

**Figure 9**
**Evolutionary relations and corresponding classes of pairs**. Pairs are classified hierarchically according to evolutionary relations. We seek class boundaries that capture the underlying evolutionary relations. Verified pairs are designed to cover all orthologs, and group pairs are a subset of the closest orthologs. Broken pairs are cases where paralogy is explicitly classified.

**Figure 10**
**Number of pairs reported after each step**. Each step of the algorithm reduces the number of pairs, and the largest reduction is observed with the formation of stable pairs.

**Figure 11**
**Distribution of group size**. The average group size is drawn for several versions of orthologous matrices. For large sets of genomes (e.g. All and Bacteria) very few groups are full (i.e. have one member from each genome).
