Phylogenetic and functional assessment of orthologs inference projects and methods

Adrian M Altenhoff¹, Christophe Dessimoz

PMID: 19148271
PMCID: PMC2612752
DOI: 10.1371/journal.pcbi.1000262

Adrian M Altenhoff et al. PLoS Comput Biol. 2009 Jan;5(1):e1000262.

doi: 10.1371/journal.pcbi.1000262. Epub 2009 Jan 16.

Affiliation

¹ Institute of Computational Science, ETH Zurich, and Swiss Institute of Bioinformatics, Zürich, Switzerland. adrian.altenhoff@inf.ethz.ch

Abstract

Accurate genome-wide identification of orthologs is a central problem in comparative genomics, a fact reflected by the numerous orthology identification projects developed in recent years. However, only a few reports have compared their accuracy, and indeed, several recent efforts have not yet been systematically evaluated. Furthermore, orthology is typically only assessed in terms of function conservation, despite the phylogeny-based original definition of Fitch. We collected and mapped the results of nine leading orthology projects and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene, RoundUp, EggNOG, and OMA) and two standard methods (bidirectional best-hit and reciprocal smallest distance). We systematically compared their predictions with respect to both phylogeny and function, using six different tests. This required the mapping of millions of sequences, the handling of hundreds of millions of predicted pairs of orthologs, and the computation of tens of thousands of trees. In phylogenetic analysis or in functional analysis where high specificity is required, we find that OMA and Homologene perform best. At lower functional specificity but higher coverage level, OrthoMCL outperforms Ensembl Compara, and to a lesser extent Inparanoid. Lastly, the large coverage of the recent EggNOG can be of interest to build broad functional grouping, but the method is not specific enough for phylogenetic or detailed function analyses. In terms of general methodology, we observe that the more sophisticated tree reconstruction/reconciliation approach of Ensembl Compara was at times outperformed by pairwise comparison approaches, even in phylogenetic tests. Furthermore, we show that standard bidirectional best-hit often outperforms projects with more complex algorithms. First, the present study provides guidance for the broad community of orthology data users as to which database best suits their needs. Second, it introduces new methodology to verify orthology. And third, it sets performance standards for current and future approaches.
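
For readers unfamiliar with the bidirectional best-hit (BBH) baseline named in the abstract, the following is a minimal sketch of the general idea, not the implementation used in the study. It assumes pairwise similarity scores between two genomes are already available as a dictionary; the function name, the input format, and the toy gene identifiers are all hypothetical.

```python
def bidirectional_best_hits(scores):
    """Return sorted pairs (a, b) where each gene is the other's best hit.

    `scores` maps (gene_a, gene_b) -> similarity score (higher is better).
    This input format is an assumption made for illustration only.
    """
    best_for_a = {}  # gene_a -> (score, gene_b)
    best_for_b = {}  # gene_b -> (score, gene_a)
    for (a, b), s in scores.items():
        if a not in best_for_a or s > best_for_a[a][0]:
            best_for_a[a] = (s, b)
        if b not in best_for_b or s > best_for_b[b][0]:
            best_for_b[b] = (s, a)
    # Keep a pair only if the best-hit relation holds in both directions.
    return sorted(
        (a, b) for a, (_, b) in best_for_a.items() if best_for_b[b][1] == a
    )


# Hypothetical toy scores between genome A (a1..a3) and genome B (b1..b3).
toy_scores = {
    ("a1", "b1"): 250.0,
    ("a1", "b2"): 90.0,
    ("a2", "b1"): 120.0,
    ("a2", "b2"): 80.0,
    ("a3", "b3"): 300.0,
}
bbh_pairs = bidirectional_best_hits(toy_scores)
```

In this toy input, `a2` has `b1` as its best hit, but `b1` prefers `a1`, so only `a1`/`b1` and `a3`/`b3` are reported. Ties are resolved here by keeping the first score seen, which is a simplification.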

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Number of complete genomes analyzed by the different projects.**

**Figure 2. Results of phylogenetic tree test.**
The mean fraction of correct splits of ML gene trees from three different kingdoms is shown. The higher the value, the better the gene trees agree with the species tree. On the left, the pairwise results between every project and OMA are shown, whereas on the right, the results for the comparison on the common set of proteins of a larger number of projects are shown. Note that the pairwise project comparisons are based on varying protein sets and thus cannot be compared to each other. Error bars indicate the 95% confidence intervals of the estimated means. Projects with too little appropriate data could not be evaluated, which explains the absent bars.

**Figure 3. Results of benchmarks from literature.**
Performance on manually curated gene trees from 4 published studies. (A) The pairwise outcome of every project against OMA is shown as the relative difference in true-positive rate between OMA and its counterpart project versus their relative difference in false-positive rate. (B) Performance for the protein intersection dataset. Shown is the true-positive rate (sensitivity) versus the false-positive rate (1 - specificity). In both plots, the error bars indicate the 95% confidence interval and the “better arrow” points in the direction of higher specificity and sensitivity. Projects lying in the gray area are dominated, in (A) by “OMA Pairwise” and in (B) by at least one other project.
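
The notion of a "dominated" project in panel (B) can be illustrated with a short sketch. It assumes the usual reading of dominance in a sensitivity versus false-positive-rate plot: one method dominates another if it is at least as good on both axes and strictly better on one. Confidence intervals are ignored here, and the method names and rates are hypothetical placeholders, not results from the study.

```python
def dominated_methods(points):
    """Return sorted names of methods dominated by at least one other method.

    `points` maps name -> (false_positive_rate, true_positive_rate).
    """
    def dominates(p, q):
        # p has no higher FPR and no lower TPR than q, and differs from q,
        # so it is strictly better on at least one of the two axes.
        return p[0] <= q[0] and p[1] >= q[1] and p != q

    return sorted(
        name
        for name, q in points.items()
        if any(dominates(p, q) for other, p in points.items() if other != name)
    )


# Hypothetical (FPR, TPR) values for three placeholder methods.
toy_points = {
    "method_A": (0.05, 0.60),
    "method_B": (0.10, 0.80),
    "method_C": (0.12, 0.70),
}
dominated = dominated_methods(toy_points)
```

Here `method_C` is dominated by `method_B`, which has both a lower false-positive rate and a higher true-positive rate, while `method_A` and `method_B` trade one axis against the other and neither dominates.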

**Figure 4. Results of functional based tests.**
Results of functional conservation tests for GO similarity, EC number, expression correlation, and gene neighborhood conservation. In the pairwise project comparisons (left), the relative difference in functional similarity between OMA and its counterpart project versus the relative difference in the number of predicted orthologs is shown. In the comparison on the intersection set (right), the mean functional similarity versus the number of predicted orthologs on the common set of sequences is shown. The vertical error bars in all results indicate the 95% confidence interval of the means. The “better arrow” indicates the direction towards higher specificity and sensitivity. Projects lying in the gray area are dominated by “OMA Pairwise” in the pairwise comparison (left) and by at least one other project in the intersection comparison (right).
