BMC Bioinformatics. 2011 Apr 28:12:124.

doi: 10.1186/1471-2105-12-124.

Proteinortho: detection of (co-)orthologs in large-scale analysis

Marcus Lechner¹, Sven Findeiss, Lydia Steiner, Manja Marz, Peter F Stadler, Sonja J Prohaska

PMID: 21526987
PMCID: PMC3114741
DOI: 10.1186/1471-2105-12-124

Marcus Lechner et al. BMC Bioinformatics. 2011.

Affiliation

¹ RNA Bioinformatics Group, Department of Pharmaceutical Chemistry, Philipps-University Marburg, Germany. lechner@staff.uni-marburg.de

Abstract

Background: Orthology analysis is an important part of data analysis in many areas of bioinformatics, such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools, as brute-force approaches with quadratic memory requirements become infeasible in practice. Furthermore, the rapid pace at which new data become available makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases.

Results: The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of 717 eubacterial genomes available at NCBI at the beginning of 2009 and identify thirty proteins present in 99% of all bacterial proteomes.

Conclusions: Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.

Figures

**Figure 1**
**Orthology relations**. Idealized dataset for two species A and B. Proteins x ∈ A and y ∈ B are depicted by open boxes. Orthology relations between proteins x and y are represented by grey shadows. Arrows indicate alignments above a certain cut-off from the search of x against B. Solid lines refer to the best alignments. Cases (1), (2), and (3) cannot occur by definition in an idealized dataset, but they do appear in real-life applications.

**Figure 2**
**Adaptive RBAH**. Reciprocal best alignment heuristic (RBAH). (a) If each species contains a pair of divergent co-orthologs (x₁ and its co-ortholog in one species, y₁ and its co-ortholog in the other), it is possible that there are no *reciprocal* best BLAST alignments. In this situation, RBAH will not identify any orthologs. (b) One possible remedy is to include the second-best BLAST alignment (n = 2). However, in this case highly similar orthologs (x₂ and y₂ as well as x₃ and y₃), which in principle can be clearly separated, can be merged. (c) Proteinortho uses an adaptive approach that (1) is flexible with respect to the number of more diverged orthologs in the absence of a reciprocal best BLAST alignment and (2) will not intermix orthologous groups that can be disentangled easily because of large differences in pairwise similarity.
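
The idea in this caption can be sketched in a few lines of Python. This is a minimal illustration, not Proteinortho's implementation: it assumes that "adaptive" means accepting every hit whose score is at least a fraction `f` of the query's best score, and the function names, the parameter `f`, and all scores are hypothetical.

```python
def near_best_hits(scores, f=1.0):
    """Map each query to the targets scoring at least f * its best score.

    scores: dict mapping (query, target) -> alignment score.
    f = 1.0 keeps only the best hit(s); f < 1.0 is the assumed adaptive cutoff.
    """
    by_query = {}
    for (query, target), score in scores.items():
        by_query.setdefault(query, {})[target] = score
    return {
        query: {t for t, s in hits.items() if s >= f * max(hits.values())}
        for query, hits in by_query.items()
    }


def reciprocal_pairs(a_vs_b, b_vs_a, f=1.0):
    """Pairs (x, y) where y is a near-best hit of x and x a near-best hit of y."""
    forward = near_best_hits(a_vs_b, f)
    backward = near_best_hits(b_vs_a, f)
    return {
        (x, y)
        for x, targets in forward.items()
        for y in targets
        if x in backward.get(y, set())
    }


# Hypothetical scores: x1/x1b and y1/y1b are divergent co-orthologs whose best
# hits are not reciprocal; x2-y2 and x3-y3 are clearly separated pairs.
A_VS_B = {("x1", "y1"): 100, ("x1", "y1b"): 98,
          ("x1b", "y1b"): 100, ("x1b", "y1"): 98,
          ("x2", "y2"): 300, ("x2", "y3"): 120,
          ("x3", "y3"): 300, ("x3", "y2"): 120}
B_VS_A = {("y1", "x1b"): 100, ("y1", "x1"): 98,
          ("y1b", "x1"): 100, ("y1b", "x1b"): 98,
          ("y2", "x2"): 300, ("y2", "x3"): 120,
          ("y3", "x3"): 300, ("y3", "x2"): 120}

strict = reciprocal_pairs(A_VS_B, B_VS_A, f=1.0)
adaptive = reciprocal_pairs(A_VS_B, B_VS_A, f=0.95)
```

With these made-up scores, the strict heuristic misses the co-ortholog pairs entirely, while the relative cutoff recovers them and still keeps x₂/y₂ apart from x₃/y₃, which a fixed "top two hits" rule would not.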

**Figure 3**
**Benchmarks**. CPU time and memory requirements of Proteinortho. (a) The speed benchmark was performed using an *E. coli* strain with 4132 proteins on an eight-core Intel Xeon system at 2.33 GHz, using one thread (1) and eight threads (8). The encoded proteins were used multiple times to simulate multiple (identical) species. This is the worst-case scenario for Proteinortho, since every protein then has a link to at least one protein in every other species. Proteinortho is significantly faster than OrthoMCL, and using multiple threads yields a substantial speed-up. (b) The memory benchmark was performed using the same set as in (a). OrthoMCL quickly exhausts memory for larger sets. Proteinortho is clearly more memory-efficient, even though this artificial scenario is more complex than real-world analyses. Both benchmarks indicate that Proteinortho enables comprehensive studies that were not possible before.

**Figure 4**
**Coverage completion**. Comparison of the results of Proteinortho, using different thresholds of the normalized algebraic connectivity, with the COG database and OrthoMCL for a dataset consisting of 16 randomly chosen bacterial proteomes. The vertical dashed line marks the transition from clusters containing mainly a single ortholog from each species to sets including co-orthologs. The COG database reports many large groups, which often include co-orthologous proteins. OrthoMCL and Proteinortho focus on highly connected subsets in order to find orthologous sets and thus split those groups. Proteinortho's clustering algorithm becomes more stringent with increasing values of the connectivity threshold, splitting large groups in particular. While these groups are left intact at a low threshold, thresholds of 0.5 and higher drastically reduce the fraction of included co-orthologs.
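
To make the threshold concrete, the following sketch computes a normalized algebraic connectivity for a small graph. The caption does not define the normalization, so the code assumes it is the second-smallest Laplacian eigenvalue divided by the number of nodes, which gives 1 for a fully connected group. The pure-Python Jacobi eigenvalue routine is only a stand-in for a numerical library, and all function names are hypothetical.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=100, tol=1e-20):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def normalized_algebraic_connectivity(nodes, edges):
    """Second-smallest Laplacian eigenvalue divided by the node count.

    Assumes an undirected simple graph with at least two nodes.
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    laplacian = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        laplacian[i][j] -= 1.0
        laplacian[j][i] -= 1.0
        laplacian[i][i] += 1.0
        laplacian[j][j] += 1.0
    return symmetric_eigenvalues(laplacian)[1] / n


# A fully connected group of four proteins versus a loosely connected chain.
clique = normalized_algebraic_connectivity(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")],
)
chain = normalized_algebraic_connectivity(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

Under this assumed definition, a group would be split whenever its value falls below the chosen threshold, so the chain of three proteins (value 1/3) would be split at a threshold of 0.5 while the clique would be kept.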

**Figure 5**
**Comparison of results**. Comparison of OrthoMCL and Proteinortho to the COG database. The following assignments were defined: identity: the group equals a COG group; subset: the group is a subset of a COG group and shares at least two proteins with it; superset: the group is a superset of a COG group and shares at least two proteins with it; new: none of the above criteria matched. Both tools yield comparable results with respect to the manually curated COG database. OrthoMCL covers more identical and differently composed groups, while Proteinortho is more restrictive and reports fewer new groups that are not present in the COG database. All groups with fewer than six species were omitted from the OrthoMCL and Proteinortho data. See Additional File 2 for comparisons with different minimal coverage.
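
The four assignments translate directly into set comparisons. The sketch below is one possible reading of the definitions: the function name is hypothetical, and checking the categories in the order identity, subset, superset, new is an assumption, since the caption does not say how ties are resolved.

```python
def classify_group(group, cog_groups):
    """Assign a predicted group to identity, subset, superset, or new."""
    group = set(group)
    cogs = [set(cog) for cog in cog_groups]
    if any(group == cog for cog in cogs):
        return "identity"
    # Proper subset of a COG group, sharing at least two proteins with it.
    if any(group < cog and len(group & cog) >= 2 for cog in cogs):
        return "subset"
    # Proper superset of a COG group, sharing at least two proteins with it.
    if any(group > cog and len(group & cog) >= 2 for cog in cogs):
        return "superset"
    return "new"


# Hypothetical COG groups for illustration only.
COGS = [{"p1", "p2", "p3"}, {"p4", "p5"}]
```

For example, `classify_group({"p1", "p2"}, COGS)` gives `"subset"`, while a group that only partially overlaps a COG group, such as `{"p1", "p4"}`, is reported as `"new"`.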

**Figure 6**
**Orthologous groups**. Number of orthologous groups present in nearly all bacterial species. The dashed line represents Proteinortho results based on the NCBI annotation. Using tblastn, the annotation was complemented with high-scoring genomic matches (solid curve). Note that this is a cumulative plot, i.e., each group of co-orthologs present in x species is also included in the count of groups contained in x′ < x species.
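
The cumulative counting described in this caption can be sketched as follows. The function name and the input numbers are hypothetical and chosen only to show the bookkeeping.

```python
from collections import Counter


def cumulative_group_counts(species_per_group, n_species):
    """For each x, count the groups present in at least x species."""
    exact = Counter(species_per_group)
    counts = {}
    running = 0
    # Walk from the largest coverage downwards so that a group present in
    # x species is also counted for every smaller x.
    for x in range(n_species, 0, -1):
        running += exact[x]
        counts[x] = running
    return counts


# Hypothetical data: four groups covering 5, 5, 4 and 2 of 5 species.
counts = cumulative_group_counts([5, 5, 4, 2], n_species=5)
```

Here `counts[5]` is 2 and `counts[4]` is 3, because the two groups present in all five species are also counted among those present in at least four.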
