BMC Bioinformatics. 2011 Apr 28;12:124.
doi: 10.1186/1471-2105-12-124.

Proteinortho: detection of (co-)orthologs in large-scale analysis

Marcus Lechner et al. BMC Bioinformatics.

Abstract

Background: Orthology analysis is an important part of data analysis in many areas of bioinformatics, such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools, because brute-force approaches with quadratic memory requirements become infeasible in practice. Furthermore, the rapid pace at which new data become available makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases.

Results: The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes.

Conclusions: Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.


Figures

Figure 1
Orthology relations. Idealized dataset for two species A and B. Proteins x ∈ A and y ∈ B are depicted by open boxes. Orthology relations between proteins x and y are represented by grey shadows. Arrows indicate alignments above a certain cut-off from the search of x against B. Solid lines refer to the best alignments. Cases (1), (2), and (3) cannot occur by definition in an idealized dataset, but they do appear in real-life applications.
Figure 2
Adaptive RBAH. Reciprocal best alignment heuristic (RBAH). (a) If there is a pair of divergent co-orthologs in each species (x1 and x1′, and y1 and y1′, respectively), it is possible that there are no reciprocal best blast alignments. In this situation, RBAH will not identify any orthologs. (b) One possible remedy is to include the second-best blast alignment (n = 2). In this case, however, highly similar orthologs (x2 and y2 as well as x3 and y3), which in principle can be clearly separated, can get combined. (c) Proteinortho uses an adaptive approach that (1) is flexible with respect to the number of more diverged orthologs in the absence of a reciprocal best blast alignment and (2) will not intermix orthologous groups that can be disentangled easily because of large differences in pairwise similarity.
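The adaptive idea in panel (c) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Proteinortho's actual implementation: the function names, the relative cut-off parameter `f`, and the toy bit scores are all hypothetical. A hit is kept if its score is within a factor `f` of the query's best hit, and a pair is reported only if it is kept in both search directions.

```python
def adaptive_best_hits(scores, f):
    """For each query, keep every target scoring at least f * (best score).

    `scores` maps query -> {target: score}. `f` is an assumed relative
    cut-off in (0, 1]; f = 1.0 reduces to the plain best-hit rule.
    """
    hits = {}
    for query, targets in scores.items():
        if not targets:
            continue
        best = max(targets.values())
        hits[query] = {t for t, s in targets.items() if s >= f * best}
    return hits


def reciprocal_pairs(a_vs_b, b_vs_a, f=0.95):
    """Return pairs (x, y) that are adaptive best hits in both directions."""
    forward = adaptive_best_hits(a_vs_b, f)
    backward = adaptive_best_hits(b_vs_a, f)
    return {
        (x, y)
        for x, ys in forward.items()
        for y in ys
        if x in backward.get(y, set())
    }


# Toy scores (invented for illustration). x1/x1p and y1/y1p are divergent
# co-orthologs whose best hits form a cycle, so no reciprocal best hit exists.
# x2/y2 and x3/y3 are well separated from each other.
a_vs_b = {
    "x1": {"y1": 100, "y1p": 97},
    "x1p": {"y1p": 100, "y1": 98},
    "x2": {"y2": 200, "y3": 120},
    "x3": {"y3": 200, "y2": 120},
}
b_vs_a = {
    "y1": {"x1p": 100, "x1": 97},
    "y1p": {"x1": 100, "x1p": 98},
    "y2": {"x2": 200, "x3": 120},
    "y3": {"x3": 200, "x2": 120},
}

strict = reciprocal_pairs(a_vs_b, b_vs_a, f=1.0)
adaptive = reciprocal_pairs(a_vs_b, b_vs_a, f=0.95)
```

With `f=1.0` only the well-separated pairs (x2, y2) and (x3, y3) are found. With `f=0.95` the four pairs among the divergent co-orthologs are recovered as well, while x2/y3 and x3/y2 stay apart because their scores are far below the respective best hits.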
Figure 3
Benchmarks. CPU time and memory requirements of Proteinortho. (a) The speed benchmark was performed using an E. coli strain with 4132 proteins on an eight-core Intel Xeon system at 2.33 GHz, using one thread (1) and eight threads (8). The encoded proteins were used multiple times to simulate multiple (identical) species. This is the worst-case scenario for Proteinortho, since every protein then has a link to at least one protein in every other species. Proteinortho is significantly faster than OrthoMCL, and using multiple threads yields a substantial speed-up. (b) The memory benchmark was performed using the same set as in (a). OrthoMCL quickly exhausts memory for larger sets. Proteinortho is clearly more memory-efficient, even though this artificial scenario is more complex than a real-world analysis. Both benchmarks show that Proteinortho allows comprehensive studies that were not possible before.
Figure 4
Coverage completion. Comparison of the results of Proteinortho, at different thresholds of the normalized algebraic connectivity, with the COG database and OrthoMCL for a dataset consisting of 16 randomly chosen bacterial proteomes. The vertical dashed line marks the transition from clusters containing mainly a single ortholog from each species to sets including co-orthologs. The COG database reports many large groups, which often include co-orthologous proteins. OrthoMCL and Proteinortho focus on highly connected subsets in order to find orthologous sets and thus split those groups. Proteinortho's clustering algorithm becomes more stringent with increasing values of this threshold, splitting large groups in particular. While these groups are left intact for [formula image], thresholds of 0.5 and higher drastically reduce the fraction of included co-orthologs.
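To make the connectivity threshold concrete, here is a minimal pure-Python sketch of one way to compute a normalized algebraic connectivity for a candidate group. It rests on assumptions not stated in the text above: the algebraic connectivity is taken as the second-smallest eigenvalue of the graph Laplacian, the normalization divides by the number of nodes (so that a complete graph scores 1), and the eigenvalue is estimated by shifted power iteration. The function name and parameters are hypothetical.

```python
import random


def normalized_algebraic_connectivity(n, edges, iterations=1000, seed=0):
    """Estimate lambda_2(L) / n for an undirected graph on nodes 0..n-1.

    L is the graph Laplacian and lambda_2 its second-smallest eigenvalue.
    Dividing by n is an assumed normalization (a complete graph gives 1.0).
    """
    if n < 2:
        return 0.0
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    degree = [len(a) for a in adj]
    if max(degree) == 0:
        return 0.0
    # All Laplacian eigenvalues are <= 2 * max degree, so shift*I - L is
    # positive definite and its dominant eigenvector (orthogonal to the
    # all-ones vector) belongs to lambda_2.
    shift = 2 * max(degree) + 1

    def project_and_normalize(vec):
        mean = sum(vec) / n
        centered = [a - mean for a in vec]  # remove the all-ones component
        norm = sum(a * a for a in centered) ** 0.5
        return [a / norm for a in centered]

    rng = random.Random(seed)
    x = project_and_normalize([rng.random() for _ in range(n)])
    for _ in range(iterations):
        # y = (shift*I - L) x, where (L x)_i = deg_i * x_i - sum of neighbours
        y = [
            (shift - degree[i]) * x[i] + sum(x[j] for j in adj[i])
            for i in range(n)
        ]
        x = project_and_normalize(y)
    # Rayleigh quotient x^T L x for a unit vector: sum over edges (x_u - x_v)^2
    lambda_2 = sum(
        (x[u] - x[v]) ** 2 for u in range(n) for v in adj[u] if u < v
    )
    return lambda_2 / n


complete_4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
two_triangles = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

dense = normalized_algebraic_connectivity(4, complete_4)
bridged = normalized_algebraic_connectivity(6, two_triangles)
```

Under these assumptions a densely connected group scores close to 1, whereas two tight clusters joined by a single edge score far lower and would be a candidate for splitting once the threshold is raised.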
Figure 5
Comparison of results. Comparison of OrthoMCL and Proteinortho to the COG database. The following assignments were defined: identity: the group equals a COG group; subset: the group is a subset of a COG group, with at least two proteins in common; superset: the group is a superset of a COG group, with at least two proteins in common; new: none of the above criteria matched. Both tools yield comparable results with respect to the manually curated COG database. OrthoMCL covers more identical and differently composed groups, while Proteinortho is more restrictive and reports fewer new groups that are not present in the COG database. All groups with fewer than six species were omitted from the OrthoMCL and Proteinortho data. See Additional File 2 for comparisons with different minimal coverage.
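The four assignment categories can be written down as a small classifier. The sketch below is illustrative only: the function name is hypothetical, and the order in which the categories are checked (identity, then subset, then superset) is an assumption, since the caption does not say how a group matching several criteria is resolved.

```python
def classify_group(group, cog_groups, min_shared=2):
    """Assign a predicted group to one of the caption's four categories.

    `group` is an iterable of protein identifiers; `cog_groups` is an
    iterable of reference groups. The precedence identity > subset >
    superset > new is an assumption made for this sketch.
    """
    group = frozenset(group)
    cogs = [frozenset(c) for c in cog_groups]
    if any(group == c for c in cogs):
        return "identity"
    # Proper subset of a reference group, sharing at least min_shared proteins
    if any(group < c and len(group & c) >= min_shared for c in cogs):
        return "subset"
    # Proper superset of a reference group, sharing at least min_shared proteins
    if any(group > c and len(group & c) >= min_shared for c in cogs):
        return "superset"
    return "new"


# Toy reference groups with invented protein identifiers.
reference = [{"a", "b", "c"}, {"d", "e"}]
labels = [
    classify_group({"a", "b", "c"}, reference),
    classify_group({"a", "b"}, reference),
    classify_group({"d", "e", "f"}, reference),
    classify_group({"a", "d"}, reference),
]
```

A group that overlaps two reference groups without containing or being contained in either, such as {"a", "d"} here, falls into the "new" category.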
Figure 6
Orthologous groups. Number of orthologous groups present in nearly all bacterial species. The dashed line represents Proteinortho results based on the NCBI annotation. Using tblastn, the annotation was complemented with high-scoring genomic matches (solid curve). Note that this is a cumulative plot, i.e., each group of co-orthologs present in x species is also included in the count of groups contained in x' < x species.
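The cumulative counting described in the caption can be illustrated as follows. This is a hypothetical helper with invented toy numbers, not data from the study: for each k it counts the groups that are present in at least k species.

```python
from collections import Counter


def cumulative_group_counts(species_per_group, n_species):
    """Map k -> number of groups present in at least k species.

    `species_per_group` lists, for every group, how many species it covers.
    """
    exact = Counter(species_per_group)
    counts = {}
    running = 0
    # Walk from the largest coverage down so each group is also counted
    # for every smaller k.
    for k in range(n_species, 0, -1):
        running += exact.get(k, 0)
        counts[k] = running
    return counts


# Toy input: six groups over five species (invented values).
toy_counts = cumulative_group_counts([5, 5, 4, 3, 3, 3], n_species=5)
```

In this toy input the two groups covering all five species are counted at every k, so the curve can only stay level or rise as k decreases.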
