BMC Bioinformatics. 2015 May 13;16:154.
doi: 10.1186/s12859-015-0570-8.

Domain similarity based orthology detection


Tristan Bitard-Feildel et al. BMC Bioinformatics. 2015.

Abstract

Background: Orthologous protein detection software mostly relies on pairwise comparisons of amino-acid sequences to decide whether two proteins are orthologous. Consequently, the number of comparisons grows quadratically with the number of sequences. Given the increasing number of sequenced organisms, a current challenge in bioinformatics is to keep this ever-growing number of comparisons computationally feasible in a reasonable amount of time. We propose to speed up the detection of orthologous proteins by using strings of domains to characterize the proteins.
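The quadratic growth mentioned above can be made concrete with a small sketch. It counts unordered pairs for an all-against-all comparison and for the case where pairs are only formed inside pre-computed groups. The function names and the example sizes (10,000 proteins, 100 groups of 100) are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def grouped_comparisons(group_sizes: list[int]) -> int:
    """Comparisons needed when pairs are only formed inside each group."""
    return sum(pairwise_comparisons(size) for size in group_sizes)


# Hypothetical example: 10,000 proteins, compared all-against-all
# or first split into 100 groups of 100 proteins each.
all_vs_all = pairwise_comparisons(10_000)
within_groups = grouped_comparisons([100] * 100)
print(all_vs_all, within_groups)
```

Under these assumed sizes, restricting comparisons to groups cuts the number of pairs by roughly two orders of magnitude, which is the kind of search-space reduction the abstract aims for.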

Results: We present two new protein similarity measures based on domain content similarity, a cosine score and a maximal weight matching score, and new software named porthoDom. The quality of both measures is evaluated against curated datasets. The results show that domain content similarity can correctly group proteins into their families. Accordingly, the cosine similarity measure is used inside porthoDom, a wrapper developed for proteinortho. porthoDom uses domain content similarity to group proteins together before searching for orthologs. Because domains are compared instead of amino acid sequences, the search space shrinks, which decreases the computational complexity of an all-against-all sequence comparison.
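A cosine score over domain content can be sketched as follows. This is a minimal illustration, not the porthoDom implementation: it assumes each protein is given as a list of domain identifiers, builds unweighted count vectors, and ignores domain order. The Pfam-style accessions in the example are used only as placeholder identifiers, and the weighting scheme evaluated in the paper is not reproduced here.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(domains_a: list[str], domains_b: list[str]) -> float:
    """Cosine of the angle between two domain-count vectors (0.0 if either is empty)."""
    counts_a, counts_b = Counter(domains_a), Counter(domains_b)
    # Only domains present in both proteins contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[d] * counts_b[d] for d in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder identifiers: one protein carries a repeated domain.
protein_a = ["PF00069", "PF00017", "PF00017"]
protein_b = ["PF00069", "PF00017"]
print(round(cosine_similarity(protein_a, protein_b), 4))
```

Proteins whose score exceeds a chosen cut-off could then be placed in the same group before the sequence-based ortholog search is run.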

Conclusion: We demonstrate that representing and comparing proteins as strings of discrete domains, i.e. as concatenations of their unique identifiers, drastically simplifies the search space. porthoDom speeds up orthology detection while maintaining a degree of accuracy similar to that of proteinortho. porthoDom is implemented in Python and C++ and is available under the GNU GPL licence version 3 at http://www.bornberglab.org/pages/porthoda .


Figures

Figure 1
ROC curves. ROC curves of the developed COS and MWM measures, and of the NC method against the SD dataset (panel a), the SD+ dataset (panel b) and the OB dataset (panel c). For each panel, the left plots correspond to the full ROC curves and the right plots to a zoomed in subsection along the x axis. COS O1, COS O2, MWM O1 and MWM O2 are evaluated with weighting (w) or without. The influence of the kinase family in the SD+ dataset on the sequence similarity based method (NC) is clearly seen in panel b.
Figure 2
Dotplot with domain visualisation of two proteins belonging to the PLUNC family (ENSRNOP00000052209 and ENSRNOP00000052216). The shaded areas correspond to the sequence identity between the two sequences. Although they share the exact same domain arrangement (DA), their sequence similarity is very low (20.8%). The alignment was run with the needle program of the EMBOSS package [35]. The dotplot was produced with the DoMosaics software [36].
Figure 3
Results of comparing porthoDom and proteinortho against the OrthoDB database. Different parameters are used for the domain content similarity step of porthoDom, and the default parameters of proteinortho are used for both methods. The parameters are: a domain content similarity cut-off of 0.5; a domain content similarity of O1, corresponding to single domain comparisons, or O2, corresponding to the comparison of pairs of domains; and an option to collapse, or not, tandem domain repeats. The different parameters have little influence on porthoDom due to the robustness of the domain content similarity method.
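The two preprocessing options named in this caption can be illustrated with a short sketch. It is written under stated assumptions rather than taken from porthoDom: "collapsing tandem repeats" is read as replacing each run of consecutive identical domains with a single copy, and O2 is read as pairs of adjacent domains. The function names and the single-letter domain labels are hypothetical.

```python
from itertools import groupby


def collapse_tandem_repeats(domains: list[str]) -> list[str]:
    """Replace each run of consecutive identical domains with a single copy."""
    return [domain for domain, _run in groupby(domains)]


def domain_features(domains: list[str], order: int = 1) -> list[tuple[str, ...]]:
    """Return single domains (order=1) or adjacent domain pairs (order=2)."""
    if order not in (1, 2):
        raise ValueError("order must be 1 (single domains) or 2 (domain pairs)")
    return [tuple(domains[i:i + order]) for i in range(len(domains) - order + 1)]


# Hypothetical arrangement with a tandem repeat of domain "B".
arrangement = ["A", "B", "B", "B", "C"]
collapsed = collapse_tandem_repeats(arrangement)
print(collapsed)                      # single copy of each run
print(domain_features(collapsed, 2))  # adjacent pairs, assumed O2 reading
```

The resulting feature lists could be counted and compared with a cosine score, so that O1 and O2 differ only in which features enter the count vectors.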

References

    1. Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, et al. TreeFam: 2008 Update. Nucleic Acids Res. 2008;36(Database issue):735–40.
    2. Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Marcet-Houben M, Gabaldon T. PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res. 2014;42(Database issue):897–902. doi: 10.1093/nar/gkt1177.
    3. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13(9):2178–89. doi: 10.1101/gr.1224503.
    4. Lechner M, Findeiss S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics. 2011;12:124. doi: 10.1186/1471-2105-12-124.
    5. Powell S, Forslund K, Szklarczyk D, Trachana K, Roth A, Huerta-Cepas J, et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 2014;42(Database issue):231–9. doi: 10.1093/nar/gkt1253.
