PLoS One. 2012;7(7):e42057.
doi: 10.1371/journal.pone.0042057. Epub 2012 Jul 26.

Effect of reference genome selection on the performance of computational methods for genome-wide protein-protein interaction prediction

Vijaykumar Yogesh Muley et al. PLoS One. 2012.

Abstract

Background: Recent progress in computational methods for predicting physical and functional protein-protein interactions has provided new insights into the complexity of biological processes. Most of these methods assume that functionally interacting proteins are likely to share an evolutionary history. This history can be traced for the protein pairs of a query genome by correlating different evolutionary aspects of their homologs across multiple genomes, known as the reference genomes. These methods include phylogenetic profiling, gene neighborhood, and co-occurrence of orthologous protein-coding genes in the same cluster or operon; they are collectively known as genomic context methods. A further method, mirrortree, is based on the similarity of the phylogenetic trees of two interacting proteins. Comprehensive performance analyses of these methods have frequently been reported in the literature. However, very few studies provide insight into the effect of reference genome selection on the detection of meaningful protein interactions.
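The idea behind phylogenetic profiling can be illustrated with a minimal sketch. It assumes a binary profile (1 if a homolog is present in a reference genome, 0 otherwise) and uses Pearson correlation as the similarity measure; the abstract does not state which measure the authors used, and the genome and protein names are hypothetical.

```python
from math import sqrt

def binary_profile(genomes_with_homolog, reference_genomes):
    """Return a 0/1 vector: 1 where the protein has a homolog in that genome."""
    return [1 if g in genomes_with_homolog else 0 for g in reference_genomes]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical reference genome set and homolog occurrences (illustration only).
reference_genomes = ["g1", "g2", "g3", "g4", "g5", "g6"]
prot_a = binary_profile({"g1", "g2", "g5"}, reference_genomes)
prot_b = binary_profile({"g1", "g2", "g5"}, reference_genomes)
prot_c = binary_profile({"g3", "g4", "g6"}, reference_genomes)

score_ab = pearson(prot_a, prot_b)  # identical profiles: candidate interaction
score_ac = pearson(prot_a, prot_c)  # complementary profiles: anti-correlated
```

Changing the list of reference genomes changes every profile vector, which is why the choice of reference set can affect the resulting scores.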

Methods: We analyzed the performance of four methods and their variants to understand the effect of reference genome selection on prediction efficacy. We used six sets of reference genomes, sampled from 565 bacterial genomes according to phylogenetic diversity and the relationships between organisms. We used Escherichia coli as the model organism and the gold standard datasets of interacting proteins reported in the DIP, EcoCyc and KEGG databases to compare the performance of the prediction methods.
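Comparing predictions against a gold standard can be sketched as follows. The example summarizes a ranking by the area under the ROC curve, computed as the fraction of (positive, negative) pairs in which the positive pair scores higher. Using AUC as the summary statistic is an assumption made for illustration, and the scores and labels are made up.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical prediction scores for four protein pairs and their
# gold-standard labels (1 = known interaction, 0 = not in the gold standard).
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
```

Running the same evaluation once per reference genome set gives one value per set, which can then be compared across sets.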

Conclusions: Higher performance for predicting protein-protein interactions was achievable even with 100-150 of the 565 bacterial genomes. Inclusion of archaeal genomes in the reference genome set improves performance. We find that, to obtain good performance, it is better to sample a few genomes from related genera of prokaryotes out of the large number of available genomes. Moreover, such sampling allows 50-100 genomes to be selected for comparable prediction accuracy when computational resources are limited.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. ROC curves for six reference genome sets using Phylogenetic Profiling Methods.
The solid lines depict the phylogenetic profile constructed using normalized bit scores (SPPM), whereas the dotted lines depict the binary phylogenetic profile (BPPM). The colors of the lines correspond to the six reference genome sets (ALL, BAAC, BAS, BAC, GAMMA and BANR) for which performance was evaluated. As is evident in the figure, SPPM performs better than BPPM for all reference genome sets. The ROC curves clearly show that reference genome selection has a more profound influence on the performance of BPPM than on that of SPPM.
Figure 2. ROC curves for six reference genome sets using Gene Cluster Method.
The colors of the lines correspond to the six reference genome sets (ALL, BAAC, BAS, BAC, GAMMA and BANR) for which performance was evaluated. The GAMMA reference genome set performs relatively better.
Figure 3. ROC curves for six reference genome sets using Minimum Distance Method.
The colors of the lines correspond to the six reference genome sets (ALL, BAAC, BAS, BAC, GAMMA and BANR) for which performance was evaluated. The ROC plot shows that the method is robust to the choice of reference genome set: all reference sets performed equally well.
Figure 4. ROC curves for four reference genome sets using Mirrortree based methods.
Two variants of the mirrortree method are used here: Tol-mirrortree and GD-mirrortree. Tol-mirrortree (dotted lines) uses the 16S rRNA distance between two genomes as a factor to correct the phylogenetic distance, whereas GD-mirrortree (solid lines) uses a genomic distance parameter reflecting the shared orthologs between two genomes to correct the corresponding phylogenetic distance (see Methods for details). The colors of the lines correspond to the four reference genome sets (BAS, BAC, GAMMA and BANR) for which performance was evaluated. The plot clearly shows that GD-mirrortree is superior to Tol-mirrortree for these four reference genome sets. BAS and BAC perform better than GAMMA and BANR with a comparable level of accuracy.
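
The mirrortree idea in the Figure 4 caption can be sketched as a correlation between the pairwise evolutionary distances of two protein families across the same reference species. This is a minimal sketch, not the authors' implementation: the optional background correction subtracts a species-level distance (for example one derived from 16S rRNA) from each protein distance, which is one simple way to apply such a correction and is an assumption here. All species names and distances are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def pairwise_vector(dist, species):
    """Flatten a distance table keyed by frozenset({a, b}) in a fixed pair order."""
    return [dist[frozenset(pair)] for pair in combinations(species, 2)]

def mirrortree_score(dist_a, dist_b, species, background=None):
    """Correlate two families' distance vectors, optionally background-corrected."""
    vec_a = pairwise_vector(dist_a, species)
    vec_b = pairwise_vector(dist_b, species)
    if background is not None:
        # Assumed correction: subtract the species-level distance from each entry.
        bg = pairwise_vector(background, species)
        vec_a = [d - g for d, g in zip(vec_a, bg)]
        vec_b = [d - g for d, g in zip(vec_b, bg)]
    return pearson(vec_a, vec_b)

# Hypothetical species and distances (illustration only).
species = ["s1", "s2", "s3", "s4"]
pairs = [frozenset(p) for p in combinations(species, 2)]
tree_a = dict(zip(pairs, [0.1, 0.4, 0.5, 0.4, 0.5, 0.2]))
tree_b = {pair: 2 * d for pair, d in tree_a.items()}  # same shape, faster rate
background = dict(zip(pairs, [0.05, 0.3, 0.3, 0.3, 0.3, 0.1]))

raw_score = mirrortree_score(tree_a, tree_b, species)
corrected_score = mirrortree_score(tree_a, tree_b, species, background)
```

Only species present in the reference set contribute distance pairs, so a different reference set yields a different vector and potentially a different score.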
