BMC Bioinformatics. 2010 Nov 15:11:562.

doi: 10.1186/1471-2105-11-562.

An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology

Shobhit Jain¹, Gary D Bader

PMID: 21078182
PMCID: PMC2998529
DOI: 10.1186/1471-2105-11-562

Shobhit Jain et al. BMC Bioinformatics. 2010.

Affiliation

¹ Department of Computer Science, University of Toronto, 10 Kings College Road, Toronto, Ontario M5S3G4, Canada.

Abstract

Background: Semantic similarity measures are useful for assessing the physiological relevance of protein-protein interactions (PPIs). They quantify the similarity between proteins based on their function, using annotation systems such as the Gene Ontology (GO). Proteins that interact in the cell are more likely to be in similar locations or involved in similar biological processes than proteins that do not interact. Thus, the more semantically similar the gene function annotations of the interacting proteins are, the more likely the interaction is to be physiologically relevant. However, most semantic similarity measures used for PPI confidence assessment do not consider the unequal depth of term hierarchies in different classes of the cellular location, molecular function, and biological process ontologies of GO, and thus may over- or underestimate similarity.

Results: We describe an improved algorithm, Topological Clustering Semantic Similarity (TCSS), to compute semantic similarity between GO terms annotated to proteins in interaction datasets. Our algorithm considers the unequal depth of biological knowledge representation in different branches of the GO graph. The central idea is to divide the GO graph into sub-graphs and to score a PPI higher if the participating proteins belong to the same sub-graph than if they belong to different sub-graphs.
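The sub-graph idea can be sketched on a toy ontology. This is a minimal illustration only, not the published TCSS implementation: the ontology is a small tree rather than the GO graph, the sub-graph roots are picked by hand, and all names (`PARENT`, `SUBGRAPH_ROOTS`, `scoring_ancestor`) are hypothetical. The sketch only reports which common ancestor would be used for scoring and in which graph it lies; it does not reproduce the scoring formula itself.

```python
# Toy ontology as child -> parent links (a tree for simplicity; GO is a DAG).
PARENT = {
    "cytoplasm": "intracellular",
    "nucleus": "intracellular",
    "A": "cytoplasm",
    "B": "cytoplasm",
    "C": "nucleus",
}

# Assumption: sub-graph roots chosen by hand for this illustration.
SUBGRAPH_ROOTS = {"cytoplasm", "nucleus"}


def path_to_root(term):
    """Return the term followed by all of its ancestors up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def subgraph_root(term):
    """Return the nearest sub-graph root at or above the term, or None."""
    for node in path_to_root(term):
        if node in SUBGRAPH_ROOTS:
            return node
    return None


def lowest_common_ancestor(t1, t2):
    """Return the first node on t1's path to the root that is also above t2."""
    seen = set(path_to_root(t2))
    for node in path_to_root(t1):
        if node in seen:
            return node
    return None


def scoring_ancestor(t1, t2):
    """Pick the common ancestor used for scoring and name the graph it is in."""
    r1, r2 = subgraph_root(t1), subgraph_root(t2)
    if r1 is not None and r1 == r2:
        # Same sub-graph: the common ancestor is found inside that sub-graph.
        return lowest_common_ancestor(t1, t2), "sub-graph"
    # Different sub-graphs: replace each term by its sub-graph root and
    # look for the common ancestor in the higher-level graph.
    return lowest_common_ancestor(r1 or t1, r2 or t2), "higher-level graph"


print(scoring_ancestor("A", "B"))
print(scoring_ancestor("B", "C"))
```

In this toy setup, terms A and B share the sub-graph rooted at `cytoplasm`, while B and C meet only at `intracellular` in the higher-level graph, which mirrors the situation described for Figure 1.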

Conclusions: The TCSS algorithm performs better than the other semantic similarity measurement techniques that we evaluated, both in distinguishing true from false protein interactions and in correlation with gene expression and protein families. We show an average improvement of 4.6 times the F1 score over Resnik, the next best method, on our Saccharomyces cerevisiae PPI dataset and 2 times on our Homo sapiens PPI dataset using cellular component, biological process and molecular function GO annotations.

Figures

**Figure 1**
**Graphical illustration of the algorithm**. Nodes in the higher level graph and sub-graphs are shown by black and green circles, respectively. Root nodes of sub-graphs are shown by solid green circles and are equivalent to the corresponding higher level node. Terms A and B belong to the same sub-graph, therefore the semantic similarity score between them will be computed based on their common ancestor term 'Cytoplasm' (solid green). Terms B and C belong to different sub-graphs, therefore their semantic similarity score will be computed based on the common ancestor term 'Intracellular'.

**Figure 2**
**ROC curves for *S. cerevisiae* PPI dataset**. ROC evaluations of semantic similarity measures at different cutoffs based on the *S. cerevisiae* PPI dataset derived from DIP are shown. The evaluation was performed using the cellular component, biological process and molecular function ontologies of GO. The maximum (MAX) approach for combining multiple annotations was used on the dataset without electronic annotations (IEA-). TCSS and Resnik show the best ROC profiles for all three ontologies.
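The maximum (MAX) approach mentioned in the caption can be read as taking the best-scoring pair of terms across the two proteins' annotation sets. The sketch below assumes that reading; the term names and the similarity table `TERM_SIM` are made up for illustration and stand in for whatever term-level measure is being evaluated.

```python
from itertools import product

# Hypothetical term-to-term similarity scores (illustrative values only).
TERM_SIM = {
    frozenset(["t1", "t3"]): 0.2,
    frozenset(["t1", "t4"]): 0.7,
    frozenset(["t2", "t3"]): 0.5,
    frozenset(["t2", "t4"]): 0.1,
}


def term_similarity(a, b):
    """Look up the similarity of two terms; identical terms score 1.0."""
    if a == b:
        return 1.0
    return TERM_SIM.get(frozenset([a, b]), 0.0)


def max_similarity(terms_p, terms_q):
    """Protein-level score: the maximum over all cross-protein term pairs."""
    pair_scores = (term_similarity(a, b) for a, b in product(terms_p, terms_q))
    # default=0.0 covers proteins that have no annotations at all.
    return max(pair_scores, default=0.0)


print(max_similarity(["t1", "t2"], ["t3", "t4"]))
```

Taking the maximum means a single well-matched pair of annotations is enough to give a protein pair a high score, regardless of how many other annotations disagree.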

**Figure 3**
**F-score curves for *S. cerevisiae* PPI dataset**. F₁ score (harmonic mean of precision and recall) evaluations of semantic similarity measures at different cutoffs based on the *S. cerevisiae* PPI dataset derived from DIP are shown. The evaluation was performed using the cellular component, biological process, and molecular function ontologies of GO. The maximum (MAX) approach for combining multiple annotations was used on a dataset with only manual annotations, that is, without electronic annotations (IEA-). The F₁ score reaches its best value at 1 and its worst at 0. TCSS does better than Resnik for semantic similarity cutoff scores in all three ontologies.
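As a reminder of how an F₁ score at a given cutoff is obtained, here is a minimal sketch. It assumes that a protein pair is predicted to interact when its similarity score is at or above the cutoff; the scores and labels are invented toy data, not values from the study.

```python
def f1_at_cutoff(scores, labels, cutoff):
    """F1 score when pairs with score >= cutoff are predicted as interacting."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= cutoff and y)       # true positives
    fp = sum(1 for s, y in pairs if s >= cutoff and not y)   # false positives
    fn = sum(1 for s, y in pairs if s < cutoff and y)        # false negatives
    if tp == 0:
        # Precision or recall is zero (or undefined), so report the worst value.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy data: similarity scores and whether each pair is a true interaction.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [True, True, False, True]

for cutoff in (0.0, 0.5, 1.0):
    print(cutoff, f1_at_cutoff(scores, labels, cutoff))
```

Sweeping the cutoff over the score range and plotting the resulting values gives a curve of the kind the caption describes.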

**Figure 4**
**Correlation with gene expression and CESSM dataset**. (a) Pearson correlation between gene expression similarity and semantic similarity on an *S. cerevisiae* dataset containing 5,000 randomly selected protein pairs is shown. (b-d) Correlation between semantic similarity and sequence, enzyme commission (EC), and protein family (Pfam) similarity using the online CESSM tool. The evaluation was performed for the cellular component (CC), biological process (BP), and molecular function (MF) ontologies of GO using the maximum (MAX) approach for combining multiple GO annotations.

**Figure 5**
**Comparison of our topological clustering method and Resnik (MAX) for scoring positive and negative PPIs**. The scatter plot shows semantic similarity scores for positive (red) and negative (green) interactions. Semantic similarity scores range between 0.0 and 1.0 for both methods, with 1.0 being the best. A significant number of positive interactions are under-scored by Resnik (MAX) in all three ontologies compared to TCSS.

Grants and funding

MOP-84324/Canadian Institutes of Health Research/Canada
