BMC Bioinformatics. 2006 Jul 12;7:338.

doi: 10.1186/1471-2105-7-338.

Cluster analysis of protein array results via similarity of Gene Ontology annotation

Cheryl Wolting¹, C Jane McGlade, David Tritchler

PMID: 16836750
PMCID: PMC1539024
DOI: 10.1186/1471-2105-7-338

Affiliation

¹ Department of Medical Biophysics, University of Toronto, Toronto, Canada. cheryl.wolting@utoronto.ca

Abstract

Background: With the advent of high-throughput proteomic experiments such as arrays of purified proteins comes the need to analyse sets of proteins as an ensemble, as opposed to the traditional one-protein-at-a-time approach. Although several publicly available tools facilitate the analysis of protein sets, they either do not display integrated results in an easily interpreted image or do not allow the user to specify the proteins to be analysed.

Results: We developed a novel computational approach to analyse the annotation of sets of molecules. As proof of principle, we analysed two sets of proteins identified in published protein array screens. The distance between any two proteins was measured as the graph similarity between their Gene Ontology (GO) annotations. These distances were then clustered to highlight subsets of proteins sharing related GO annotation. In the first set, consisting of proteins found to bind small-molecule inhibitors of rapamycin, we identified three subsets of four or five proteins each that may help to elucidate how rapamycin affects cell growth, whereas the original authors chose only one novel protein from the array results for further study. In the second set, consisting of phosphoinositide-binding proteins, we identified subsets of proteins associated with different intracellular structures that were not highlighted by the analysis performed in the original publication.

Conclusion: By determining the distances between annotations, our methodology reveals trends and enrichment of proteins of particular functions within high-throughput datasets with higher sensitivity than perusal of end-point annotations. In an era of increasingly complex datasets, such tools will help in the formulation of new, testable hypotheses from high-throughput experimental data.

Figures

**Figure 1**
**Methodology for clustering a list of proteins by graph similarity of Gene Ontology annotation**. (A) The input to the methodology consists of a list of proteins and a selection of one aspect of the Gene Ontology, i.e., Biological Process (BP), Molecular Function (MF) or Cellular Component (CC). The Bioconductor method *simUI* is then employed to generate a matrix of graph similarities between each pair of proteins in the list. (B) The Bioconductor method *silcheck* uses the similarity matrix to select the number of clusters, k. The Bioconductor method *pam* uses the similarity matrix and k to cluster the proteins. (C) The clustering result is then examined in further detail to produce a biological interpretation of the GO annotation of the input list of proteins.
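
The workflow in panels (A) to (C) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R/Bioconductor code: distance is taken as 1 minus similarity, an exhaustive search over medoid sets stands in for *pam* (feasible only for very small inputs), and k is chosen by the largest average silhouette width as a stand-in for *silcheck*. The function names and the toy similarity matrix are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def assign(dist: Matrix, medoids: Sequence[int]) -> List[int]:
    """Label each item with the position (in `medoids`) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def total_cost(dist: Matrix, medoids: Sequence[int]) -> float:
    """Sum of distances from every item to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def k_medoids_exhaustive(dist: Matrix, k: int) -> Tuple[List[int], List[int]]:
    """Brute-force k-medoids: try every medoid set (assumption: tiny input)."""
    best = min(combinations(range(len(dist)), k),
               key=lambda ms: total_cost(dist, ms))
    return list(best), assign(dist, best)


def mean_silhouette(dist: Matrix, labels: Sequence[int]) -> float:
    """Average silhouette width over all items; singletons score 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


def choose_k_and_cluster(similarity: Matrix, k_values):
    """Return (score, k, medoids, labels) for the k with the best silhouette."""
    dist = [[1.0 - s for s in row] for row in similarity]
    best = None
    for k in k_values:
        medoids, labels = k_medoids_exhaustive(dist, k)
        score = mean_silhouette(dist, labels)
        if best is None or score > best[0]:
            best = (score, k, medoids, labels)
    return best


# Hypothetical similarity matrix: two groups of three closely related proteins.
SIM = [[1.0 if i == j else (0.8 if i // 3 == j // 3 else 0.1)
        for j in range(6)] for i in range(6)]

score, k, medoids, labels = choose_k_and_cluster(SIM, range(2, 5))
print(k, medoids, labels, round(score, 3))
```

For this toy matrix the procedure selects k = 2 and separates the two groups. For lists of realistic size, a proper PAM implementation with build and swap phases would replace the exhaustive search.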

**Figure 2**
**Graph similarity scoring method**. The induced GO graphs for two yeast proteins illustrate graph similarity scoring using the Bioconductor method *simUI*. (A) The GO terms GO:0045944 positive regulation of transcription from RNA polymerase II promoter and GO:0008654 phospholipid biosynthesis are assigned to INO4/YOL108C. (B) The GO term GO:0006355 regulation of transcription, DNA-dependent is assigned to RSC30/YHR056C. The graph similarity between these two proteins is calculated by dividing the number of terms that are found in both of the individual induced GO graphs (shared nodes in blue) by the number of unique terms in the two graphs combined. The graph similarity equals 20 shared nodes/40 unique nodes = 0.5.
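
The scoring rule in this caption (shared nodes divided by unique nodes across both induced graphs) can be sketched as follows. This is a hedged illustration of the rule as described, not the *simUI* source: the ontology is a small made-up DAG, and the term names, `TOY_PARENTS`, `induced_terms` and `graph_similarity` are hypothetical.

```python
from typing import Dict, Iterable, Set


def induced_terms(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given terms plus all of their ancestors in the ontology DAG."""
    seen: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


def graph_similarity(terms_a: Iterable[str], terms_b: Iterable[str],
                     parents: Dict[str, Set[str]]) -> float:
    """Shared nodes divided by unique nodes of the two induced graphs."""
    a = induced_terms(terms_a, parents)
    b = induced_terms(terms_b, parents)
    union = a | b
    if not union:
        return 0.0  # assumption: two unannotated proteins score 0
    return len(a & b) / len(union)


# Hypothetical toy ontology: child term -> set of parent terms.
TOY_PARENTS = {
    "root": set(),
    "regulation": {"root"},
    "metabolism": {"root"},
    "transcription_regulation": {"regulation"},
    "positive_transcription_regulation": {"transcription_regulation"},
    "lipid_biosynthesis": {"metabolism"},
}

protein_a = ["positive_transcription_regulation", "lipid_biosynthesis"]
protein_b = ["transcription_regulation"]

# Induced graph A has 6 nodes, B has 3, and all 3 of B's nodes are shared.
print(graph_similarity(protein_a, protein_b, TOY_PARENTS))  # 3 / 6 = 0.5
```

The toy example mirrors the figure's structure at a smaller scale: the protein with the more specific and more varied annotation has the larger induced graph, and the score is the fraction of all nodes that both graphs contain.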

**Figure 3**
**Silhouette plots of PAM clustering results for Schreiber data set**. Silhouette plots of PAM clustering results for 37 rapamycin-inhibitor binding proteins for GO (A) BP, (B) MF and (C) CC. Proteins assigned either the unknown term from each GO aspect (GO:0000004 biological process unknown, GO:0005554 molecular function unknown and GO:0008372 cellular component unknown) or using the evidence code Inferred from Electronic Annotation were not included in the clustering. Therefore, 30 proteins were clustered in BP, 31 in MF and 32 in CC. The silhouette width for the entire set (average silhouette width, s_i^D) is found at the top of each figure, whereas the silhouette width for each cluster (s_i^C) is found on the right-hand side of the figure with the cluster number (left of the colon) and number of proteins in each cluster (right of the colon). Each cluster is labelled with the GO annotation of the medoid, except BP cluster 4 as the text did not fit on the figure. Each protein is represented by a bar and the width of each bar represents the silhouette width for that protein (s_i). * GO annotation for BP cluster 4 is GO:0046856 phosphoinositide dephosphorylation, GO:0048017 inositol lipid-mediated signaling and GO:0030476 spore wall assembly (sensu Fungi).
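
For readers unfamiliar with the three silhouette quantities named in this caption, the standard definitions are given below. This assumes the paper follows the usual silhouette formulation; the symbols a_i, b_i, d(i, j), C(i) and n are introduced here for illustration and do not appear in the caption.

```latex
% a_i: mean distance from protein i to the other members of its own cluster C(i)
% b_i: smallest mean distance from protein i to the members of any other cluster
\[
a_i = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \neq i} d(i, j),
\qquad
b_i = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j)
\]
% Silhouette width of protein i (conventionally 0 when C(i) is a singleton)
\[
s_i = \frac{b_i - a_i}{\max(a_i,\, b_i)}, \qquad -1 \le s_i \le 1
\]
% Cluster and data-set averages, with n the number of clustered proteins
\[
s_i^{C} = \frac{1}{|C|} \sum_{i \in C} s_i,
\qquad
s_i^{D} = \frac{1}{n} \sum_{i=1}^{n} s_i
\]
```

Under these definitions, a bar close to 1 marks a protein that sits well inside its cluster, a bar near 0 marks a protein lying between two clusters, and a negative bar suggests the protein may fit a neighbouring cluster better.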

**Figure 4**
**Induced GO graphs for one cluster from each GO aspect for the Schreiber data set**. Induced GO graphs containing the BP, MF or CC annotation for the proteins found in Schreiber (A) BP cluster 2, (B) MF cluster 2, and (C) CC cluster 8, respectively. Nodes found in all of the individual induced GO graphs for the proteins in the cluster are shown in blue. The silhouette width for each cluster (s_i^C) is shown in the upper right hand corner. The medoid protein for each cluster is italicized and underlined.

**Figure 5**
**Silhouette plots of PAM clustering results for Snyder data set**. Silhouette plots of PAM clustering results for 91 phospholipid binding proteins for GO (A) BP, (B) MF and (C) CC. Proteins assigned either the unknown term from each GO aspect (GO:0000004 biological process unknown, GO:0005554 molecular function unknown and GO:0008372 cellular component unknown) or using the evidence code Inferred from Electronic Annotation were not included in the clustering. Therefore, 72 proteins were clustered in BP, 63 in MF and 78 in CC. The silhouette width for the entire set (average silhouette width, s_i^D) is found at the top of each figure, whereas the silhouette width for each cluster (s_i^C) is found on the right-hand side of the figure with the cluster number (left of the colon) and number of proteins in each cluster (right of the colon). Each cluster is labelled with the GO annotation of the medoid, except BP cluster 7 and CC clusters 3 and 6 as the text did not fit on the figure. Each protein is represented by a bar and the width of each bar represents the silhouette width for that protein (s_i). * GO annotation for BP cluster 7 is chromatin silencing [GO:0006342] and histone deacetylation [GO:0016575]. † GO annotation for CC cluster 3 is mitochondrial inner membrane [GO:0005743], integral to membrane [GO:0016021] and mitochondrial nucleoid [GO:0042645]. ‡ GO annotation for CC cluster 6 is plasma membrane [GO:0005886] and integral to membrane [GO:0016021].

**Figure 6**
**Induced GO graphs for one cluster from each GO aspect for the Snyder data set**. Induced GO graphs containing the BP, MF or CC annotation for the proteins found in Snyder (A) BP cluster 3, (B) MF cluster 2, and (C) CC cluster 3, respectively. Nodes found in all of the individual induced GO graphs for the proteins in the cluster are shown in blue. The silhouette width for each cluster (s_i^C) is shown in the upper right hand corner. The medoid protein for each cluster is italicized and underlined.

Cited by

Pesquita C, Faria D, Falcão AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009 Jul;5(7):e1000443. doi: 10.1371/journal.pcbi.1000443. PMID: 19649320. Review.
Huang SSY, Makhlouf M, AbouMoussa EH, Ruiz Tejada Segura ML, Mathew LS, Wang K, Leung MC, Chaussabel D, Logan DW, Scialdone A, Garand M, Saraiva LR. Differential regulation of the immune system in a brain-liver-fats organ network during short-term fasting. Mol Metab. 2020 Oct;40:101038. doi: 10.1016/j.molmet.2020.101038. PMID: 32526449.
Zeng J, Zhu S, Liew AW, Yan H. Multiconstrained gene clustering based on generalized projections. BMC Bioinformatics. 2010 Mar 31;11:164. doi: 10.1186/1471-2105-11-164. PMID: 20356386.
Wolting CD, Griffiths EK, Sarao R, Prevost BC, Wybenga-Groot LE, McGlade CJ. Biochemical and computational analysis of LNX1 interacting proteins. PLoS One. 2011;6(11):e26248. doi: 10.1371/journal.pone.0026248. PMID: 22087225.
Slater K, Williams JA, Karwath A, Fanning H, Ball S, Schofield PN, Hoehndorf R, Gkoutos GV. Multi-faceted semantic clustering with text-derived phenotypes. Comput Biol Med. 2021 Nov;138:104904. doi: 10.1016/j.compbiomed.2021.104904. PMID: 34600327.
