BMC Bioinformatics. 2006 Jul 12:7:338.
doi: 10.1186/1471-2105-7-338.

Cluster analysis of protein array results via similarity of Gene Ontology annotation

Cheryl Wolting et al. BMC Bioinformatics. 2006.

Abstract

Background: With the advent of high-throughput proteomic experiments such as arrays of purified proteins comes the need to analyse sets of proteins as an ensemble, as opposed to the traditional one-protein-at-a-time approach. Although there are several publicly available tools that facilitate the analysis of protein sets, they either do not display integrated results in an easily interpreted image or do not allow the user to specify the proteins to be analysed.

Results: We developed a novel computational approach to analyse the annotation of sets of molecules. As proof of principle, we analysed two sets of proteins identified in published protein array screens. The distance between any two proteins was measured as the graph similarity between their Gene Ontology (GO) annotations. These distances were then clustered to highlight subsets of proteins sharing related GO annotation. In the first set of proteins, found to bind small molecule inhibitors of rapamycin, we identified three subsets containing four or five proteins each that may help to elucidate how rapamycin affects cell growth, whereas the original authors chose only one novel protein from the array results for further study. In a set of phosphoinositide-binding proteins, we identified subsets of proteins associated with different intracellular structures that were not highlighted by the analysis performed in the original publication.

Conclusion: By determining the distances between annotations, our methodology reveals trends and enrichment of proteins of particular functions within high-throughput datasets at a higher sensitivity than perusal of end-point annotations. In an era of increasingly complex datasets, such tools will help in the formulation of new, testable hypotheses from high-throughput experimental data.

Figures

Figure 1
Methodology for clustering a list of proteins by graph similarity of Gene Ontology annotation. (A) The input to the methodology consists of a list of proteins and a selection of one aspect of the Gene Ontology, i.e., Biological Process (BP), Molecular Function (MF) or Cellular Component (CC). The Bioconductor method simUI is then employed to generate a matrix of graph similarities between each pair of proteins in the list. (B) The Bioconductor method silcheck uses the similarity matrix to select the number of clusters, k. The Bioconductor method pam uses the similarity matrix and k to cluster the proteins. (C) The clustering result is then examined in further detail to produce a biological interpretation of the GO annotation of the input list of proteins.
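The clustering step in panel (B) can be illustrated with a small medoid-based sketch. This is not the Bioconductor pam implementation: it assumes distance = 1 - similarity, starts from the first k items instead of using PAM's BUILD phase, and then applies greedy medoid swaps. The function names and the toy similarity matrix are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum of each item's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam_swap(dist, k):
    """Greedy medoid swapping on a square distance matrix (list of lists).

    Assumption: medoids are initialised to the first k items, which is a
    simplification of PAM's BUILD phase.
    """
    n = len(dist)
    medoids = list(range(k))
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                # Try replacing the medoid at `pos` with the candidate item.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    # Label each item with the index of its nearest medoid.
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels


# Hypothetical similarity matrix for four proteins forming two tight pairs.
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
# Assumed conversion from graph similarity to distance.
distance = [[1.0 - s for s in row] for row in similarity]

medoids, labels = pam_swap(distance, k=2)
print(medoids, labels)
```

On this toy matrix the first two proteins end up in one cluster and the last two in the other; which protein serves as the medoid within each pair is arbitrary here because the pairs are symmetric.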
Figure 2
Graph similarity scoring method. The induced GO graphs for two yeast proteins illustrate graph similarity scoring using the Bioconductor method simUI. (A) The GO terms GO:0045944 positive regulation of transcription from RNA polymerase II promoter and GO:0008654 phospholipid biosynthesis are assigned to INO4/YOL108C. (B) The GO term GO:0006355 regulation of transcription, DNA-dependent is assigned to RSC30/YHR056C. The graph similarity between these two proteins is calculated by dividing the number of terms found in both of the individual induced GO graphs (shared nodes, in blue) by the number of unique terms across the two graphs. The graph similarity equals 20 shared nodes/40 unique nodes = 0.5.
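The scoring rule in this caption (shared nodes divided by unique nodes across the two induced graphs) can be sketched as follows. This is an illustration of the ratio only, not the Bioconductor simUI code; the toy ontology uses hypothetical term names rather than real GO identifiers, and the `parents` mapping is an assumed representation of the ontology graph.

```python
def induced_graph(terms, parents):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def graph_similarity(terms_a, terms_b, parents):
    """Shared nodes divided by unique nodes of the two induced graphs."""
    graph_a = induced_graph(terms_a, parents)
    graph_b = induced_graph(terms_b, parents)
    union = graph_a | graph_b
    if not union:
        return 0.0
    return len(graph_a & graph_b) / len(union)


# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
}

# Protein X is annotated with "c", protein Y with "d".
# Induced graphs: {c, a, root} and {d, a, b, root};
# 2 shared nodes / 5 unique nodes = 0.4.
print(graph_similarity({"c"}, {"d"}, PARENTS))
```

Applied to the counts in the caption, the same ratio gives 20/40 = 0.5.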
Figure 3
Silhouette plots of PAM clustering results for Schreiber data set. Silhouette plots of PAM clustering results for 37 rapamycin-inhibitor binding proteins for GO (A) BP, (B) MF and (C) CC. Proteins assigned either the unknown term from each GO aspect (GO:0000004 biological process unknown, GO:0005554 molecular function unknown and GO:0008372 cellular component unknown) or using the evidence code Inferred from Electronic Annotation were not included in the clustering. Therefore 30 proteins were clustered in BP, 31 in MF and 32 in CC. The silhouette width for the entire set (average silhouette width, siD) is found at the top of each figure whereas the silhouette width for each cluster (siC) is found on the right-hand side of the figure with the cluster number (left of the colon) and number of proteins in each cluster (right of the colon). Each cluster is labelled with the GO annotation of the medoid, except BP cluster 4 as the text did not fit on the figure. Each protein is represented by a bar and the width of each bar represents the silhouette width for that protein (si). * GO annotation for BP cluster 4 is GO:0046856 phosphoinositide dephosphorylation, GO:0048017 inositol lipid-mediated signaling and GO:0030476 spore wall assembly (sensu Fungi).
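The three silhouette quantities named in this caption (si per protein, siC per cluster, siD for the whole set) can be computed from a distance matrix and cluster labels with the standard silhouette definition. The sketch below uses that textbook definition and a hypothetical four-protein distance matrix; it is not taken from the paper's code, and assigning a width of 0 to singleton clusters is a convention assumed here.

```python
def silhouette_widths(dist, labels):
    """Per-item silhouette width si = (b - a) / max(a, b).

    a: mean distance to the other members of the item's own cluster.
    b: smallest mean distance to the members of any other cluster.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            widths.append(0.0)  # assumed convention for singletons
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def cluster_averages(widths, labels):
    """Average silhouette width per cluster (siC)."""
    sums, counts = {}, {}
    for w, lab in zip(widths, labels):
        sums[lab] = sums.get(lab, 0.0) + w
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


# Hypothetical distances between four proteins in two clusters.
distance = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels = [0, 0, 1, 1]

si = silhouette_widths(distance, labels)
siC = cluster_averages(si, labels)
siD = sum(si) / len(si)  # average silhouette width for the entire set
print(si, siC, siD)
```

Widths near 1 indicate a protein that sits much closer to its own cluster than to any other, while widths near 0 or below indicate a borderline or poorly placed protein.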
Figure 4
Induced GO graphs for one cluster from each GO aspect for the Schreiber data set. Induced GO graphs containing the BP, MF or CC annotation for the proteins found in Schreiber (A) BP cluster 2, (B) MF cluster 2, and (C) CC cluster 8, respectively. Nodes found in all of the individual induced GO graphs for the proteins in the cluster are shown in blue. The silhouette width for each cluster (siC) is shown in the upper right-hand corner. The medoid protein for each cluster is italicized and underlined.
Figure 5
Silhouette plots of PAM clustering results for Snyder data set. Silhouette plots of PAM clustering results for 91 phospholipid binding proteins for GO (A) BP, (B) MF and (C) CC. Proteins assigned either the unknown term from each GO aspect (GO:0000004 biological process unknown, GO:0005554 molecular function unknown and GO:0008372 cellular component unknown) or using the evidence code Inferred from Electronic Annotation were not included in the clustering. Therefore 72 proteins were clustered in BP, 63 in MF and 78 in CC. The silhouette width for the entire set (average silhouette width, siD) is found at the top of each figure whereas the silhouette width for each cluster (siC) is found on the right-hand side of the figure with the cluster number (left of the colon) and number of proteins in each cluster (right of the colon). Each cluster is labelled with the GO annotation of the medoid, except BP cluster 7 and CC clusters 3 and 6 as the text did not fit on the figure. Each protein is represented by a bar and the width of each bar represents the silhouette width for that protein (si). * GO annotation for BP cluster 7 is chromatin silencing [GO:0006342] and histone deacetylation [GO:0016575]. † GO annotation for CC cluster 3 is mitochondrial inner membrane [GO:0005743], integral to membrane [GO:0016021] and mitochondrial nucleoid [GO:0042645]. ‡ GO annotation for CC cluster 6 is plasma membrane [GO:0005886] and integral to membrane [GO:0016021].
Figure 6
Induced GO graphs for one cluster from each GO aspect for the Snyder data set. Induced GO graphs containing the BP, MF or CC annotation for the proteins found in Snyder (A) BP cluster 3, (B) MF cluster 2, and (C) CC cluster 3, respectively. Nodes found in all of the individual induced GO graphs for the proteins in the cluster are shown in blue. The silhouette width for each cluster (siC) is shown in the upper right-hand corner. The medoid protein for each cluster is italicized and underlined.
