BMC Bioinformatics. 2010 May 19;11:265.

doi: 10.1186/1471-2105-11-265.

Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP

Troy Hawkins¹, Meghana Chitale, Daisuke Kihara

PMID: 20482861
PMCID: PMC2882935
DOI: 10.1186/1471-2105-11-265

Troy Hawkins et al. BMC Bioinformatics. 2010.

Affiliation

¹ Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA.

Abstract

Background: A new paradigm of biological investigation takes advantage of technologies that produce large high-throughput datasets, including genome sequences, protein interactions, and gene expression profiles. The ability of biologists to analyze and interpret such data relies on functional annotation of the included proteins, but even in highly characterized organisms many proteins lack the functional evidence necessary to infer their biological relevance.

Results: Here we have applied high confidence function predictions from our automated prediction system, PFP, to three genomes: Escherichia coli, Saccharomyces cerevisiae, and Plasmodium falciparum (the malaria parasite). PFP increases the fraction of annotated genes to over 90% in all three genomes. Using this broad annotation coverage, we introduce functional similarity networks, which represent the functional space of the proteomes. Four functional similarity networks are constructed for each proteome: one for each Gene Ontology (GO) category (Biological Process, Cellular Component, and Molecular Function) and one based on overall similarity as measured by the funSim score. The functional similarity networks are shown to have higher modularity than the protein-protein interaction networks. Moreover, the funSim score network differs from the single GO-score networks in showing a higher clustering degree exponent value, and thus has a higher tendency to be hierarchical. In addition, examining function assignments in the protein-protein interaction networks and in local regions of genomes identified numerous cases where subnetworks or local regions contain functionally coherent proteins. These results will help in interpreting protein interactions and gene order in a genome. Several examples of both analyses are highlighted.

Conclusion: The analyses demonstrate that applying high confidence predictions from PFP can have a significant impact on a researcher's ability to interpret the immense volume of biological data being generated today. The newly introduced functional similarity networks of the three organisms show network properties that differ from those of the protein-protein interaction networks.

Figures

**Figure 1**
**Protein-protein interaction networks used in this work**. Networks are visualized by Cytoscape [58]: A, *E. coli*; B, *S. cerevisiae*; C, *P. falciparum*.

**Figure 2**
**Enrichment of function annotation in protein-protein interaction networks**. The networks contain a total of 8565, 1376, and 2542 interactions for *E. coli*, *S. cerevisiae*, and *P. falciparum*, respectively. The fractions of interactions where neither protein is annotated (none), where one of the two proteins is annotated (one), and where both proteins are annotated (both) are shown for the original annotation in the GOA database and after adding high confidence function predictions by PFP. Enrichment in the three GO categories (BP, MF, and CC) is shown separately.
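The none/one/both breakdown described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `annotation_fractions`, the edge-list input format, and the toy protein identifiers are all assumptions made for the example.

```python
from collections import Counter

def annotation_fractions(interactions, annotated):
    """Return the fraction of interactions with none, one, or both partners annotated.

    `interactions` is a list of (protein_a, protein_b) pairs and `annotated`
    is a set of protein identifiers that carry at least one GO term in the
    category of interest. Both input formats are assumptions of this sketch.
    """
    labels = ("none", "one", "both")
    counts = Counter()
    for a, b in interactions:
        # True counts as 1, so the sum is 0, 1, or 2 annotated partners.
        counts[labels[(a in annotated) + (b in annotated)]] += 1
    total = len(interactions)
    return {label: counts[label] / total if total else 0.0 for label in labels}

# Toy data with hypothetical identifiers (not taken from the paper).
toy_interactions = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5")]
toy_annotated = {"p1", "p2", "p3"}
fractions = annotation_fractions(toy_interactions, toy_annotated)
```

Running the function once with the original annotation set and once with the set extended by predictions gives the before/after comparison that the figure plots for each GO category.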

**Figure 3**
**Functional similarity networks**. A, *E. coli*; B, *S. cerevisiae*; C, *P. falciparum*. From left to right, *BP-score*, *CC-score*, *MF-score*, and *funSim* matrices. Nodes represent individual proteins and edges represent a category *GOscore* or *funSim* of ≥ 0.95. Individual clusters in the functional similarity networks are highlighted in color to show the functional category of proteins. For the *BP-score* networks (left panels), green nodes represent proteins involved in transcription (GO:0006350 and its child nodes), blue nodes represent proteins involved in transport (GO:0006810), purple nodes represent proteins involved in pathogenesis (GO:0009405) (for *P. falciparum*, Fig. 3C) or signaling (GO:0007165) (for *E. coli* and yeast, Fig. 3A,B), and red nodes represent proteins involved in protein modification (GO:0043687). For the *CC-score* networks (the second panels from the left in Fig. 3), yellow nodes represent proteins localized in the membrane (GO:0016020), orange nodes represent proteins localized in the ribosome (GO:0005840), and blue nodes represent proteins localized in the cell wall (GO:0005618) (for *E. coli*) or in the nucleus (GO:0005634) (for malaria and yeast). For the *MF-score* networks (the second panels from the right), light green nodes represent proteins which bind ATP (GO:0005524), pink nodes represent proteins which bind rRNA (GO:0019843), light purple nodes represent proteins which bind ions (GO:0043167), and olive nodes represent proteins exhibiting transporter activity (GO:0005215). For the *funSim* networks (the panels on the right), burgundy nodes represent proteins which bind ATP (GO:0005524), blue nodes represent proteins localized in the ribosome (GO:0005840), and light green nodes represent proteins exhibiting transmembrane receptor activity (GO:0004888).
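The edge rule stated in this caption (connect two proteins when their score is ≥ 0.95) can be illustrated as follows. This is a sketch under assumptions: pairwise scores are taken as already computed and supplied in a dictionary, and the name `similarity_network` and the toy scores are hypothetical.

```python
from collections import defaultdict

def similarity_network(pair_scores, threshold=0.95):
    """Build an undirected adjacency map from precomputed pairwise scores.

    `pair_scores` maps (protein_a, protein_b) to a similarity score
    (a category GO score or funSim). An edge is kept when the score is
    at or above `threshold`; self-pairs are ignored.
    """
    adjacency = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if a != b and score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)

# Toy scores with hypothetical identifiers (not taken from the paper).
toy_scores = {
    ("p1", "p2"): 0.97,
    ("p1", "p3"): 0.95,
    ("p2", "p3"): 0.60,
    ("p3", "p4"): 0.94,
}
network = similarity_network(toy_scores)
```

Proteins with no score at or above the threshold (here `p4`) do not appear in the adjacency map, so they are treated as isolated and left out of the network.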

**Figure 4**
**Degree distribution of the functional similarity networks**. A similarity threshold value of 0.95 is used to define edges. The X-axis is the number of interactions, *k* (the degree), and the Y-axis is the probability of proteins with a certain number of interactions, *P(k)*. Both axes are log scaled. The dotted line is fit to the data to compute the degree exponent, γ, in the power-law degree distribution: *P(k)* ~ *k*^-γ. A, *E. coli*; B, *S. cerevisiae*; C, *P. falciparum*. From left to right, the *BP-score*, *CC-score*, *MF-score*, and the *funSim* score. The degree exponent values are shown in Table 3. The R² value of the fitted line for each distribution is as follows. *E. coli*: 0.579 (BP), 0.144 (CC), 0.472 (MF), 0.872 (funSim); *S. cerevisiae*: 0.585 (BP), 0.481 (CC), 0.505 (MF), 0.798 (funSim); *P. falciparum*: 0.466 (BP), 0.345 (CC), 0.068 (MF), 0.825 (funSim).
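One common way to obtain a degree exponent and an R² value like those quoted in this caption is an ordinary least-squares line through the log-log degree distribution. The caption does not state which fitting procedure was used, so the sketch below is an assumption, and `fit_degree_exponent` is a hypothetical name.

```python
import math
from collections import Counter

def fit_degree_exponent(degrees):
    """Fit log10 P(k) = c - gamma * log10(k) by ordinary least squares.

    `degrees` is a list with one degree per node. Nodes of degree 0 are
    dropped because log10(0) is undefined. Returns (gamma, r_squared).
    """
    positive = [k for k in degrees if k > 0]
    counts = Counter(positive)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    n = len(positive)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / n) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # For a straight-line fit, R^2 equals the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return -slope, r_squared

# Toy degree list whose counts fall off exactly as k^-2 (16, 4, 1 nodes).
toy_degrees = [1] * 16 + [2] * 4 + [4] * 1
gamma, r_squared = fit_degree_exponent(toy_degrees)
```

A least-squares fit on log-log axes is simple to reproduce but can be sensitive to sparsely populated high-degree bins, which is one reason a low R² may indicate that a power law describes the distribution poorly.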

**Figure 5**
**Hierarchical modularity of networks**. *C(k)* is plotted relative to *k*. A, the PPI networks; B, the *funSim* networks. The dotted lines correspond to *C(k)* ~ *k*^-1.
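The quantity *C(k)* plotted here is conventionally the mean local clustering coefficient of all nodes with degree *k*. A minimal sketch of that standard definition follows; the adjacency-map input format and the name `clustering_by_degree` are assumptions, and the graph is assumed to be undirected with no self-loops.

```python
from collections import defaultdict
from itertools import combinations

def clustering_by_degree(adjacency):
    """Return {k: mean local clustering coefficient of nodes with degree k}.

    `adjacency` maps each node to the set of its neighbors. Nodes with
    fewer than two neighbors are skipped because their clustering
    coefficient is undefined.
    """
    per_degree = defaultdict(list)
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        if k < 2:
            continue
        # Count edges that exist among the neighbors of this node.
        links = sum(
            1 for u, v in combinations(neighbors, 2) if v in adjacency.get(u, ())
        )
        per_degree[k].append(2 * links / (k * (k - 1)))
    return {k: sum(vals) / len(vals) for k, vals in sorted(per_degree.items())}

# Toy graph: a triangle a-b-c with one extra node d attached to a.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_of_k = clustering_by_degree(toy_graph)
```

Plotting the returned values against *k* on log-log axes and comparing the trend with a slope of -1 is the usual check for hierarchical organization.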

**Figure 6**
**The clustering degree exponent value of the functional similarity networks relative to the number of edges in the networks**. A, *E. coli*; B, *S. cerevisiae*; C, *P. falciparum*.

**Figure 7**
**The increase in the average score similarity of subnetworks of *P. falciparum***. The scores before and after adding function predictions by PFP to the 152 subnetworks are compared. A, *BP-score*; B, *CC-score*; C, *MF-score*; and D, *funSim* score.

**Figure 8**
**Protein-protein interaction subnetworks described in Table 3**. Proteins in the center of the subnetworks are: A, Q8I1Q4; B, Q8I206; C, Q8I255; D, Q8I562; E, Q8I5X5; F, Q8IKV2. Previously annotated proteins are colored red and proteins with functions predicted by PFP are colored yellow. Circular edges are self-interactions detected for the proteins. See Table 5 for function annotations of the proteins in these subnetworks.

**Figure 9**
**The accumulated fraction of genomic windows in *E. coli* that satisfy the similarity threshold values**. Results for the *funSim* score and individual *GO scores* are shown.

**Figure 10**
**Variability of functional similarity in the *E. coli* genome**. Functional similarity (Y-axis) here is an all-by-all category *GO score* or *funSim* average among the genes included in the local window. The X-axis is the genome position of the left-hand side of the window. The red line indicates the threshold value of functional similarity we used for individual analysis of a genome window for overrepresentation of GO terms (0.7 for each category *GO score* average, 0.49 for *funSim* average). The dots denote known clusters of functionally similar genes. For the BP graph, neon green is the *lac* operon, pink is the *trp* operon, and dark blue is the *his* operon. For the MF graph, dark red dots are ATP synthase components (atpX). And for the CC graph, dark green dots are proteins of the ribosome. The same plots for yeast and malaria genomes are not provided since they have much larger genomes (yeast and malaria have 16 and 14 chromosomes, respectively) but all the data are available on our website.
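The window statistic described in this caption (an all-by-all average of pairwise similarity among the genes in a local window, compared against a threshold) can be sketched as below. The window size, the scoring function, and the name `window_similarity` are assumptions of the example; the caption does not specify how many genes a window contains.

```python
from itertools import combinations

def window_similarity(genes, score, window_size):
    """Average all-by-all pairwise score for each window of consecutive genes.

    `genes` is a list in genome order, `score(a, b)` returns a pairwise
    similarity, and `window_size` (at least 2) is the number of genes per
    window. Entry i of the result belongs to the window starting at gene i.
    """
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    averages = []
    for start in range(len(genes) - window_size + 1):
        window = genes[start:start + window_size]
        pairs = list(combinations(window, 2))
        averages.append(sum(score(a, b) for a, b in pairs) / len(pairs))
    return averages

# Toy genome: hypothetical genes labeled with a made-up functional group.
toy_groups = {"g1": "A", "g2": "A", "g3": "A", "g4": "B", "g5": "B"}

def toy_score(a, b):
    # Stand-in for a GO score: 1.0 within a group, 0.25 across groups.
    return 1.0 if toy_groups[a] == toy_groups[b] else 0.25

averages = window_similarity(list(toy_groups), toy_score, window_size=3)
# Windows at or above the 0.7 threshold quoted for category GO scores.
flagged = [i for i, value in enumerate(averages) if value >= 0.7]
```

Only the first window, which contains three genes from the same toy group, passes the threshold; this mirrors how operon-like clusters stand out above the red line in the plot.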

Cited by

Quantification of protein group coherence and pathway assignment using functional association.
Chitale M, Palakodety S, Kihara D. BMC Bioinformatics. 2011 Sep 19;12:373. doi: 10.1186/1471-2105-12-373. PMID: 21929787.
Structure- and sequence-based function prediction for non-homologous proteins.
Sael L, Chitale M, Kihara D. J Struct Funct Genomics. 2012 Jun;13(2):111-23. doi: 10.1007/s10969-012-9126-6. Epub 2012 Jan 22. PMID: 22270458.
Revisiting the variation of clustering coefficient of biological networks suggests new modular structure.
Hao D, Ren C, Li C. BMC Syst Biol. 2012 May 1;6:34. doi: 10.1186/1752-0509-6-34. PMID: 22548803.
Computational identification of protein-protein interactions in model plant proteomes.
Ding Z, Kihara D. Sci Rep. 2019 Jun 19;9(1):8740. doi: 10.1038/s41598-019-45072-8. PMID: 31217453.
A network-based gene-weighting approach for pathway analysis.
Fang Z, Tian W, Ji H. Cell Res. 2012 Mar;22(3):565-80. doi: 10.1038/cr.2011.149. Epub 2011 Sep 6. PMID: 21894192.
