BMC Bioinformatics. 2010 May 19;11:265.
doi: 10.1186/1471-2105-11-265.

Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP

Troy Hawkins et al. BMC Bioinformatics.

Abstract

Background: A new paradigm of biological investigation takes advantage of technologies that produce large high throughput datasets, including genome sequences, interactions of proteins, and gene expression. The ability of biologists to analyze and interpret such data relies on functional annotation of the included proteins, but even in highly characterized organisms many proteins can lack the functional evidence necessary to infer their biological relevance.

Results: Here we have applied high confidence function predictions from our automated prediction system, PFP, to three genome sequences: Escherichia coli, Saccharomyces cerevisiae, and Plasmodium falciparum (malaria). PFP increases the fraction of annotated genes to over 90% for all three genomes. Using this large coverage of function annotation, we introduce functional similarity networks, which represent the functional space of the proteomes. Four different functional similarity networks are constructed for each proteome: one for similarity in each single Gene Ontology (GO) category (Biological Process, Cellular Component, and Molecular Function), and another for overall similarity with the funSim score. The functional similarity networks are shown to have higher modularity than the protein-protein interaction network. Moreover, the funSim score network is distinct from the single GO-score networks in showing a higher clustering degree exponent value, and thus has a higher tendency to be hierarchical. In addition, examining function assignments to the protein-protein interaction network and local regions of genomes has identified numerous cases where subnetworks or local regions have functionally coherent proteins. These results will help in interpreting interactions of proteins and gene orders in a genome. Several examples of both analyses are highlighted.

Conclusion: The analyses demonstrate that applying high confidence predictions from PFP can have a significant impact on a researcher's ability to interpret the immense biological data that are being generated today. The newly introduced functional similarity networks of the three organisms show different network properties compared with the protein-protein interaction networks.

Figures

Figure 1
Protein-protein interaction networks used in this work. Networks are visualized by Cytoscape [58]: A, E. coli; B, S. cerevisiae; C, P. falciparum.
Figure 2
Enrichment of function annotation in protein-protein interaction networks. The networks show a total of 8565, 1376, and 2542 interactions for E. coli, S. cerevisiae, and P. falciparum, respectively. The fractions of interactions where neither protein is annotated (none), where one of the two proteins is annotated (one), and where both proteins are annotated (both) are shown for the original annotation in the GOA database and after adding high confidence function prediction by PFP. Enrichment in the three GO categories (BP, MF, and CC) is shown separately.
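The none/one/both breakdown described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `annotation_coverage` and the toy protein identifiers are hypothetical, and an interaction is assumed to be a simple pair of protein IDs.

```python
from collections import Counter

def annotation_coverage(interactions, annotated):
    """Fraction of interactions with 0, 1, or 2 annotated partners.

    interactions: list of (protein_a, protein_b) pairs.
    annotated: set of protein IDs that carry at least one annotation.
    """
    if not interactions:
        raise ValueError("at least one interaction is required")
    labels = ("none", "one", "both")
    counts = Counter()
    for a, b in interactions:
        # True counts as 1, so this is the number of annotated partners.
        n_annotated = (a in annotated) + (b in annotated)
        counts[labels[n_annotated]] += 1
    total = len(interactions)
    return {label: counts[label] / total for label in labels}

# Hypothetical toy data: the annotated set grows after adding predictions.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
before = annotation_coverage(interactions, {"P1"})
after = annotation_coverage(interactions, {"P1", "P2", "P3"})
```

Running the same function once per GO category, with the annotated set restricted to that category, would give the per-category panels the caption mentions.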
Figure 3
Functional similarity networks. A, E. coli; B, S. cerevisiae; C, P. falciparum. From left to right: the BP-score, CC-score, MF-score, and funSim networks. Nodes represent individual proteins and edges represent a category GO score or funSim score of ≥ 0.95. Individual clusters in the functional similarity networks are highlighted in color to show the functional category of proteins. For the BP-score networks (left panels), green nodes represent proteins involved in transcription (GO:0006350 and its child nodes), blue nodes represent proteins involved in transport (GO:0006810), purple nodes represent proteins involved in pathogenesis (GO:0009405) (for P. falciparum, Fig. 3C) or signaling (GO:0007165) (for E. coli and yeast, Fig. 3A,B), and red nodes represent proteins involved in protein modification (GO:0043687). For the CC-score networks (the second panels from the left in Fig. 3), yellow nodes represent proteins localized in the membrane (GO:0016020), orange nodes represent proteins localized in the ribosome (GO:0005840), and blue nodes represent proteins localized in the cell wall (GO:0005618) (for E. coli) or in the nucleus (GO:0005634) (for malaria and yeast). For the MF-score networks (the second panels from the right), light green nodes represent proteins which bind ATP (GO:0005524), pink nodes represent proteins which bind rRNA (GO:0019843), light purple nodes represent proteins which bind ions (GO:0043167), and olive nodes represent proteins exhibiting transporter activity (GO:0005215). For the funSim networks (the panels on the right), burgundy nodes represent proteins which bind ATP (GO:0005524), blue nodes represent proteins localized in the ribosome (GO:0005840), and light green nodes represent proteins exhibiting transmembrane receptor activity (GO:0004888).
Figure 4
Degree distribution of the functional similarity networks. A similarity threshold value of 0.95 is used to connect edges. The X-axis is the number of interactions, k (the degree), and the Y-axis is the probability of proteins with a certain number of interactions, P(k). Both axes are log scaled. The dotted line is fit to the data to compute the degree exponent, γ, in the power-law degree distribution $P(k) \sim k^{-\gamma}$. A, E. coli; B, S. cerevisiae; C, P. falciparum. From left to right: the BP-score, CC-score, MF-score, and funSim score networks. The degree exponent values are shown in Table 3. The $R^2$ values of the lines fitted to each distribution are as follows. E. coli: 0.579 (BP), 0.144 (CC), 0.472 (MF), 0.872 (funSim); S. cerevisiae: 0.585 (BP), 0.481 (CC), 0.505 (MF), 0.798 (funSim); P. falciparum: 0.466 (BP), 0.345 (CC), 0.068 (MF), 0.825 (funSim).
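One common way to estimate a degree exponent and its R² from a degree list is an ordinary least-squares line through the log-log degree distribution. The sketch below assumes that approach; the fitting procedure actually used for this figure is not stated here, and the function name `fit_degree_exponent` and the toy degree list are hypothetical.

```python
import math

def fit_degree_exponent(degrees):
    """Fit log10 P(k) = -gamma * log10 k + c by least squares.

    degrees: iterable of node degrees (zero-degree nodes are ignored).
    Returns (gamma, r_squared).
    """
    counts = {}
    for k in degrees:
        if k > 0:
            counts[k] = counts.get(k, 0) + 1
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    n = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / n) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # For a simple linear fit, R^2 is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return -slope, r_squared

# Hypothetical toy network whose degree counts follow k^-2 exactly:
# 16 nodes of degree 1, 4 nodes of degree 2, 1 node of degree 4.
toy_degrees = [1] * 16 + [2] * 4 + [4]
gamma, r_squared = fit_degree_exponent(toy_degrees)
```

A low R², as reported for several of the single-category networks in this caption, indicates that a straight line describes the log-log distribution poorly, so the fitted γ should be read with caution in those cases.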
Figure 5
Hierarchical modularity of networks. C(k) is plotted relative to k. A, the PPI networks; B, the funSim networks. The dotted lines correspond to $C(k) \sim k^{-1}$.
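C(k) is conventionally the mean local clustering coefficient over all nodes of degree k. The sketch below assumes that standard definition and an undirected edge list; the function name `clustering_by_degree` and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def clustering_by_degree(edges):
    """Mean local clustering coefficient C(k) for each degree k >= 2.

    edges: iterable of undirected (node_a, node_b) pairs; self-loops are ignored.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    neighbors = dict(neighbors)  # freeze: every neighbor is already a key
    per_degree = defaultdict(list)
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            continue  # the coefficient is undefined for degree 0 or 1
        # Count edges that exist among this node's neighbors.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        per_degree[k].append(2 * links / (k * (k - 1)))
    return {k: sum(vals) / len(vals) for k, vals in sorted(per_degree.items())}

# Hypothetical toy graph: a hub joined to two triangles.
toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"),
             ("A", "B"), ("C", "D")]
c_of_k = clustering_by_degree(toy_edges)
```

Plotting the returned values against k on log-log axes and comparing the slope with −1 corresponds to the comparison against the dotted line in this figure.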
Figure 6
The clustering degree exponent value of the functional similarity networks relative to the number of edges in the networks. A, E. coli; B, S. cerevisiae; C, P. falciparum.
Figure 7
The increase in the average score similarity of subnetworks of P. falciparum. The scores before and after adding function predictions by PFP to the 152 subnetworks are compared. A, BP-score; B, CC-score; C, MF-score; and D, funSim score.
Figure 8
Protein-protein interaction subnetworks described in Table 3. Proteins in the center of the subnetworks are: A, Q8I1Q4; B, Q8I206; C, Q8I255; D, Q8I562; E, Q8I5X5; F, Q8IKV2. Previously annotated proteins are colored red and proteins with functions predicted by PFP are colored yellow. Circular edges are self-interactions detected for the proteins. See Table 5 for function annotations of the proteins in these subnetworks.
Figure 9
The accumulated fraction of genomic windows in E. coli that satisfy the similarity threshold values. Results for the funSim score and individual GO scores are shown.
Figure 10
Variability of functional similarity in the E. coli genome. Functional similarity (Y-axis) here is an all-by-all category GO score or funSim average among the genes included in the local window. The X-axis is the genome position of the left-hand side of the window. The red line indicates the threshold value of functional similarity we used for individual analysis of a genome window for overrepresentation of GO terms (0.7 for each category GO score average, 0.49 for funSim average). The dots denote known clusters of functionally similar genes. For the BP graph, neon green is the lac operon, pink is the trp operon, and dark blue is the his operon. For the MF graph, dark red dots are ATP synthase components (atpX). For the CC graph, dark green dots are proteins of the ribosome. The same plots for the yeast and malaria genomes are not provided since these genomes are much larger (yeast and malaria have 16 and 14 chromosomes, respectively), but all the data are available on our website.
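The windowed average described in this caption can be sketched as below. Several details are assumptions rather than facts from the text: windows are taken as a fixed number of consecutive genes, self-pairs are excluded from the all-by-all average, and the similarity function, gene labels, and names `window_similarity` and `flag_windows` are hypothetical stand-ins for a real GO score or funSim score.

```python
from itertools import combinations

def window_similarity(genes, sim, window):
    """Mean pairwise similarity for each window of consecutive genes.

    genes: list of gene records in genome order.
    sim: function taking two gene records and returning a score in [0, 1].
    window: number of consecutive genes per window (assumed, must be >= 2).
    """
    if window < 2:
        raise ValueError("window must contain at least two genes")
    scores = []
    for start in range(len(genes) - window + 1):
        pairs = list(combinations(genes[start:start + window], 2))
        scores.append(sum(sim(a, b) for a, b in pairs) / len(pairs))
    return scores

def flag_windows(scores, threshold):
    """Indices of windows whose average similarity meets the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Hypothetical toy genome: each gene is (name, function label), and the
# stand-in similarity is 1.0 for identical labels and 0.0 otherwise.
toy_genes = [("g1", "x"), ("g2", "x"), ("g3", "x"), ("g4", "y"), ("g5", "z")]
toy_sim = lambda a, b: 1.0 if a[1] == b[1] else 0.0
scores = window_similarity(toy_genes, toy_sim, window=3)
flagged = flag_windows(scores, threshold=0.7)
```

Here the 0.7 threshold mirrors the category GO score cutoff quoted in the caption; the 0.49 cutoff would be used in the same way for funSim averages.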
