Genome Biol. 2009;10(9):R97.
doi: 10.1186/gb-2009-10-9-r97. Epub 2009 Sep 16.

Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function

James C Costello et al. Genome Biol. 2009.

Abstract

Background: Discovering the functions of all genes is a central goal of contemporary biomedical research. Despite considerable effort, we are still far from achieving this goal in any metazoan organism. Collectively, the growing body of high-throughput functional genomics data provides evidence of gene function, but remains difficult to interpret.

Results: We constructed the first network of functional relationships for Drosophila melanogaster by integrating most of the available, comprehensive sets of genetic interaction, protein-protein interaction, and microarray expression data. The complete integrated network covers 85% of the currently known genes, which we refined to a high confidence network that includes 20,000 functional relationships among 5,021 genes. An analysis of the network revealed a remarkable concordance with prior knowledge. Using the network, we were able to infer a set of high-confidence Gene Ontology biological process annotations on 483 of the roughly 5,000 previously unannotated genes. We also show that this approach is a means of inferring annotations on a class of genes that cannot be annotated based solely on sequence similarity. Lastly, we demonstrate the utility of the network through reanalyzing gene expression data to both discover clusters of coregulated genes and compile a list of candidate genes related to specific biological processes.

Conclusions: Here we present the first genome-wide functional gene network in D. melanogaster. The network enables the exploration, mining, and reanalysis of experimental data, as well as the interpretation of new data. The inferred annotations provide testable hypotheses of previously uncharacterized genes.

Figures

Figure 1
Significant GO:BP terms across datasets. Visualization of how well a dataset connects genes annotated with the same GO:BP term. The dataset names are listed on the left (see Table 1 for citations) and GO:BP terms are listed across the top. All datasets shown are used in the weighted sum (WS) integration. The color scale from black to red represents the least significant to the most significant GO:BP terms within a dataset, as measured through statistically significant coherence (see the Materials and methods section). Both GO:BP terms and datasets were hierarchically clustered and visualized using TM4 MEV [112]. The colored blocks on the top of the figure highlight similar GO:BP terms selected to show different patterns of significance across the datasets. Marked in brown are oxidative metabolism GO:BP terms, which are significant in most MA datasets but absent from the genetic interaction and protein interaction datasets. Marked in green are cell cycle GO:BP terms, which are well represented across most datasets. Marked in yellow are development and neurogenesis GO:BP terms, which are overrepresented in the Magalhaes et al. [59] dataset (a microarray experiment on axon guidance). Marked in purple are immune response related GO:BP terms, which are well represented in the DeGregorio et al. [57] and Wertheim et al. [58] datasets, both of which assayed gene expression in the immune response.
Figure 2
Log-likelihood score calculated for a microarray dataset. The log-likelihood score (LLS) compared to the significant correlation coefficients for the Arbeitman et al. [61] microarray dataset. Statistically significant correlation coefficients are rank ordered and separated into bins of 1,000 gene pairs. For example, the right-most black dot represents the top 1,000 ranked gene pairs by correlation coefficient. The black dots are positively correlated gene pairs, while the red circles are the absolute value of the negatively correlated expression profiles. The blue line is the polynomial model fit to the data and used to transform all correlation coefficients to LLSs.
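The binning step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the input layout, and the exact score are assumptions. In particular, the score used here (log of the within-bin odds that a gene pair shares an annotation, divided by the prior odds) is a commonly used form of log-likelihood score and is not defined in the text above.

```python
import math


def bin_log_likelihood_scores(pairs, prior_linked, bin_size=1000):
    """Rank gene pairs by correlation and score each bin.

    pairs: list of (abs_correlation, is_linked) tuples, where is_linked says
        whether the two genes share a reference annotation (hypothetical input).
    prior_linked: assumed prior probability that a random gene pair is linked.
    Returns a list of (mean_correlation, lls) tuples, one per bin, ordered from
    the highest-ranked bin to the lowest.
    """
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    prior_odds = prior_linked / (1.0 - prior_linked)
    results = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        linked = sum(1 for _, is_linked in chunk if is_linked)
        unlinked = len(chunk) - linked
        if linked == 0 or unlinked == 0:
            continue  # the odds ratio is undefined for this bin
        mean_corr = sum(c for c, _ in chunk) / len(chunk)
        # Assumed score: ln(posterior odds / prior odds).
        lls = math.log((linked / unlinked) / prior_odds)
        results.append((mean_corr, lls))
    return results
```

The resulting (mean correlation, score) points are what a polynomial model would then be fit to, for example with `numpy.polyfit`, so that any correlation coefficient can be mapped to a score.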
Figure 3
Average KEGG pathway coherence for integration evaluation. The average coherence of 25 KEGG pathways over different weighted sum (WS) integrations at increasing network sizes (number of edges). The dots represent the actual measured values averaged over 25 KEGG pathways, while the lines represent the difference between the actual measured values and random coherence at an equivalent network size. The coherence is measured over networks of increasing size up to one million gene pairs. The grey dashed lines mark the network sizes of 20 K and 200 K, which are the points where the slope (gain in coherence) flattens.
Figure 4
Coherence of types of data and datasets on individual KEGG pathways. Examples of how types of data and individual datasets compare to the fully integrated network as measured through coherence of KEGG pathways [62]. The average coherence of a given dataset is calculated for a set of genes defined by a KEGG pathway at increasing network sizes up to one million edges. (a) The average coherence over 63 tested KEGG pathways. The full integration of genetic interactions, protein interactions, and microarray data performs best compared to all other data sources and individual datasets. (b) A specific example, the 'purine metabolism' KEGG pathway, where the fully integrated network performs better than all individual datasets. (c) Ribosomal constituents are highly coherent in the microarray data, with many individual microarray datasets performing well. In this instance, the microarray data alone, without the genetic interactions and protein interactions, performs better than the fully integrated network. (d) An example where the genetic interactions and protein interactions contribute nearly all of the coherent relationships for the 'Hedgehog signaling' KEGG pathway. (e) An example where the integration method performs worse than several individual microarray datasets for the 'phenylpropanoid biosynthesis' KEGG pathway. See Table 1 for citations for the datasets.
Figure 5
Composition of edges in the integrated networks. Relative contribution of the different types of data to the integrated network of (a) formula image and (b) formula image. The teal color represents edges that are drawn solely on microarray data. Dark blue represents edges drawn from genetic interactions only and green from protein interactions only. Orange represents edges drawn from both protein interactions and microarray data. Edges drawn from both genetic interactions and microarray data are in red. Purple represents edges supported by both genetic interactions and protein interactions. Lastly, the light blue represents edges supported by genetic interactions, protein interactions, and microarray data. The colors correspond to the edges in Figure 6.
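The breakdown in this caption amounts to counting edges by the combination of data types that support them. A minimal sketch, assuming each edge carries a set of evidence labels; the labels "MA", "GI", and "PI" and the function name are hypothetical and chosen only for illustration:

```python
from collections import Counter


def edge_composition(edge_evidence):
    """Return the fraction of edges supported by each combination of data types.

    edge_evidence: dict mapping an edge (gene_a, gene_b) to an iterable of
        evidence labels, e.g. {"MA"} or {"GI", "PI"} (hypothetical labels for
        microarray, genetic interaction, and protein interaction data).
    """
    counts = Counter(frozenset(types) for types in edge_evidence.values())
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


if __name__ == "__main__":
    toy_edges = {
        ("g1", "g2"): {"MA"},
        ("g1", "g3"): {"MA"},
        ("g2", "g3"): {"GI", "PI"},
        ("g3", "g4"): {"GI", "PI", "MA"},
    }
    for combo, fraction in edge_composition(toy_edges).items():
        print(sorted(combo), fraction)
```

Using a `frozenset` as the key keeps "GI and PI" distinct from "GI only", which matches the seven mutually exclusive categories in the caption.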
Figure 6
formula image integrated network. Screenshot of formula image visualized in Cytoscape [66]. The edge colors correspond to Figure 5, where, for example, the teal edges are built from only microarray data and the red edges are built from genetic interaction and microarray data.
Figure 7
Precision/recall of GO:BP predictions. Precision and recall plots evaluating GO:BP predictions on unannotated D. melanogaster genes using the MRF method. The black color reflects predictions made from a network size of 20 K and the red color reflects predictions made from a network size of 200 K. For the tenfold cross-validation, (a) precision and (b) recall are shown in relation to the prediction probability (tp). Both precision and recall were measured in relation to all GO:BP predictions and also in relation to the gene (see Materials and methods section for distinction).
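To make the two ways of scoring concrete, the sketch below computes precision and recall once over individual (gene, term) predictions and once per gene. The per-gene definition used here (a gene counts as correct if at least one predicted term matches a held-out term) is an assumption for illustration only; the paper's actual distinction is given in its Materials and methods section, which is not reproduced here. Function names and the input layout are likewise hypothetical.

```python
def term_level_precision_recall(predicted, held_out):
    """Precision and recall over individual (gene, term) pairs.

    predicted, held_out: dicts mapping gene -> set of GO:BP term identifiers.
    """
    genes = set(predicted) | set(held_out)
    true_pos = sum(
        len(predicted.get(g, set()) & held_out.get(g, set())) for g in genes
    )
    n_pred = sum(len(terms) for terms in predicted.values())
    n_true = sum(len(terms) for terms in held_out.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_true if n_true else 0.0
    return precision, recall


def gene_level_precision_recall(predicted, held_out):
    """Precision and recall where each gene counts once.

    Assumed definition: a gene is a hit if any predicted term matches one of
    its held-out terms.
    """
    with_pred = [g for g, terms in predicted.items() if terms]
    with_truth = [g for g, terms in held_out.items() if terms]
    hits = [g for g in with_pred if predicted[g] & held_out.get(g, set())]
    precision = len(hits) / len(with_pred) if with_pred else 0.0
    recall = len(hits) / len(with_truth) if with_truth else 0.0
    return precision, recall
```

In a cross-validation setting, these functions would be applied to each held-out fold after keeping only predictions whose probability meets the chosen threshold tp.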
Figure 8
Semantic similarity and GO:BP predictions. Series of plots relating the semantic similarity (SS) for tenfold cross-validation to establishing a threshold for the prediction probability, tp. (a) An example illustrating the SS calculation. The nodes represent GO:BP terms, where the topmost node is the root. The red edges are 'is-a' and the blue, dashed edges are 'part-of' relationships in the ontology. Green nodes represent terms that are known and held-out for one gene, while the orange nodes are examples of predicted terms for the same gene. The half orange, half green node is an example where the predicted term perfectly matches a held-out term. The light blue nodes are the ancestor terms that fall within the path to the root, but are not annotated to either of the genes in this example. The SS of (a) is measured to be 0.45 through G-SESAME [73]. (b) Also, SS = 0.45 is the median SS value when measured over all reported and annotated genetic interactions. With respect to the GO:BP predictions, SS was measured by comparing the set of predicted terms to the set of held-out terms. (c,d) The black color reflects predictions made from a network size of 20 K and the red color reflects predictions made from a network size of 200 K. (c) The proportion of genes at a given threshold tp that show a SS measure of > 0.45. (d) The number of predictions made for both integrated networks, formula image and formula image. The top plot in (d) shows the total number of genes with at least one prediction in relation to tp and the bottom bar graph shows the average number of GO:BP terms predicted per gene at a given tp.
Figure 9
Comparing precision/recall for different data sources. An example of precision and recall calculated on the tenfold cross-validation where the prediction probability is tp ≥ 0.5. The colors represent three different networks, all with 20 K edges. Blue represents the network built from only microarray data, red represents the network built from only genetic interactions and protein interactions, and green represents the fully integrated network using genetic interactions, protein interactions, and microarray data. The whiskers show the standard deviation of the precision and recall over the tenfold cross-validation. The squares are the precision and recall measures with respect to the GO:BP terms, while the circles are precision and recall as measured for genes (see Materials and methods section for distinction). Predictions of random GO:BP terms are made and the precision and recall are shown as the squares and circles with a plus in the middle.
Figure 10
Network analysis in coordination with microarray data. Analysis combining the integrated Drosophila gene network and microarray data from Teleman et al. [79]. (a) The network represents the differentially expressed genes in starved versus fed larval muscle tissue that could also be found in formula image. Several examples of categories of genes listed in Teleman et al. are highlighted: cuticle, cellular respiration (Cell. Resp.), signal recognition particle (SRP), mitochondrial ribosomal proteins (mRP), ribosomal proteins (RP), and tRNA synthetases (Aats). The clustering of genes is a result of the integrated network and was done irrespective of the gene expression data from Teleman et al. (b) The subnetwork is the network built from a seeded set of SRP-related genes as defined by Teleman et al. and derived from formula image (see Materials and methods section for seeded network construction). Gene expression ratios reflect wild-type larval muscle tissue upon starvation over wild-type larval muscle tissue under normal feeding conditions, where green represents genes down-regulated upon starvation and red genes up-regulated upon starvation. All nodes with a dark outline are differentially expressed (DE) genes as defined in Teleman et al. The diamond nodes are the seed genes, the circle nodes are genes reported as DE in Teleman et al. but not used as seed genes, and the hexagon nodes are genes not reported as DE by Teleman et al. The genes in the network in (b) were then treated as a gene set and used as input to GSEA [81]. (c) The enrichment plot for all genes in the network in (b). Additionally, we performed a GSEA analysis on the genes in the network in (b) that did not include the seed genes (which corresponds to the set of genes that are circle and hexagon-shaped). (d) The enrichment plot for this set of genes, showing that the network places together similarly regulated genes that are still significantly enriched even when the set of genes defined in Teleman et al. was excluded. See Figure S3 at [55] for more detail on the global performance of gene sets. The gene set representing (d) corresponds to the purple line in Figure S3a at [55].
