Genome Biol. 2008;9 Suppl 1(Suppl 1):S4.
doi: 10.1186/gb-2008-9-s1-s4. Epub 2008 Jun 27.

GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function

Sara Mostafavi et al. Genome Biol. 2008.

Abstract

Background: Most successful computational approaches for protein function prediction integrate multiple genomics and proteomics data sources to make inferences about the function of unknown proteins. The most accurate of these algorithms have long running times, making them unsuitable for real-time protein function prediction in large genomes. As a result, the predictions of these algorithms are stored in static databases that can easily become outdated. We propose a new algorithm, GeneMANIA, that is as accurate as the leading methods while being capable of predicting protein function in real time.

Results: We use a fast heuristic algorithm, derived from ridge regression, to integrate multiple functional association networks and predict gene function from a single process-specific network using label propagation. Our algorithm is efficient enough to be deployed on a modern webserver and is as accurate as, or more so than, the leading methods on the MouseFunc I benchmark and a new yeast function prediction benchmark; it is robust to redundant and irrelevant data and requires, on average, less than ten seconds of computation time on tasks from these benchmarks.
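To make the two steps named in the Results concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the combined network is a weighted sum of the input networks and that label propagation amounts to solving (I + D - W) f = y, where W is the combined affinity matrix, D its diagonal degree matrix, and y the label bias vector. The function names, the toy networks, and the fixed weights are hypothetical; in particular, the weights here are supplied by hand rather than derived from ridge regression.

```python
def combine_networks(networks, weights):
    """Weighted sum of same-sized symmetric affinity matrices (lists of lists)."""
    n = len(networks[0])
    combined = [[0.0] * n for _ in range(n)]
    for net, w in zip(networks, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * net[i][j]
    return combined


def propagate_labels(W, y, iterations=200):
    """Jacobi iteration for (I + D - W) f = y; assumes W has a zero diagonal."""
    n = len(W)
    degree = [sum(row) for row in W]
    f = list(y)
    for _ in range(iterations):
        # Each score is pulled toward its own bias and its neighbors' scores.
        f = [(y[i] + sum(W[i][j] * f[j] for j in range(n))) / (1.0 + degree[i])
             for i in range(n)]
    return f


# Toy example: four genes and two hypothetical association networks.
net_a = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
net_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
W = combine_networks([net_a, net_b], [0.7, 0.3])  # hand-picked weights
y = [1.0, 0.0, 0.0, -1.0]  # gene 0 positive, gene 3 negative, rest unlabeled
scores = propagate_labels(W, y)
```

In this toy chain the scores decrease from the positive gene toward the negative gene, so unlabeled genes closer to the positive example rank higher.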

Conclusion: GeneMANIA is fast enough to predict gene function on-the-fly while achieving state-of-the-art accuracy. A prototype version of a GeneMANIA-based webserver is available at http://morrislab.med.utoronto.ca/prototype.

Figures

Figure 1
Effect of label bias on ROC scores. Bars show the prediction error measured using 1 - area under the receiver operating characteristic (ROC) curve (1 - AUC) of GeneMANIA where the label bias of unlabeled genes is set to zero or average label (that is, k = 0 or k = mean label). The experiments were run on 400 Gene Ontology (GO) functional classes using 15 yeast association networks that we created from various genomics and proteomics data sources (see Materials and methods). The functional classes are grouped by specificity (defined by number of annotated genes: 3 to 10, 11 to 30, 31 to 100, 101 to 300). Error bars depict the standard error on 100 different predictions in each evaluation category.
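The two label bias settings compared in Figure 1 can be sketched as follows. This is an illustration under stated assumptions: annotated genes are encoded as +1 and known negatives as -1 (the caption does not give the encoding), unlabeled genes are marked with None, and the function name label_bias is hypothetical.

```python
def label_bias(labels, unlabeled="mean"):
    """Build the bias vector from labels in {+1, -1, None}.

    unlabeled="zero" sets k = 0; unlabeled="mean" sets k to the mean of
    the known labels.
    """
    known = [v for v in labels if v is not None]
    if unlabeled == "zero":
        k = 0.0
    elif unlabeled == "mean":
        k = sum(known) / len(known)
    else:
        raise ValueError("unlabeled must be 'zero' or 'mean'")
    return [float(v) if v is not None else k for v in labels]


# One positive, three negatives, one unlabeled gene.
example = [1, -1, -1, -1, None]
bias_mean = label_bias(example, unlabeled="mean")
bias_zero = label_bias(example, unlabeled="zero")
```

With few positives and many negatives, the mean label is negative, so under k = mean label an unlabeled gene starts closer to the negative class than it does under k = 0.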
Figure 2
Effect of network sparsification on ROC scores. For various sparsity levels of GeneMANIA and for the support vector machine (SVM), the boxplot shows the following features of the distribution of the prediction errors, as measured with 1 - area under the receiver operating characteristic (ROC) curve (1 - AUC): the median (red line), the 25th and 75th percentiles (blue box), and outliers, that is, prediction errors more than 1.5 times the interquartile range away from the median (blue stars). The evaluations are based on 3-fold cross-validation on 992 GO categories with the Zhang and coworkers [12] mouse tissue expression data as input. The GeneMANIA experiments were run by creating an association network from the mouse tissue expression data where the number of neighbors for each gene is restricted to N. For example, when the number of neighbors = 5, each gene is associated with only five other genes. The settings for the SVM experiments are as described in [12].
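The neighbor restriction described in this caption can be sketched as a top-N filter per gene. This is an assumption-laden illustration: the caption does not say how similarity is computed or whether the result is symmetrized, so the input here is a generic pairwise similarity matrix and the function top_n_neighbors is hypothetical.

```python
def top_n_neighbors(similarity, n_neighbors):
    """Keep, for each gene, only its n_neighbors most similar other genes.

    similarity: square matrix (list of lists) of pairwise scores.
    Returns {gene index: [(neighbor index, weight), ...]}, strongest first.
    """
    kept = {}
    for i, row in enumerate(similarity):
        # Exclude the gene itself, then rank the rest by similarity.
        candidates = [(j, w) for j, w in enumerate(row) if j != i]
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        kept[i] = candidates[:n_neighbors]
    return kept


# Hypothetical similarity scores for four genes.
sim = [
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.8],
    [0.5, 0.1, 0.8, 1.0],
]
sparse = top_n_neighbors(sim, 2)
```

Smaller values of N give a sparser network with fewer non-zero links, which is the trade-off between sparsity and prediction error that the figure examines.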
Figure 3
The number of CG iterations and computation time of the GeneMANIA algorithm as a function of number of genes. Left axis: the number of conjugate gradient (CG) iterations until convergence as a function of number of genes in the association networks. Right axis: computation time of GeneMANIA as a function of number of genes in the association networks. Experiments were run using ten association networks from the MouseFunc I benchmark data. The final point on the plot used the full mouse gene complement (for which data are available), and the other gene numbers were derived by taking random subsets of the full gene complement. Distribution is over 100 randomly selected Gene Ontology (GO) categories. The maximum number of CG iterations observed in any test was 20 and the maximum computation time was 15 seconds. The quadratic dependence of computation time on gene number is due to the quadratic growth in number of non-zero association links in the networks as a function of gene number (data not shown).
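The relationship between CG iterations and the number of association links can be illustrated with a small conjugate gradient solver. This is a sketch, not the GeneMANIA code: it assumes the linear system is (I + D - W) f = y, stores the network as a hypothetical edge dictionary, and uses an arbitrary tolerance. Each iteration applies the operator once, and that step costs time proportional to the number of non-zero links.

```python
def make_operator(edges):
    """edges: {(i, j): weight} with i < j. Returns v -> (I + D - W) v."""
    def apply_A(v):
        out = list(v)  # identity term
        for (i, j), w in edges.items():
            out[i] += w * (v[i] - v[j])  # Laplacian contribution at i
            out[j] += w * (v[j] - v[i])  # Laplacian contribution at j
        return out
    return apply_A


def conjugate_gradient(apply_A, b, tol=1e-8, max_iter=1000):
    """Solve A x = b for symmetric positive definite A; return (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the starting point x = 0
    p = list(r)
    rs_old = sum(v * v for v in r)
    iterations = 0
    while iterations < max_iter and rs_old ** 0.5 > tol:
        Ap = apply_A(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
    return x, iterations


# Toy chain of four genes with hypothetical link weights.
edges = {(0, 1): 0.7, (1, 2): 0.7, (2, 3): 0.3}
y = [1.0, 0.0, 0.0, -1.0]
A = make_operator(edges)
f, n_iter = conjugate_gradient(A, y)
```

Because the identity term keeps the system well conditioned, the iteration count can stay small even as the network grows, while the per-iteration cost tracks the number of links.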
Figure 4
Prediction performance of GeneMANIA on the MouseFunc I test benchmark. Prediction performance of the first and second submissions to MouseFunc I (GeneMANIAEntry-1 and GeneMANIAEntry-2, respectively) as well as the version of the GeneMANIA algorithm we have implemented on the GeneMANIA webserver (GeneMANIAWS) and the best achieved performance on the MouseFunc I test benchmark. Prediction performance is indicated by the mean 1 - area under the receiver operating characteristic curve (1 - AUC) in each class; error bars show one standard error of the mean. Stars mark the evaluation classes in which a GeneMANIA entry achieved the lowest error on the test benchmark.
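The error measure used throughout these figures, 1 - AUC, can be computed directly from its pairwise definition: AUC is the fraction of (positive, negative) gene pairs in which the positive gene receives the higher score. The sketch below assumes labels of 1 and 0 and counts ties as half; the function name one_minus_auc is hypothetical, and the benchmark's own evaluation code may differ in details such as tie handling.

```python
def one_minus_auc(scores, labels):
    """Prediction error as 1 - AUC; labels are 1 (positive) or 0 (negative)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return 1.0 - wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
error = one_minus_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An error of 0 means every positive gene outranks every negative gene, and 0.5 corresponds to random ranking.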
Figure 5
Prediction performance of GeneMANIA on the MouseFunc I novel benchmark. Prediction performance of the first and second submissions to MouseFunc I (GeneMANIAEntry-1 and GeneMANIAEntry-2, respectively) as well as the version of the GeneMANIA algorithm we have implemented on the GeneMANIA webserver (GeneMANIAWS) and the best achieved performance on the MouseFunc I novel benchmark. Stars mark the evaluation categories in which GeneMANIA entries had the best achieved performance on the test benchmark. Bars show mean error (measured as 1 - area under the receiver operating characteristic curve [1 - AUC]), error bars indicate one standard error.
Figure 6
Prediction performance of GeneMANIA on the extended yeast benchmark. Prediction performance of GeneMANIA with five yeast networks with equal weight prior (GM-5WS), 15 yeast networks with equal weight prior (GM-15WS), GeneMANIA with the bioPIXIE network (GM-biPx), and the TSS algorithm with five yeast networks (TSS-5). Bars show mean error (measured as 1 - area under the receiver operating characteristic curve [1 - AUC]) on 12 evaluation classes based on ontologies (biological process [BP] and cellular component [CC]) and specificity levels of 3 to 10, 11 to 30, 31 to 100, and 101 to 300 annotations. Error bars indicate the standard error in the mean.
Figure 7
Computation time and prediction accuracy. Bars to the left of the solid vertical lines show fold increase in error relative to the mean error (as measured by 1 - area under the receiver operating characteristic curve [1 - AUC]) for the evaluation classes defined in Figure 6 of GeneMANIA with 15 yeast networks using the branch-specific weight priors in Figure 6. Bars to the right of the solid vertical lines show mean CPU time required to run each algorithm. The performance of GeneMANIA (GM), TSS, and bioPIXIE (biPx) is directly compared on the same input. The bars marked as GM-biPx and PGS-biPx depict the prediction performance of the GeneMANIA label propagation algorithm and the bioPIXIE probabilistic graph-based search algorithm, respectively, using the bioPIXIE network as input. TSS and GeneMANIA are compared using the five yeast network benchmark.
Figure 8
Prediction performance of GeneMANIA in the presence of irrelevant and redundant networks. Cumulative distribution of 1 - area under the receiver operating characteristic curve (1 - AUC) scores on 300 yeast Gene Ontology (GO) categories using GeneMANIA optimized weights and equal weights in the presence of redundant and irrelevant networks.

References

    1. Stuart JM, Segal E, Koller D, Kim SK. A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003;302:249–255.
    2. Zhang LV, King OD, Wong SL, Goldberg DS, Tong AH, Lesage G, Andrews B, Bussey H, Boone C, Roth FP. Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network. J Biol. 2005;4:6.
    3. Giaever G, Shoemaker DD, Jones TW, Liang H, Winzeler EA, Astromoff A, Davis RW. Genomic profiling of drug sensitivities via induced haploinsufficiency. Nat Genet. 1999;21:278–283.
    4. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000;403:623–627.
    5. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002;417:399–403.
