Bioinformatics. 2010 Jul 15;26(14):1759-65.
doi: 10.1093/bioinformatics/btq262. Epub 2010 May 27.

Fast integration of heterogeneous data sources for predicting gene function with limited annotation

Sara Mostafavi et al. Bioinformatics.

Abstract

Motivation: Many algorithms that integrate multiple functional association networks for predicting gene function construct a composite network as a weighted sum of the individual networks and then use the composite network to predict gene function. The weight assigned to an individual network represents the usefulness of that network in predicting a given gene function. However, because many categories of gene function have a small number of annotations, the process of assigning these network weights is prone to overfitting.
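
To make the weighted-sum construction concrete, the following is a minimal sketch, assuming each functional association network is stored as a dense, symmetric gene-by-gene matrix (a list of lists). The function name `composite_network`, the toy matrices and the weight values are illustrative assumptions and do not come from the paper.

```python
def composite_network(networks, weights):
    """Return the weighted sum of equally sized square association matrices."""
    if len(networks) != len(weights):
        raise ValueError("need exactly one weight per network")
    n = len(networks[0])
    composite = [[0.0] * n for _ in range(n)]
    for net, w in zip(networks, weights):
        for i in range(n):
            for j in range(n):
                composite[i][j] += w * net[i][j]
    return composite


# Two toy networks over three genes; all values are made up for illustration.
coexpression = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.25],
    [0.0, 0.25, 0.0],
]
interaction = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

# Hypothetical weights: a larger weight marks a network assumed to be more useful.
weights = [0.5, 0.25]
combined = composite_network([coexpression, interaction], weights)
```

In this toy case the edge between genes 0 and 1 is supported by both networks, so its composite strength adds the two weighted contributions, while the edge between genes 1 and 2 is supported by the co-expression network alone.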

Results: Here, we address this problem by proposing a novel approach to combining multiple functional association networks. In particular, we present a method where network weights are simultaneously optimized on sets of related function categories. The method is simpler and faster than existing approaches. Further, we show that it produces composite networks with improved function prediction accuracy using five example species (yeast, mouse, fly, Escherichia coli and human).
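
The abstract does not spell out the objective being optimized, so the following is only a sketch of one way a single weight vector could be fitted to several categories at once. It assumes 0/1 label vectors per category, a pooled target matrix formed by summing the per-category co-annotation matrices, and an unregularized least-squares fit over gene pairs. The names `shared_weights`, `upper_triangle` and `solve_linear_system`, and all toy data, are hypothetical and should not be read as the authors' implementation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: networks are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def upper_triangle(matrix):
    """Flatten the entries above the diagonal (one value per gene pair)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def shared_weights(networks, label_vectors):
    """Fit one weight per network against a target pooled over all categories."""
    n = len(networks[0])
    # Assumed pooled target: number of categories in which both genes are annotated.
    target = [[sum(y[i] * y[j] for y in label_vectors) for j in range(n)]
              for i in range(n)]
    t = upper_triangle(target)
    cols = [upper_triangle(net) for net in networks]
    # Normal equations of the least-squares problem: (X^T X) w = X^T t.
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * q for p, q in zip(ci, t)) for ci in cols]
    return solve_linear_system(gram, rhs)


# Toy data: four genes, two networks, two related categories (all made up).
network_a = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
network_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
category_1 = [1, 1, 0, 0]  # genes 0 and 1 annotated
category_2 = [0, 1, 1, 0]  # genes 1 and 2 annotated

shared = shared_weights([network_a, network_b], [category_1, category_2])
```

Because every category contributes to the same target, a category with very few annotations no longer gets its own separately fitted weights, which is the overfitting risk the abstract describes. In this toy case `network_a` links exactly the co-annotated gene pairs and receives the full weight, while `network_b` receives none.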

Availability: Networks and code are available from: http://morrislab.med.utoronto.ca/sara/SW

Figures

Fig. 1.
(a) Comparison of performance of LASSO, elastic net (ElasticNet), unregularized linear regression (Unregularized), ridge with uniform prior [Ridge (Uniform)], ridge with mean prior [Ridge (mean)], SW and a network combination with uniform weights (Uniform) in predicting BP categories with [3–10] (n = 635), [11–30] (n = 305), [31–100] (n = 191), [101–300] (n = 57) and [3–300] (n = 1188) annotations. Error bars show one standard error. (b) Mean precision of combined and individual data sources (separated by publications) in predicting BP categories. When applicable (in the case of protein and genetic interaction) we combined all networks derived from the same publication (e.g. direct and correlation network). The combined network was constructed using SW.
Fig. 2.
Performance of SW, TSS and correlation in predicting gene function in yeast according to BP categories.
Fig. 3.
Each colored bar represents the average weight assigned to a network while predicting 1188 gene functions. Networks are divided into four types: (i) co-localization (network 1), (ii) gene expression (networks 2–7), (iii) protein interaction (networks 8–25) and (iv) genetic interaction (networks 26–44).
Fig. 4.
Comparison of performance of unregularized linear regression (Unreg), SW and a fixed uniform combination of networks in predicting gene function in fly (a and e), mouse (b and f), human (c and g) and E. coli (d and h). The bars show average performance in BP categories with [3–10] (n = 1101 for fly, 952 for mouse, 1188 for human and 528 for E. coli), [11–30] (n = 668 for fly, 435 for mouse, 510 for human and 177 for E. coli), [31–100] (n = 426 for fly, 239 for mouse, 254 for human and 104 for E. coli) and [3–100] (overall) annotations. Error bars show the standard error. Asterisks indicate a significant difference in overall performance ([3–100] category size range) using a paired Wilcoxon signed-rank test with a Bonferroni correction: a double asterisk indicates that SW performs significantly better than both of the other methods; a single asterisk indicates that the difference was significant only between SW and unregularized linear regression.
