Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8348-53.
doi: 10.1073/pnas.0832373100. Epub 2003 Jun 25.

A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae)

Olga G Troyanskaya et al. Proc Natl Acad Sci U S A. 2003.

Abstract

Genomic sequencing is no longer a novelty, but gene function annotation remains a key challenge in modern biology. A variety of functional genomics experimental techniques are available, from classic methods such as affinity precipitation to advanced high-throughput techniques such as gene expression microarrays. In the future, more disparate methods will be developed, further increasing the need for integrated computational analysis of data generated by these studies. We address this problem with MAGIC (Multisource Association of Genes by Integration of Clusters), a general framework that uses formal Bayesian reasoning to integrate heterogeneous types of high-throughput biological data (such as large-scale two-hybrid screens and multiple microarray analyses) for accurate gene function prediction. The system formally incorporates expert knowledge about relative accuracies of data sources to combine them within a normative framework. MAGIC provides a belief level with its output that allows the user to vary the stringency of predictions. We applied MAGIC to Saccharomyces cerevisiae genetic and physical interactions, microarray, and transcription factor binding sites data and assessed the biological relevance of gene groupings using Gene Ontology annotations produced by the Saccharomyces Genome Database. We found that by creating functional groupings based on heterogeneous data types, MAGIC improved accuracy of the groupings compared with microarray analysis alone. We describe several of the biological gene groupings identified.

Figures

Fig. 1.
General architecture of the MAGIC Bayesian network. A separate network is instantiated for each pair of genes by initializing the bottom-level nodes with evidence. Conditional probability tables for each connection were formally assessed from yeast genetics experts. The network contains discrete nodes and uses the clustering algorithm for belief updating, as initially proposed in ref. . The outputs of expression clustering methods are combined through a single “Coexpression” node, which allows all of the expression analysis methods' outputs for one dataset to be combined based on each method's characteristics, such as robustness to the noise level in the data or optimality for a specific data type (e.g., temporal data). The input nodes for expression-based clustering methods (K-means Clustering, Self Organizing Maps, and Hierarchical Clustering) incorporate pairwise data binned into three categories (high, medium, and low confidence) based on Pearson correlation to the cluster centroid (see supporting information). Nonexpression-based data are incorporated through binary input nodes for colocalization data, experimentally identified transcription factor binding sites, and various experimental evidence for physical or genetic associations of two proteins. The genetic and physical relationship data are divided into experimental evidence types according to the GRID database (http://biodata.mshri.on.ca/grid/servlet/HelpHtmlPages?pageID=3; see supporting information for details).
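The per-pair evidence combination described in the Fig. 1 caption can be illustrated with a much simpler model. The sketch below is not the MAGIC network: it assumes a flat naive Bayes structure with no intermediate nodes, and every number in it (the prior, the likelihood tables, and the correlation thresholds) is a hypothetical placeholder rather than an expert-assessed value from the paper. It only shows how binned coexpression evidence and binary nonexpression evidence for one gene pair could be turned into a posterior belief.

```python
# Hypothetical prior probability that two genes are functionally related.
PRIOR_RELATED = 0.05

# Hypothetical likelihoods: state -> (P(state | related), P(state | unrelated)).
# MAGIC's real conditional probability tables were elicited from experts.
LIKELIHOODS = {
    "coexpression": {
        "high": (0.50, 0.05),
        "medium": (0.30, 0.15),
        "low": (0.20, 0.80),
    },
    "two_hybrid": {True: (0.20, 0.01), False: (0.80, 0.99)},
    "colocalization": {True: (0.60, 0.20), False: (0.40, 0.80)},
}


def bin_correlation(r, high=0.8, medium=0.5):
    """Bin a Pearson correlation into three confidence categories.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if r >= high:
        return "high"
    if r >= medium:
        return "medium"
    return "low"


def posterior_related(evidence):
    """Posterior belief that a gene pair is related, given observed evidence.

    `evidence` maps a source name to its observed state. Sources are treated
    as conditionally independent given the relationship (naive Bayes).
    """
    p_rel = PRIOR_RELATED
    p_unrel = 1.0 - PRIOR_RELATED
    for source, state in evidence.items():
        like_rel, like_unrel = LIKELIHOODS[source][state]
        p_rel *= like_rel
        p_unrel *= like_unrel
    return p_rel / (p_rel + p_unrel)


# Example: one gene pair with strong support from all three sources.
pair_evidence = {
    "coexpression": bin_correlation(0.91),
    "two_hybrid": True,
    "colocalization": True,
}
print(posterior_related(pair_evidence))
```

A source with no observation for a pair can simply be left out of `evidence`, in which case it does not change the belief.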
Fig. 2.
Tradeoff between the number of true-positive (TP) and false-positive (FP) pairs for each method. (A) MAGIC increases the proportion of TP pairs in a broad high-specificity region compared with expression-based clustering methods, with MAGIC based purely on microarray data (MAGIC-microarray only), and with MAGIC based purely on nonexpression data (MAGIC-nonexpression only). (B) Comparison in the region of highest accuracy (<1,000 TP pairs). MAGIC predicts more TP pairs than its input methods for each number of FPs.
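A TP/FP tradeoff curve of the kind plotted in Fig. 2 can be traced by sweeping a stringency threshold over the belief assigned to each gene pair. The sketch below is a minimal illustration under stated assumptions: the function name `tp_fp_curve`, the toy beliefs, and the toy set of known related pairs are all hypothetical, and the paper's actual evaluation against Gene Ontology annotations is not reproduced here.

```python
def tp_fp_curve(scored_pairs, known_related):
    """Cumulative TP and FP counts as the belief threshold is lowered.

    scored_pairs: list of (pair, belief) tuples.
    known_related: set of pairs treated as true positives (a stand-in for
    a gold standard such as shared annotation).
    Returns a list of (belief_threshold, tp_count, fp_count), one entry per
    pair, ordered from the most to the least stringent threshold.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    tp = fp = 0
    curve = []
    for pair, belief in ranked:
        if pair in known_related:
            tp += 1
        else:
            fp += 1
        curve.append((belief, tp, fp))
    return curve


# Toy data: three gene pairs with hypothetical belief values.
toy_pairs = [(("A", "B"), 0.9), (("A", "C"), 0.7), (("B", "D"), 0.4)]
toy_known = {("A", "B"), ("B", "D")}
for threshold, tp, fp in tp_fp_curve(toy_pairs, toy_known):
    print(threshold, tp, fp)
```

Running the same function on the scores from two methods and plotting TP against FP gives curves that can be compared at a fixed number of false positives.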
Fig. 3.
Protein biosynthesis group identified by MAGIC, represented using GO Term Finder (http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermFinder). The color of each GO term is associated with its P value, representing the level of significance of that GO term's assignment to the cluster (see http://genome-www.stanford.edu/Saccharomyces/help/goTermFinder.html). Only known genes associated with protein biosynthesis are shown. The cluster contains 49 genes annotated to protein biosynthesis and 10 unknown genes. It also includes nine genes not directly annotated to protein biosynthesis but involved in potentially related processes: three genes involved in ribosome biogenesis and assembly (RRB1, SIK1, and CBF5), two transcription-related genes (RPA49 and RPC40), two involved in budding and sporulation (BUD28 and LSG1), and PRS1, a ribose-phosphate pyrophosphokinase involved in histidine biosynthesis.
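The P value attached to a GO term in a cluster is commonly computed as an enrichment probability: the chance of seeing at least that many annotated genes in a cluster of that size drawn at random from the genome. The sketch below assumes a hypergeometric model without multiple-testing correction, which may differ from what GO Term Finder reports. The cluster size of 68 is taken as the sum of the counts in the caption (49 + 10 + 9); the genome size and the genome-wide number of genes annotated to the term are hypothetical placeholders.

```python
from math import comb


def hypergeom_tail(genome_size, annotated_in_genome, cluster_size, annotated_in_cluster):
    """P(X >= annotated_in_cluster) under a hypergeometric model.

    X is the number of annotated genes in a cluster of `cluster_size` genes
    drawn without replacement from `genome_size` genes, of which
    `annotated_in_genome` carry the GO term.
    """
    total = comb(genome_size, cluster_size)
    upper = min(annotated_in_genome, cluster_size)
    favourable = sum(
        comb(annotated_in_genome, i)
        * comb(genome_size - annotated_in_genome, cluster_size - i)
        for i in range(annotated_in_cluster, upper + 1)
    )
    return favourable / total


# 49 of 68 cluster genes carry the term (counts from the caption);
# 6000 genes in the genome and 300 annotated genome-wide are assumptions.
p_value = hypergeom_tail(6000, 300, 68, 49)
print(p_value)
```

Exact integer arithmetic via `math.comb` keeps the sum accurate even when the resulting probability is extremely small.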
Fig. 4.
Ubiquitin-dependent protein catabolism cluster represented using GO Term Finder (http://genome-www4.stanford.edu/cgi-bin/SGD/GO/goTermFinder). The cluster contains 12 genes. In the version of SGD annotations used for evaluation in this study, nine of the proteins are annotated to ubiquitin-dependent protein catabolism, one (RAD23) is annotated to “nucleotide excision repair,” and YNL311C and YGL004C do not have a known biological process assignment. MAGIC predicted that YNL311C and YGL004C are likely involved in ubiquitin-dependent protein catabolism. In the most recent release of the annotation (February 2003), YNL311C has been annotated to this process. The other unknown ORF, YGL004C, is annotated as biological process unknown (not shown), but has been assigned the Saccharomyces Genome Database reserved name RPN14. This example illustrates the utility of MAGIC as a tool to aid gene function annotation.
