BMC Syst Biol. 2012;6 Suppl 3(Suppl 3):S7.
doi: 10.1186/1752-0509-6-S3-S7. Epub 2012 Dec 17.

Revealing functionally coherent subsets using a spectral clustering and an information integration approach


Adam J Richards et al. BMC Syst Biol. 2012.

Abstract

Background: Contemporary high-throughput analyses often produce lengthy lists of genes or proteins. It is desirable to divide the genes into functionally coherent subsets for further investigation, by integrating heterogeneous information regarding the genes. Here we report a principled approach for managing and integrating multiple data sources within the framework of graph-spectrum analysis in order to identify coherent gene subsets.

Results: We investigated several approaches to integrating information from different sources that reflect distinct aspects of gene functional relationships: functional annotations of genes in the form of the Gene Ontology, co-mentioning of genes in the literature, and shared transcription factor binding sites among genes. Given a list of genes, we constructed a graph containing the genes in each information space; the graphs were then kernel-transformed so that they could be integrated; finally, functionally coherent subsets were identified using a spectral clustering algorithm. In a series of simulation experiments, known functionally coherent gene sets were mixed and then recovered using our approach.

Conclusions: The results indicate that spectral clustering approaches are capable of recovering coherent gene modules even under noisy conditions, and that information integration serves to further enhance this capability. When applied to a real-world data set, our methods revealed biologically sensible modules, and highlighted the importance of information integration. The implementation of the statistical model is provided under the GNU general public license, as an installable Python module, at: http://code.google.com/p/spectralmix.
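The pipeline summarized in the abstract (per-source graphs, kernel transformation, integration, spectral clustering) can be illustrated with a minimal sketch. This is not the spectralmix implementation: the Gaussian kernel, the unweighted averaging of affinities, the symmetric normalization, the small k-means routine, and all function names and toy distances are assumptions chosen for illustration only.

```python
import numpy as np

def gaussian_kernel(dist, sigma=1.0):
    """Map a symmetric gene-gene distance matrix to affinities in (0, 1] (assumed kernel)."""
    return np.exp(-dist ** 2 / (2.0 * sigma ** 2))

def combine(affinities):
    """Integrate information sources by averaging their affinity matrices (assumed rule)."""
    return np.mean(affinities, axis=0)

def spectral_embedding(affinity, k):
    """Project genes onto the top-k eigenvectors of D^-1/2 A D^-1/2, then row-normalize."""
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalized = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(normalized)  # eigenvalues come back in ascending order
    top = vecs[:, -k:]
    return top / np.linalg.norm(top, axis=1, keepdims=True)

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def block_distances(within, between):
    """Toy distances for six genes forming two modules of three genes each."""
    d = np.full((6, 6), between)
    d[:3, :3] = within
    d[3:, 3:] = within
    np.fill_diagonal(d, 0.0)
    return d

# Two hypothetical information sources describing the same six genes.
sources = [block_distances(0.2, 2.0), block_distances(0.5, 1.5)]
affinity = combine([gaussian_kernel(d) for d in sources])
labels = kmeans(spectral_embedding(affinity, k=2), k=2)
```

In this toy setup the first three genes and the last three genes end up in separate clusters; with real data the choice of kernel width and of how sources are weighted would need to be tuned.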


Figures

Figure 1
Conceptual diagram of analysis pipeline. In this example, information is combined to partition seven genes. (A) Input gene-gene relationships are gathered from databases or via other approaches, like experimentation. (B) A web server is used to store, organize and provide programmatic access to the data. (C) Given a gene list of interest, graphs are constructed with connections among the genes representing gene-gene information for a given information source. (D) Graph edge weights are transformed into kernel space using gene-gene distances. The affinity-weighted graphs are combined into a summarizing graph and subsequently the summarizing affinities are projected into eigen-decomposition space, where the genes are partitioned.
Figure 2
Evaluating the algorithm's discriminative abilities. Using simulations and spectral clustering as described in the methods, algorithm performance is summarized by recall (A), precision (B) and F1 scores (C) as a function of increasing k. Each bar represents an average of 20 simulations, and for each value of k the same data and cluster assignments were used to find the shown recall, precision, and F1 score. The dark gray portions of each bar are the results if cluster assignments were randomly guessed. For these simulations the organism used was S. cerevisiae, and the information source was the Gene Ontology. Standard error bars are included.
Figure 3
Literature and Gene Ontology integration. The discriminative abilities, as represented by recall, precision and F1 scores, are shown for the Gene Ontology, PubMed and combined simulations. Each subplot from left to right is a summary of 120 individual simulations for the species S. cerevisiae, M. musculus, and H. sapiens, respectively. The rows correspond to simulations run with the KEGG and MINT positive control modules. Each set of 120 simulations consisted of a mixture of individual runs in which the number of pathways ranged from 3 to 8. Standard error bars are given for each discriminative measure, and for each of the three species. Significance was tested across the data source combinations for each evaluator independently.
Figure 4
Recovering functional modules contaminated with noise. The original gene lists for KEGG and MINT (N = 22 and N = 24, respectively) were compiled and increasing levels of noise were added to each set; each resulting set was then considered a new list. Spectral clustering was run on each gene list, and statistical significance based on functional coherence was determined for all the underlying modules. Statistically significant modules were assembled, and together they made up the positively labeled genes. Shown here are the averaged results of the gene sets, with standard error bars at each level of noise.
Figure 5
Application to gene expression example. The genes of interest (458 total) were partitioned according to one or more information sources and the resulting subsets were assessed for functional significance. Each bar designates an average of p-values for all 458 genes based on a unique partitioning of the genes using one or more information sources. The publications (P), Gene Ontology (G), gene expression (E), and combinations of each are shown. The standard error bars for the averaged p-value are shown for each clustering result and the traditional level of α = 0.05 is shown for reference.
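The recall, precision and F1 scores reported in Figures 2 and 3 compare recovered clusters with the known modules. The page does not state how the scores were computed, so the sketch below assumes a pair-counting definition: a pair of genes counts as a true positive when both the known modules and the clustering place the two genes together. The function name and the toy labels are hypothetical.

```python
from itertools import combinations

def pairwise_scores(true_labels, pred_labels):
    """Pair-counting precision, recall and F1 for a clustering (assumed definition)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # pair correctly kept together
        elif same_pred:
            fp += 1  # pair clustered together but from different modules
        elif same_true:
            fn += 1  # pair from one module split across clusters
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Six genes from two known modules; one gene is assigned to the wrong cluster.
known = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
precision, recall, f1 = pairwise_scores(known, found)
```

Because the scores depend only on which genes share a cluster, they are unaffected by how cluster labels are numbered, which is convenient when comparing clustering output against reference modules.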
